Threat Hunting Part 3: Going Hunting with Machine Learning

Because I’ve been busy with proofs of concept at the end of the quarter, I’ve been on the prowl for lazy hunting ideas. Every security person’s dream is to have interesting data come to them, but is this possible? Apache Spark’s MLlib seemed like a good place to start the hunt.

I wanted to leverage Apache Spark’s MLlib combined with inbound and outbound data to bubble up anomalous traffic talking to suspicious countries.* At JASK we leverage machine learning to produce behavior-based signals, and sometimes I decide to go hunting based on our raw data stacking instead of output from something like Spark+MLlib. These notebooks are a great way of pitting one behavior-based model against another approach, checking false positive rates, and figuring out if your model is ready for prime time.

*Disclaimer - I’m a Security Engineer, not one of the ML experts on the team, but I know where to find help!

Here’s how to do it:

Step 1: Import Apache’s Machine Learning Libraries with KMeans

Step 2: Formulate what data you want to run KMeans against.

Here I’m querying the sum of the outbound bytes.

Step 3: Convert the dataset to type RDD:array[double].

The result of the query above needs to be of type Double for KMeans, so we map it to Double. The reason is that num_en is a SchemaRDD. When you call collect() on it, you get an Array[org.apache.spark.sql.Row], so num_en.collect()(0) gives you the first Row of the Array, not a number. KMeans works on vectors of doubles: a dense vector is backed by a double array representing its entry values, while a sparse vector is backed by two parallel arrays: indices and values. For example, a vector (1.0, 0.0, 3.0) can be represented in dense format as [1.0, 0.0, 3.0] or in sparse format as (3, [0, 2], [1.0, 3.0]), where 3 is the size of the vector.
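
To make the two vector formats concrete, here is a minimal plain-Python sketch (not the notebook’s Spark code). The helper names to_sparse and to_dense and the sample row are made up for illustration; the sparse triple follows the (size, indices, values) layout described above.

```python
def to_sparse(dense):
    """Return (size, indices, values), keeping only the non-zero entries."""
    indices = [i for i, v in enumerate(dense) if v != 0.0]
    values = [dense[i] for i in indices]
    return len(dense), indices, values


def to_dense(size, indices, values):
    """Rebuild the full list of doubles from the sparse triple."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense


# Query results often arrive as ints or strings; map every entry to a double.
row = [1, "0", 3]  # hypothetical row values
dense = [float(x) for x in row]
sparse = to_sparse(dense)

print(dense)   # [1.0, 0.0, 3.0]
print(sparse)  # (3, [0, 2], [1.0, 3.0])
```

Both forms carry the same information; the sparse form only pays off when most entries are zero.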

Step 4: Define the number of clusters.

Step 5: Define the number of iterations.

Step 6: Evaluate clustering by computing within set sum of squared errors.

This is where we evaluate the model and determine if it accurately represents the data.  I’m merely implementing this to get some anomalous traffic back as a test, and I am not a KMeans expert.

The key is to minimize the Euclidean distance between the points in each group and that group’s center. The resulting quadratic error is the Within Set Sum of Squared Errors (WSSSE).
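
For readers without a Spark cluster handy, steps 4 through 6 can be sketched in plain Python. This is an illustration of the idea only, not MLlib’s implementation: the function names, the deterministic starting centers, and the sample byte counts are all assumptions made for the example.

```python
def assign(points, centers):
    """Index of the nearest center for each point (squared Euclidean distance)."""
    return [min(range(len(centers)), key=lambda c: (p - centers[c]) ** 2)
            for p in points]


def kmeans_1d(points, k, iterations):
    """Plain Lloyd's algorithm on a list of floats; returns the final centers."""
    ordered = sorted(points)
    # Deterministic start (an assumption): k values spread across the sorted data.
    centers = [ordered[i * (len(ordered) - 1) // max(k - 1, 1)] for i in range(k)]
    for _ in range(iterations):
        labels = assign(points, centers)
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster goes empty
                centers[c] = sum(members) / len(members)
    return centers


def wssse(points, centers):
    """Within Set Sum of Squared Errors: squared distance to the nearest center."""
    return sum(min((p - c) ** 2 for c in centers) for p in points)


# Hypothetical outbound byte sums per host; the last host is the outlier.
outbound_bytes = [1200.0, 1500.0, 1100.0, 1300.0, 1400.0, 9800000.0]
centers = kmeans_1d(outbound_bytes, k=2, iterations=20)
labels = assign(outbound_bytes, centers)
error = wssse(outbound_bytes, centers)

print(centers)  # [1300.0, 9800000.0]
print(labels)   # [0, 0, 0, 0, 0, 1]
print(error)    # 100000.0
```

A host that ends up alone in its own cluster, like the last one here, is the kind of anomaly worth a second look. Trying several values of k and watching where the WSSSE stops dropping sharply is a common way to pick the number of clusters.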

Step 7: Show the results.

Step 8: Save and load the model.

Step 9: Have your coworker look over your shoulder and tell you that Spark 2.0 has deprecated Apache’s MLlib!

BAAAAH!!! Defeated… except I wrote and implemented this a couple of months ago and I’m just now getting around to writing the blog post (better late than never!). The above still works. Strictly speaking, Spark 2.0 did not remove MLlib: it put the RDD-based spark.mllib API into maintenance mode (bug fixes, but no new features) in favor of the DataFrame-based spark.ml API, and the old API is still available for use. I’m not gonna let a good notebook go to waste because an ML library is on its way out. Spark may be moving on from the RDD-based MLlib, but that doesn’t change the fact that hunting based on anomalies could be the place to start, and K-Means is a helpful way to start that investigation.

Since Apache’s MLlib sounds like it’s going to be deprecated in the near future, I’m going to shift our goal of querying anomalies from MLlib to leveraging the machine learning anomaly detection within JASK Trident. This way I’ll know that by the time our user community reads this blog post, the underlying detection method will still be around.

To kick off this series of notebook paragraphs, I have a “signal” table. If you aren’t a Trident customer, you’ll have to use your own anomaly detection, such as the example implementation with Apache MLlib shown above. If you are a Trident customer, you are in luck, because we expose our anomaly detection to users via the signals table.

Step 1: Here we are querying the Trident signal table for anomaly detections. I’m storing the results of the query in an array so that I can automate the analysis later.

The return type of the above query is a collection of arrays. To work with just the ip.address and feed a single IP at a time into our secondary analysis, we need to grab each ip.address from the result.

Step 2: To solve the Array of Arrays problem, we map the first item of each inner array into an Array of Strings with the following easy paragraph.

Below you can see our result is now Array[String]. We are getting closer.
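
The same first-item mapping can be sketched in plain Python. The sample rows are made up, and the de-duplication step is an extra that is not part of the mapping described above.

```python
# Hypothetical query result: one inner array per row, IP address first.
rows = [
    ["10.1.1.15", "anomaly"],
    ["10.1.1.42", "anomaly"],
    ["10.1.1.15", "anomaly"],
]

# Map each row to its first item, as a string.
ips = [str(row[0]) for row in rows]

# Optional extra: drop repeats while preserving first-seen order.
unique_ips = list(dict.fromkeys(ips))

print(ips)         # ['10.1.1.15', '10.1.1.42', '10.1.1.15']
print(unique_ips)  # ['10.1.1.15', '10.1.1.42']
```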

Step 3: Now that I have a clean list of suspicious IP addresses, I can throw that list through any type of secondary analysis and find out what these IPs were talking about. What destination ports were they talking over? What websites were they sending data to or receiving data from? Heck, now that I have a list of suspicious IPs, I can perform an auto-analysis of them, then write them out to my firewall and put in a block rule for each one. I could throw all of those questions and actions into a paragraph and then understand WHY my anomalies were being generated.
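
As one example of such a secondary analysis, here is a plain-Python sketch that answers the destination port question for each suspicious IP. The record layout and field names (src_ip, dst_port, bytes_out) are assumptions for illustration, not Trident’s schema.

```python
from collections import defaultdict

# Hypothetical flow records; field names are assumptions, not a real schema.
flows = [
    {"src_ip": "10.1.1.15", "dst_port": 80, "bytes_out": 5000},
    {"src_ip": "10.1.1.15", "dst_port": 443, "bytes_out": 1200},
    {"src_ip": "10.1.1.15", "dst_port": 80, "bytes_out": 2500},
    {"src_ip": "10.1.1.99", "dst_port": 53, "bytes_out": 300},
]
suspicious_ips = ["10.1.1.15"]


def bytes_by_port(flows, ips):
    """Total outbound bytes per destination port, per suspicious source IP."""
    wanted = set(ips)
    summary = defaultdict(lambda: defaultdict(int))
    for flow in flows:
        if flow["src_ip"] in wanted:
            summary[flow["src_ip"]][flow["dst_port"]] += flow["bytes_out"]
    return {ip: dict(ports) for ip, ports in summary.items()}


report = bytes_by_port(flows, suspicious_ips)
print(report)  # {'10.1.1.15': {80: 7500, 443: 1200}}
```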

Here’s a quick look at the countries involved in my anomalies:

My first question will be simple: How much traffic was being sent or received between the two hosts?

My second question is based on the finding that a set of anomalies clustered around destination port 80: what website URLs were being requested?

What I found from the above query was no results for one of the hosts!!

That led me to ask more about how many bytes were transferred:

Other paragraphs could ask similar questions based on the type of anomaly that was generated. What country was the traffic destined for? What was the HOST that was requested? Was this related to data exfiltration? What were the DNS queries? What were the DNS answers? I now have a notebook for auto-analysis to determine if all those anomalies on your network were real threats or just users binge-watching World Cup Skiing (which happens to be what I am currently watching). Once you are comfortable with your auto-analysis, you are free to export it to a file or write it to HDFS and use it elsewhere, closing the feedback loop back inside the product for any future detections.



On the Hunt Part 2: Identifying Spear-Phishing Recon Activity-Collection of User Details with Ads for Spear Phishing Campaigns

A few weeks ago, I published a Base64 decoding article. The findings ranged from process ID numbers, to application and version detection, to the blatant collection of email addresses. With that in mind, today I’m going to focus on ads. Not adware, not malvertising, just ads. Ads are the massive security hole in our network and the invasive species of our personal lives. I’m focusing on ads for their operational efficacy during the Reconnaissance Phase in support of a strong spear phishing attack, inspired by the Grizzly Steppe news.

I see targeted advertising happening day in and day out, but how much personally identifiable information is being collected about users? How much of the internet has become consumed by ads? How can we tell heads from tails, or good ads from bad ads? I’m casting my net wide to catch what spear phishing attackers might later use as shrimp (bait). With my morning coffee in hand and a comfy seat on the office couch, here is the notebook we are going to start with.

Base64 Decoding Emails

This is a slightly modified version of the notebook from the Base64 Decoding blog post from last week, with a focus on decoded strings that have emails in them (found by adding a simple search for the “@” symbol in the decoded URI string).
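
The original notebook paragraph is not reproduced here, so what follows is only a rough plain-Python sketch of the idea: pull Base64-looking runs out of a URI, decode them, and keep the ones containing an “@”. The regular expression, the 16-character minimum, and the sample URI are all assumptions made for the example.

```python
import base64
import binascii
import re

# Candidate Base64 runs: 16+ characters from the standard or URL-safe alphabet.
B64_RUN = re.compile(r"[A-Za-z0-9+/_-]{16,}={0,2}")


def decoded_emails(uri):
    """Return decoded Base64 chunks from a URI that contain an '@'."""
    hits = []
    for chunk in B64_RUN.findall(uri):
        padded = chunk.rstrip("=")
        padded += "=" * (-len(padded) % 4)  # restore padding stripped in transit
        try:
            text = base64.urlsafe_b64decode(padded).decode("utf-8")
        except (binascii.Error, UnicodeDecodeError):
            continue  # not valid Base64, or not readable text
        if "@" in text:
            hits.append(text)
    return hits


# Build a hypothetical tracking URI with a made-up address encoded inside it.
token = base64.urlsafe_b64encode(b"user=alice@example.com").decode("ascii")
uri = "/track?d=" + token + "&v=2"

print(decoded_emails(uri))  # ['user=alice@example.com']
```

A plain “@” check is deliberately crude: it will also match non-email strings, so treat the hits as leads to review, not as confirmed addresses.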

I would love to show you the results, but in the interest of protecting users’ emails, I’m going to ask you to trust me: there are hits. I’m not saying buckets of emails are leaving the network, but a handful of the most clearly identifiable pieces of information ([email protected] and [email protected]) are being collected by marketing and ad agencies around the world (KR, SG, US).

What information can you glean from an email? I’m easily able to identify where all of these people work based on the email domain. With the Gmail, Yahoo, and other webmail addresses, I’m also able to identify login portals to imitate when I aim my exploit and go spear phishing. With emails in hand, we’ve proven the first step: emails are being collected by ads. Remember, this is being done at the ad agency level to supply their customers with perfectly targeted ads and to learn as much about the customer as possible to best target an ad. Only good things come from targeted ads :)

Based on the data being collected by the ad agency, I as an attacker/customer can request the ad agency target everyone that works at (Company X) with my exploit. Job complete, why am I even writing this post? Why do attackers even waste time with Recon? Ad agencies do real-time human tracking as a core competency and business. They track and categorize the human across mobile, TV, and PC. They are really good at what they do.

Maybe I’m being paranoid. I need more facts to back up my previous thoughts of *innocent ad companies tracking everyone. Maybe users are traversing the dark web or risky websites that would host malicious ads, and I’m making assumptions. Maybe it’s one country targeting users, like Russia? How hard is it to gain attribution? Let’s go to our data:

Top 10 Non-US Traffic Destinations:

You can clearly see a large portion of traffic is going to KR (South Korea), and you might congratulate yourself: “Aha, the South Koreans are after me! It really was my users going to risky websites.” Don’t. Ask the data, please. Let’s pivot to the notebook for suspicious countries and analyze HTTP requests. For this we are going to leverage our URL parsing UDF and parse out the domain name for quick and easy viewing of what all this KR traffic is about.
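
The URL parsing UDF itself is not shown in this post. As a stand-in, here is a minimal plain-Python sketch of the same idea using the standard library; the function name domain_of and the sample requests are hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit


def domain_of(url):
    """Hostname of a URL, lower-cased, without a leading 'www.'."""
    # urlsplit only finds the host when a scheme or '//' prefix is present.
    host = urlsplit(url if "://" in url else "//" + url).hostname or ""
    return host[4:] if host.startswith("www.") else host


# Hypothetical HTTP requests already filtered to one destination country.
requests_to_kr = [
    "http://www.yahoo.com/news",
    "https://ads.yahoo.com/pixel?id=1",
    "yahoo.com/finance",
]

top_domains = Counter(domain_of(u) for u in requests_to_kr).most_common(10)
print(top_domains)  # [('yahoo.com', 2), ('ads.yahoo.com', 1)]
```

Note that this returns the full hostname, not the registered domain, so ads.yahoo.com and yahoo.com are counted separately. Collapsing them would need a public suffix list, which is outside the scope of this sketch.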

What Is the Traffic to South Korea?

Traffic to South Korea Query:

100% of my South Korea traffic is going to Yahoo.

The data shows users weren’t going to risky websites when they went to South Korea (they went to Yahoo). What the data supports is that the internet is a global service and ads are hosted from around the world. Putting this together, you start to see why many of the IOCs in the Grizzly Steppe release are all over the world and from trusted sources. It’s impossible to attribute an attack to a source based wholly on the geographic source or destination of network traffic. Notice that Russia is nowhere to be seen, and I actually haven’t seen as much Russian traffic lately as I had in previous years.

As an attacker, I could leverage an ad or marketing agency to pinpoint exactly who I wanted to target. The ad is not malware and it’s not malicious (though I’d argue that an ad tracking me from my phone, to my PC, to my TV is malicious), and unfortunately it’s not illegal. It’s highly efficient reconnaissance, and attackers will and should take advantage of this service.

Why port scan, URL crawl, or use Recon-ng when I can pay an ad network to supply me with everything I need? An attacker can sit quietly on the sidelines prepping his exploit and hiring out the recon to ad agencies. This makes the discovery of any Reconnaissance Phase difficult: an attacker can now jump straight to the Delivery Phase of the Kill Chain by leveraging ad agencies.

Let’s expand on the idea of a global internet with ads hosted around the world. What is the main source and subject of international traffic? What is Yahoo delivering to me from South Korea? I live in San Francisco, so why does so much of my traffic get processed and delivered by another country? What is coming from these international locations? To help answer these questions:

Top 10 Non-US HTTP Domains:

Ads, ads, ads, and more ads. smartadserver, lijit, stickyadstv, adsrvr, bluekai, google-analytics, all of it ads. This new internet is depressing me. It seems the majority of international traffic is ad networks.

In business, these collected emails, user IDs, and application version detections are used to display “relevant” ads for things I’m never going to buy, but thanks for your effort. In an attack scenario, that same data will be used to determine what sites Daniel visits regularly, where he has other accounts, and a general awareness of his lifestyle. With that information in hand, my spear phishing campaign is beginning to look closer to spear fishing in a stocked pond.

Let me step back to Grizzly Steppe and shared hosting. Many of the IOCs in the Grizzly Steppe report were on domains like Yahoo, BlueOcean, and multi-tenancy platforms. Does our data tell us a story about this?

Top 10 Destination Organizations:

The Top 10 Destination Organizations are ALL major platforms for advertising, lead generation, and marketing. AppNexus is hosting adnxs, Google is googlesyndication, Amazon is hosting springserve, and Akamai is fronting taboola. Thinking about it, this is not a surprise. Every webpage has a dozen ads, so the ratio of good clean internet traffic to ads gets washed out. Come to think of it, I could ask my data for the real answer on average connections per webpage request, and maybe that’s a good indicator of risky websites? Ah, I’m an idea machine that never stops producing ideas. I’m going to work on that, but at some point this blog post must end, because it’s Friday and my coffee is now cold, and not in a good cold-brew type of way.

What is possible with targeted ads?

Outside of Grizzly Steppe and the DNC attack, let’s bring this closer to home with a real-world example. If you work at any publicly traded company, it’s predictable that employees will visit a finance site to check the company’s stock ticker and see how the shares are doing. Sounds reasonable.

I’m going to hire the ad agency to target my ad using a few criteria. First, target users who currently have a vulnerable version of an application running (detected by the ad agency’s PID and application version detection). Second, target only users who work at the target organization. The ad agency knows who I work for because they’ve tracked the websites I’ve visited for months and years, as they host ads on most of the internet (think Facebook ads), and also because of the GEO organization of the src_ip.address. Third, pinpoint this ad directly to [email protected], because you’ve captured his email through a previous ad campaign. Now tie all of these pieces of metadata together and fire off my ad. I could also reverse the ad campaign for a single targeted email and ask the agency to give me the email of the Sales VP who works at JASK. The ad agency has a collection of emails at the target organization and has auto-enhanced these emails with each person’s title and importance within the organization using a site such as LinkedIn.

How about an email that reads something like this one? “This month’s ESPP paperwork needs to be electronically signed. Please log in using the link provided, or open the attachment, sign, and respond in order to approve this quarter’s shares. This must be completed by Friday, as we did not receive your response to our previous email.”

We target this email at users within the target organization with a known application weakness (from the ad agency’s collection of running processes). Sip some coffee… get a little anxious… sip some more coffee… profit.

-Poor internet

**My apologies for the wordy blog post this week. My mind was running and my fast fingers wouldn’t stop typing.