On the Hunt Part 2: Identifying Spear-Phishing Recon Activity - Collection of User Details with Ads for Spear-Phishing Campaigns

A few weeks ago, I published a Base64 decoding article. The findings ranged from process ID numbers and application/version detection to the blatant collection of email addresses. With that in mind, today I’m going to focus on ads. Not adware, not malvertising, just ads. Ads are a massive security hole in our networks and the invasive species of our personal lives. Specifically, I’m focusing on the operational efficacy of ads during the Reconnaissance Phase of a strong spear-phishing attack, inspired by the Grizzly Steppe news.

I see targeted advertising happening day in and day out, but how much personally identifiable information is being collected about users? How much of the internet has been consumed by ads? How can we tell heads from tails, or good ads from bad? I’m casting my net wide to catch what spear-phishing attackers might later use as bait. With my morning coffee in hand and a comfy seat on the office couch, here is the notebook we are going to start with.

Base64 Decoding Emails

This is a slightly modified version of last week’s Base64 Decoding blog post, with a focus on decoded strings that contain emails. (Achieved by adding a simple search for the “@” symbol in the decoded URI string.)
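As a rough sketch of that tweak (plain Python for illustration, not the notebook’s actual Scala UDF; `decoded_email_hit` is an invented name):

```python
import base64


def decoded_email_hit(token: str):
    """Decode a candidate Base64 token; keep the plaintext only if it
    contains an "@", i.e. it looks like a harvested email address.
    Returns None for invalid Base64 or non-email plaintext."""
    try:
        plain = base64.b64decode(token, validate=True).decode("utf-8")
    except Exception:
        return None
    return plain if "@" in plain else None
```

Run that over every extracted Base64 token and the survivors are exactly the hits described below.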

I would love to show you the results, but in the interest of protecting users’ emails, I’m going to ask you to trust me: there are hits. I’m not saying buckets of emails are leaving the network, but a handful of the most clearly identifiable pieces of information (full corporate and webmail addresses) are being collected by marketing and ad agencies around the world (KR, SG, US).

What information can you glean from an email? I can easily identify where all of these people work based on the email domain. With the Gmail, Yahoo, and other webmail addresses, I can also identify login portals to imitate when I aim my exploit and go spear phishing. With emails in hand, we’ve proven the first step: emails are being collected by ads. Remember, this is being done at the ad-agency level to learn as much as possible about each customer and supply perfectly targeted ads. Only good things come from targeted ads :)

Based on the data being collected by the ad agency, I, as an attacker/customer, can request that the agency target everyone who works at Jask.io (Company X) with my exploit. Job complete; why am I even writing this post? Why do attackers even waste time with recon? Ad agencies do real-time human tracking as a core competency and business. They track and categorize people across mobile, TV, and PC. They are really good at what they do.

Maybe I’m being paranoid. I need more facts to back up my previous thoughts of “innocent” ad companies tracking everyone. Maybe users are traversing the dark web or risky websites that would host malicious ads, and I’m making assumptions. Maybe it’s one country targeting users, like Russia? How hard is it to gain attribution? Let’s go to our data:

Top 10 Non-US Traffic Destinations:

You can clearly see a large portion of traffic is going to KR (South Korea), and you might congratulate yourself: “Aha, the South Koreans are after me! It really was my users going to risky websites.” Don’t. Ask the data, please. Let’s pivot to the notebook for suspicious countries and analyze HTTP requests. For this we are going to leverage our URL-parsing UDF and parse out the domain name for quick and easy viewing of what all this KR traffic is about.

What Is the Traffic to South Korea?

Traffic to South Korea Query:

100% of my South Korea traffic is going to Yahoo.

The data shows users weren’t going to risky websites when they went to South Korea (they went to Yahoo). What the data supports is that the internet is a global service and ads are hosted from around the world. Putting this together, you start to see why many of the IOCs in the Grizzly Steppe release are scattered all over the world and come from trusted sources. It’s impossible to attribute an attack source based wholly on the GEO source or destination of network traffic. Notice that Russia is nowhere to be seen; in fact, I haven’t seen as much Russian traffic lately as I had in previous years.

As an attacker, I could leverage an ad or marketing agency to pinpoint exactly who I wanted to target. The ad is not malware, and it’s not malicious (though I’d debate whether an ad tracking me from my phone to my PC to my TV is malicious), but it’s not illegal, unfortunately. It’s highly efficient reconnaissance, and attackers will take advantage of this service.

Why port scan, URL crawl, or use Recon-ng when I can pay an ad network to supply me with everything I need? An attacker can sit quietly on the sidelines prepping his exploit and hiring out the recon to ad agencies. This makes discovery of any Reconnaissance Phase difficult; by leveraging ad agencies, an attacker can now jump straight to the Delivery Phase of the Kill Chain.

Expanding on the idea of a global internet with ads hosted around the world: what is the main source and subject of international traffic? What is Yahoo delivering to me from South Korea? I live in San Francisco, so why does so much of my traffic get processed and delivered by another country? To help answer these questions:

Top 10 Non-US HTTP Domains

Ads, ads, ads, and more ads. smartadserver, lijit, stickyadstv, adsrvr, bluekai, google-analytics, all of it ads. This new internet is depressing me. It seems the majority of international traffic is ad networks.

In business, these collected emails, user IDs, and application version detections are used to display “relevant” ads for things I’m never going to buy, but thanks for the effort. In an attack scenario, that same data will be used to determine what sites Daniel visits regularly, where he holds other accounts, and general awareness of his lifestyle. With that information in hand, my spear-phishing campaign begins to look more like spear fishing in a stocked pond.

Let me step back to Grizzly Steppe and shared hosting. Many of the IOCs in the Grizzly Steppe report were on domains like Yahoo, BlueOcean, and other multi-tenancy platforms. Does our data tell us a story about this?

Top 10 Destination Organizations:

The Top 10 Destination Organizations are ALL major platforms for advertising, lead generation, and marketing. AppNexus is hosting adnxs, Google is googlesyndication, Amazon is hosting springserve, and Akamai is fronting taboola. Thinking about it, this is no surprise. Every webpage has a dozen ads, so the ratio of good, clean internet traffic vs. ads gets washed out. Come to think of it, I could ask my data for the average number of connections per webpage request; maybe that’s a good indicator of risky websites? Ah, I’m an idea machine that never stops producing. I’ll work on that, but at some point this blog post must end, because it’s Friday and my coffee is now cold, and not in a good cold-brew type of way.

What is possible with targeted ads?

Outside of Grizzly Steppe and the DNC attack, let's bring this closer to home with a real world example. If you work at any publicly traded company, it’s predictable that employees of the company will go to finance.yahoo.com or finance.google.com to check the company's stock ticker and see how the shares of the company are doing. Sounds reasonable.

I’m going to hire the ad agency to target my ad with a few criteria. Target users currently running a vulnerable version of an application (detected by the ad agency’s PID and application version detection). Target only users who work at the target organization; the ad agency knows who I work for because it has tracked the websites I’ve visited for months and years while hosting ads across most of the internet (think Facebook ads), and also via the GEO organization of the source IP address. Further pinpoint the ad directly to a single user’s address, captured through a previous ad campaign. Now tie all of these metadata pieces together and fire off my ad. I could also reverse the ad campaign for a single targeted email and request the address of the Sales VP who works at Jask. The ad agency has a collection of emails at the target organization and has auto-enhanced them with each person’s title and importance within the organization via a site such as LinkedIn.

How about an email that reads something like this? “This month’s ESPP paperwork needs to be electronically signed. Please log in to the link provided, or open the attachment, sign, and respond in order to approve this quarter’s shares. This must be completed by Friday, as we did not receive your response to our previous email.”

We target this email at users within the target organization with a known application weakness (from the ad agency’s collection of running processes). Sip some coffee…get a little anxious…sip some more coffee…profit.

-Poor internet

**My apologies for the wordy blog post this week. My mind was running and my fast fingers wouldn’t stop typing.

From Targeted Attack to Rapid Detection

Yesterday I was hit with a targeted phishing email that was incredibly good. The email was terse and gave me a seven-hour window in which to open the attachment and verify the invoice. The attachment was named after me and even came from a valid business domain. Simple yet effective, and no broken English. It looked good, minus one thing…nobody ever wins free money, and if you want me to send you money, I’m sure you will call me rather than password-protect my invoice.

I’ve been extremely cautious with email since the recent Gmail phishing data-URI technique, so this email landed while I was alert. What to do? Turn this email into a handful of signals and feature vectors for JASK. Let’s get to it.

Step One: I searched the web for a match of the file hash and nothing came up. Not surprising. Still, evidence is evidence, and I put this into JASK as a piece of threat intel.

Step Two: Using oledump I checked out what might be inside.

No macros were showing up in the file. Maybe that’s because it’s password-protected? I’m no Word file or oledump expert, and my goal is to quickly transform this into actionable intelligence. I tried unzipping as well and received what looks like a corrupted-file message:


Since EncryptedPackage seemed like an interesting string in the file, I decided to start focusing on detecting encrypted Word Doc files in JASK with a Yara signature.

I settled on the EncryptionTransform piece in the file and pulled out the hex for its equivalent:


My goal would be to detect Word files with EncryptionTransform in them. Maybe I’ll get a lot of false positives, but I’ll let JASK handle the decision-making process.

Step 3: Write the Yara signature for detecting encrypted Word files:
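The rule I ended up with looked roughly like the sketch below (a hypothetical reconstruction, not my exact production signature; the OLE magic-byte anchor is one reasonable way to scope the string match to Office compound files):

```yara
rule Encrypted_Word_Document
{
    meta:
        description = "Office compound file containing the EncryptionTransform marker (sketch; expect false positives)"
    strings:
        $ole = { D0 CF 11 E0 A1 B1 1A E1 }   // OLE2 compound file magic bytes
        $enc = "EncryptionTransform" wide ascii
    condition:
        $ole at 0 and $enc
}
```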

I originally predicted this file would have a malicious macro in it, and I wanted to find Word files with macros in them. A Yara signature for that was already floating around the web and can be read about in the link provided, which saved me some work on writing that signature.

Step 4: One last piece of evidence was the SMTP “from” header. I figured why not; the more evidence I can pile into JASK, the better. First, I would prototype something in notebooks and see what my SMTP metadata fields looked like, searching the headers.mailfrom field for my phishing attempt’s sender. To protect the compromised business and users’ emails, and to prevent more spam or targeted phishing attempts, I’ve replaced some IPs and email addresses with my own for this screenshot.

The results of this modified query, searching for the from address, show that six other hosts on our network received a number of emails from this specific sender. A possible sign of emails that hit the spam filter or other security devices; maybe just my slick, well-crafted email made it through.

Once I have my query completed, I can quickly turn this headers.from address into a pattern and give it an initial weight and kill-chain attribute for JASK to use.

I’m done. I’ve added a handful of feature vectors for identifying this phishing attempt. I cast my net wide to create signals matching macros and encrypted Word documents, plus some specific signals to match file hashes and the sender related to this specific attack.

Lesson of the day? Don’t sleep on intelligence. If your users are going to get phished, you need to rapidly turn as many features of that attempt into actionable intelligence to have an early warning next time.

IDS Autopsy


If you’re in the IT security industry, you’ve certainly heard that IDS is dead. It’s funny to hear technology personified this way. Someone call 1110001111! The thought of a security technology expiring can be daunting. Are we losing the battle? What’s the cause of death? The list of causes includes high numbers of false positives, a lack of security analysts, underperforming hardware, and so on. However, in this post we’ll examine IDS evasion as one of the contributors to its demise. If you’ve been around the security industry for a while, you’ll surely remember some of the techniques that we’ll explore in the Cat and Mouse section. After the nostalgia party, let’s change perspectives toward the future and take an optimistic look at what’s in store for the cyber warriors out there.

IDS can mean a lot of different things, so before we get started, let’s disambiguate a bit. This article deals with network intrusion detection systems (NIDS). You know, the devices that hang out on the edge of your network, snooping on everything and sending you tons of alerts. It’s the one screaming for your attention, like the car alarm in the parking lot at 2:00am.

Cat and Mouse

Circumventing IDS has been a game played for decades. Below I’ll reflect on some of the techniques that were successfully used to sidestep IDS detection. This is important to understand: the ease of avoiding security controls is a primary reason for the death of a security technology.

Obfuscation - Fragmentation

An obvious way to sidestep a network IDS is to simply scramble the data in a way that’s easy for the recipient to deal with but difficult for the IDS. For instance, if a signature is looking for the phrase “Command: Download all credit-card data”, we can simply break it into chunks (“ownl”, “all cr”, “edit-c”, “Comm”, “ard da”, “and: “, “D”, “oad ”, “ta”) before sending it along out of order. At first this was done at layer four; the chunks were TCP/UDP datagrams. This is effective because the receiver of the message can easily reassemble it, and if it can’t, signaling for a retransmission is trivial. A network IDS listening to every conversation on the network cannot easily keep up and perform the required reassembly. IDS preprocessors were introduced to deal with this: with a layer-four stateful preprocessor, the IDS could reassemble the datagrams in the right order before checking for any pattern matches. We’re all fixed, right? Not quite yet. Why not break the datagrams into smaller pieces and scramble those too? IP fragmentation is another technique to do the same thing, but now at layer 3.
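The chunking trick above can be demonstrated in a few lines (an illustrative sketch, not wire-level code; integers stand in for TCP sequence offsets):

```python
# The chunks from the example above, tagged with TCP-style sequence order
# and listed out of order, as they might arrive on the wire.
arrived = [
    (2, "D"), (5, "all cr"), (0, "Comm"), (7, "ard da"),
    (3, "ownl"), (1, "and: "), (8, "ta"), (6, "edit-c"), (4, "oad "),
]
SIGNATURE = "Command: Download all credit-card data"

# A naive per-packet IDS sees no single chunk containing the signature...
per_packet_hit = any(SIGNATURE in chunk for _, chunk in arrived)

# ...while a stateful preprocessor that reorders by sequence number does.
reassembled = "".join(chunk for _, chunk in sorted(arrived))
stream_hit = SIGNATURE in reassembled
```

No individual chunk trips the signature, but the reordered stream matches it exactly.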

This type of fragmentation technique requires low-level access to the operating system to perform, making it a heavier lift. It’s why we only see this technique used for attack traffic, vs. command-and-control, where much easier channels are available. It’s also easy to prevent: IP fragmentation on modern networks is very unusual and typically indicates a network configuration problem. Blocking IP fragments is safe to do and will prevent layer 3 fragmentation from being effective.

Special Encoding - Directory Traversal

Web directory traversal was a popular way for a hostile web client to reach outside the intended bounds of the remote server’s filesystem. The technique allows attackers not only to access files that were never meant for anonymous web-client access, but also to interact with servers in naughty ways. It manifested as a web request with a large number of parent-directory paths chained together. Generally it looked something like this:

Those dot-dots are a relative link pointing to the parent directory; stringing together several parent links allows the hostile actor to “traverse” the directory tree to areas of the server’s filesystem typically not for public consumption. The beauty here is that the parent of the root directory is itself, so the bad guys don’t need to know how deeply nested the web content directories are: there is no such thing as too many parent-directory references.

To identify the activity, IDS signatures were created that looked for a series of “../” to alert security analysts that an attack was underway.

It wasn’t long before a disturbingly easy way around detection was discovered and used. By encoding the individual characters as their percent-encoded equivalents (i.e. ‘../’ becomes ‘%2E%2E%2F’), the cat-and-mouse game begins. When this new technique is discovered in the wild, a signature is written looking for a series of “%2E%2E%2F”s. We’re safe, right? Nope. Now we encode the percent sign (%) with a technique called double encoding. After our defenses are fortified against double encoding, we should be good, right? Nope, now let’s use Unicode...
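To make the encoding rounds concrete, here’s a sketch (helper names are illustrative; urllib’s `unquote` plays the role of the server’s decoder):

```python
from urllib.parse import unquote


def percent_encode(s: str) -> str:
    """Percent-encode every character: "../" -> "%2E%2E%2F"."""
    return "".join(f"%{ord(c):02X}" for c in s)


def double_encode(s: str) -> str:
    """Hide the percent signs themselves: "%2E" -> "%252E"."""
    return percent_encode(s).replace("%", "%25")


# A server that decodes twice recovers the traversal, while a signature
# looking for "../" or "%2E%2E%2F" sees neither form on the wire.
wire = double_encode("../")   # "%252E%252E%252F"
once = unquote(wire)          # "%2E%2E%2F"
twice = unquote(once)         # "../"
```

Each round of encoding defeats a signature written for the previous round; that is the whole game.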

Those in the industry often complain about IDS false positives, but false negatives, which by their nature are difficult to measure in an operational environment, are far more likely and much worse for you than false positives. So what’s the solution here? Think about it: what can we do so we’re not dependent on these strict signatures? We’ll get to that in “The Future” section of this post.

Low and Slow

Performing reconnaissance against a target is the first step of most nefarious activities. When this reconnaissance involves probing IP addresses and ports, your IDS is there to help you identify it. That is, unless it’s very slow. IDS detects network host reconnaissance with rate-over-time detections; if you move slowly enough, you’ll slip right by. The IDS threshold can be loosened, but there’s a consequence: loosen it too much and everything begins to look hostile; your IDS becomes paranoid. In other words, you increase the false-positive rate until the detection is rendered useless, like a whooping car alarm from the 90’s.
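The rate-over-time idea reduces to a sliding-window count, sketched here (illustrative thresholds, not any product’s actual detection logic):

```python
def scan_detected(probe_times, window=60.0, threshold=10):
    """Flag a source that fires `threshold` probes within any sliding
    `window` of seconds. A scanner that spaces probes wider than the
    window never trips it -- the low-and-slow evasion in a nutshell."""
    times = sorted(probe_times)
    for i in range(len(times) - threshold + 1):
        if times[i + threshold - 1] - times[i] <= window:
            return True
    return False
```

Ten probes in ten seconds trip the alarm; the same ten probes spread over fifteen minutes sail through.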

Undetectable C2 - DGA and IP Address Fails

Establishing communications to external agents from a compromised machine is essential to hostile operations. Industry jargon calls this Command and Control (C2). C2 is the channel by which commands are conveyed, and sometimes information is extracted. Early C2 channels included static (always the same) IP addresses, and domain names. Although the protocols used varied, HTTP (TCP/80) became an early favorite as it’s almost always allowed outbound from your network, and is surrounded by loads of beautiful noise.

IPs and domains were easy for network defenders to understand and guard against. They can be monitored, blocked, and sinkholed, and IDS signatures often include loads of IP addresses and domain names of known ‘bad’ assets. Even when protection of this kind is effective, it’s a management nightmare. Most security operations I’ve encountered have very good processes for reacting to new known-hostile IP addresses and domain names, but little in the way of expiring these threat indicators. To find out if an organization manages these signatures well, simply ask, “What’s the oldest hostile domain or IP address you watch for?” If they say “I don’t know”, “This one from 1998…”, or “our threat vendor handles all of that”, you may have a problem.

The state of the art for attackers is now to use something much more unpredictable when establishing command and control: DGA domains. DGA stands for Domain Generation Algorithm. An infected host uses a DGA to persist its C2 connection via a quasi-unpredictable domain name that changes frequently. This avoids signature detection altogether. IDS has no hope of identifying these domains and the IP addresses associated with them; the hostile assets simply don’t live long enough for the IDS signature creation process to be effective.
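For a feel of why signatures can’t keep up, here’s a toy DGA (purely hypothetical; real families use far more elaborate seeds and mutations):

```python
import datetime


def dga_domain(date: datetime.date, length: int = 16) -> str:
    """Toy DGA: a date-seeded linear congruential generator turns the
    date into a pseudo-random domain, so malware and operator can each
    compute the same rendezvous point daily with no static indicator."""
    state = date.year * 10000 + date.month * 100 + date.day
    label = ""
    for _ in range(length):
        state = (state * 1103515245 + 12345) & 0x7FFFFFFF  # LCG step
        label += chr(ord("a") + state % 26)
    return label + ".info"
```

Yesterday’s domain is worthless as an IOC today, and tomorrow’s doesn’t exist yet, so there is nothing stable for a signature to key on.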

Everything else - Prototyping Evasion Techniques

We’ve talked about a few of my favorite evasion techniques, but the biggest problem with IDS is that most places run the same signature set, which can easily be tested against. If you’re a bad guy trying to avoid detection, it is very easy to manufacture a communication pattern (attack, C2 technique, etc.) that slips past it. Because of this, even novice hostile-content authors can iterate on the weapon until it comes out squeaky clean. Got ETPRO? You’ve just installed a lock on your network that every bad guy has an exact copy of to practice their picking techniques on. Sure, having an IDS with common signatures deployed is better than nothing, but not by much.

The Future!

Like a phoenix from the ashes, the next evolution of security technology is rising. And it involves this subtle, but important, paradigm shift: The artifacts of active network monitoring should be a compact, detailed summary of ALL activity. They should not presume to detect attacks by emitting alarms while ignoring ALL other connections that did not match a pattern. They should emit high-confidence metrics of all activity occurring across the monitoring interface.

It’s what we do with the network summaries where the future detection paradigm becomes extremely compelling. We can apply statistical analysis, machine-learning algorithms, and enhancements to the data. Most of these enhancements only become valuable within the frame of reference of the network being monitored, thwarting the single-lock problem. In addition, we can apply artificial intelligence to automate analysis, discovering patterns of interest from the mountain of data fast enough to rival an entire SOC’s worth of trained analysts. (In fact, this is JASK’s primary mission, and we’re doing it.)

Threat Intelligence now becomes much more powerful. Since most indicators of compromise from threat intelligence are contained within the network summaries (i.e. IP addresses, host/domain names, email addresses, file hashes, etc.), they can be examined after the fact. Intelligence indicating a hostile IP address yesterday offers little help to an IDS today. By storing these traffic summaries, we can apply threat intelligence over the period when it was known to be hostile, further empowering the analytics and analysts (AI or human).

So, let’s look now at the new paradigm against the attacks mentioned previously. First, the web directory traversal attack:

Here’s a sample of what a modern sensor would collect and report back of EVERY web connection. This is the “traffic summary” mentioned in this article.

Many enhancements can be attached to this summary, sourced from machine-learning algorithms, statistical analysis, static matches, etc. Here are a few to get the juices flowing:

  • GeoIP of source, ASN name/number
  • Abnormally short client header
  • Abnormal URI - Repeated character sequence observed (“../”)
  • Abnormal URI - Rare resource requested (/Windows/System32/config)
  • User-Agent Levenshtein distance anomaly (Mozillla -> Mozilla)

You should start to realize how difficult it’s going to be for an adversary to avoid tripping all of these enhanced features around the attack. Many of them derive from statistical inference (e.g. the ones with ‘rare’ in the name flag behavior that is unusual for what that host/service normally does). Let’s look at the other example we mentioned earlier: command and control to a DGA domain.

Here are some enhancements that can be applied to this traffic summary:

  • GeoIP destination, ASN name/number
  • Unusual connection to foreign host
  • Web host property carries a high entropy domain name (appears random: clkwhrlxiehlhivhwlkj1l3jk3hls.info)
  • Base64 encoded text in URI (cc=V293LCB5b3UndmUgd29uIGEgZnJlZSBKQVNLIGRlbW8h)
  • Internal host not a normal web client
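The “high entropy” enhancement above is cheap to compute. Here’s a sketch using Shannon entropy over the domain’s characters (the numbers in the comments are illustrative, not tuned thresholds):

```python
import math
from collections import Counter


def shannon_entropy(s: str) -> float:
    """Shannon entropy in bits per character: dictionary-word domains
    score low, machine-generated DGA-style labels score high."""
    counts = Counter(s)
    return -sum((c / len(s)) * math.log2(c / len(s))
                for c in counts.values())


dga_score = shannon_entropy("clkwhrlxiehlhivhwlkj1l3jk3hls")  # high
word_score = shannon_entropy("google")                         # lower
```

A simple cut-off on this score, combined with the other enhancements, is enough to surface DGA-style names without ever writing a per-domain signature.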

What’s not shown above is the millions (or billions) of traffic summaries surrounding these two with nothing of particular interest bubbling up. It’s the AI’s turn now to stitch these two artifacts together, to begin telling a story of host compromise. Exciting, right? I think so.

Wrapping it up

So it seems IDS is dying, but it’s not dead yet. Many mature security operations rely heavily on the IDS. It’s still very good at identifying known remote-access trojans (RATs) and at being a configuration point through which human analysts can apply control. We looked at only a small sample of IDS evasion strategies, which hopefully frames the problem and hints at a solution. There’s a smarter way to perform detection that doesn’t rely on the IDS signature lifecycle: collect data on everything, allow offline analysis to apply genius algorithms that enhance the data, and enable AI to connect the dots for us.


On the Hunt - Threat Hunting with Base64 Decoder

Every now and again you hit a day where you just feel like scrolling. One of those lazy, rainy days just before the holidays. Today is one of those days, and that’s where my less efficient threat-hunting ideas come from. Today I’m playing with extracting Base64 strings from HTTP URIs, HTTP cookies, and just about anywhere else I can find Base64 strings in a network feed. Let’s get to it!

The first thing we need is to write a Base64 extraction function; I need some coffee this morning and one massive brain push for this trick. The goal is to search for any strings that look like they could be Base64. Accept this regex as elementary and not the “best of the best” for Base64 string detection; it’s our quick start to prove a hypothesis that something is hiding in our network via Base64.


Breaking our Regex down


This is the lookbehind for an equal sign that marks the boundary of a Base64 string. The reason it’s a lookbehind is that the decodeBase64 function needs a 4-byte string, and the = sign doesn’t need to be extracted with the full string.


This is matching any sequence of letters and numbers occurring any number of times. This is likely where the most improvement can be made in my regex. Maybe on a sunnier day.


This matches two equal signs to show the end of a Base64 string.

Now that we have a regex, let’s test it and find lots of great matches for Base64 encoded strings.
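For reference, here’s a hedged, simplified stand-in for the notebook’s pattern (sketched in Python rather than the notebook’s Scala, and it only catches Base64 runs ending in the two-character “==” padding; the URI is an invented example):

```python
import re

# A run of Base64-alphabet characters terminated by "==" padding. Crude
# by design: it's the quick hypothesis-testing shortcut, not a complete
# Base64 grammar.
BASE64_RE = re.compile(r"[A-Za-z0-9+/]+==")

uri = "/track?uid=Zmlyc3QubGFzdEBjb21wYW55LmNvbQ==&v=1"  # illustrative
match = BASE64_RE.search(uri)
```

The `uid=` prefix and `&v=1` suffix fall away on their own, because `=` and `&` aren’t in the character class.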

Part One of our task is complete: we’ve built a Base64 string detector. Apply this function to a network data stream, and now we are matching and displaying HTTP URIs with Base64 inside them.

Part Two is extracting these Base64 strings. It’s one thing to simply find them; the real trick for me was extracting only the Base64 string within the URI. The flexibility of JASK is perfect for this task, and we can utilize Spark to write an extraction function. Let’s get to it! We define the variable pattern as our previous regex and build a function to extract the Base64 string that matches this pattern.

Now we are cooking with bacon! We have our getBase64 function for extracting Base64 strings registered as a UDF to use anywhere in our notebooks. Next we need a Base64 decoding function. I’m lazy today and it’s raining; let me see if there isn’t already a function for this. Got it! https://github.com/scalaj/scalaj-http. I’m going to import scalaj.http.Base64 and call it my lucky day. Remember, we are being lazy hunters today; time to register this as a UDF.
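Outside the notebook, the extract-then-decode pair can be sketched like this (Python’s base64 module standing in for scalaj’s codec; in Zeppelin each piece is registered as a Spark UDF instead):

```python
import base64
import re

# Same regex idea as above: a Base64-alphabet run ending in "==" padding.
PATTERN = re.compile(r"[A-Za-z0-9+/]+==")


def get_base64(uri: str) -> str:
    """Extract the first Base64-looking token from a URI, or "" if none."""
    m = PATTERN.search(uri)
    return m.group(0) if m else ""


def decode_base64(s: str) -> str:
    """Decode a token, returning "" for anything that isn't valid Base64."""
    try:
        return base64.b64decode(s, validate=True).decode("utf-8")
    except Exception:
        return ""
```

Chained as `decode_base64(get_base64(uri))`, this reproduces the notebook’s pipeline on a single string.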

Job done! Now I can call my getBase64 extraction function first and feed the results to our decodeBase64 function, which returns the Base64-decoded string. That’s it! Now let’s do this at MASSIVE SCALE!

The results are fun. We’ve found process tracking, device fingerprinting, and plenty of ads pulling the email addresses of logged-in users. An interesting (disgusting) way of user-ID tracking. I also applied our function to the HTTP cookie data and found a different set of fun findings, more interesting than you would expect, but I’m going to keep those between JASK and our affected customer.

Here’s a quick screenshot of raw results from the last day:

Rainy-day threat hunting: testing a hypothesis and having fun. Lots of scrolling through results, which is exactly what I felt like doing on this lazy, rainy day in San Francisco. I’ve committed this Base64 decoding notebook to our JASK clusters for customers to take advantage of, so please come join us! I also converted some of what we found into signals for our AI to learn from. The holidays are almost over, and I’m ready to go back to work!


The Dangerous Rise of Ransomware

Ransomware is a relatively new type of cybersecurity threat. It amounts to an attacker taking and encrypting your valuable data, then charging you to decrypt it. The idea came about 10 years ago as a theoretical concept called “cryptovirology”. Although the idea is not new, it has only become a real threat in recent years. The economics of ransomware differ from those of the threats that came before it, and these new economics give hope to cyber-defenders hoping to combat it successfully.

First, there is money in trafficking ransomware. The criminal usually demands to be paid in bitcoin to decrypt, and bitcoin fits this need perfectly; it is hard to trace and easy to launder. In US dollars, the amounts demanded were initially in the low hundreds but are steadily climbing; some estimate that $1 billion USD will be paid in 2016. Compare that to spam botnets, where criminals make pennies per bot and the actual income from spam email click-throughs has plunged to almost nothing. If there’s money to be made, criminals will focus on the most effective method with the highest payout. Today that happens to be ransomware.

Second, the business of ransomware is scalable. When a new tool becomes available on the hacker market, criminal organizations mount campaigns, just like sales and marketing departments around the world advertising their product. Much like a successful commercial, each of these campaigns continues as long as it makes money. But if a threat is widespread, and therefore scalable, then defense against it becomes scalable too. There are enough artifacts to effectively study the campaigns and build defenses based on the behavior of the campaign, not the specific signatures used. This behavioral defense is more sustainable and can limit the life of ransomware campaigns.

Third, ransomware, surprisingly, relies on open source. Ransomware has started to appear in GitHub repositories, where it is modified by other hackers to create new variants. While this may sound scary, compare it to another threat of past years: zero-day exploits that were secretly developed and possessed by only a few actors around the world. If hackers have access to open source, then security product developers have access to it as well. For those who are active members of the open-source community, this puts the cyber-defender on a more even footing.

Ransomware represents a new combination of economic factors in a cybersecurity threat. The revenue stream is more direct, from the consumer to the criminal, with no middlemen. It operates on a larger scale, and it does not rely as much on limited-supply inputs. This attracts a lot of attention and innovation from the malware community, but it also gives security products a chance for strong innovation.

As Chief Data Scientist at JASK, I study the network behavior and tools of ransomware to better defend companies against a dominant threat in cybersecurity today.

From Big Data to Beautiful Data: Bridging the gap from Threat Hunter to C-Suite graphs with Zeppelin notebooks and D3

In my previous posts we worked through a number of threat hunting queries and data mining ideas, leaving off with how to demonstrate and translate value to the C-Suite. This has led me into the realm of presenting data in beautiful ways. At JASK, customers access big data with Zeppelin notebooks, but Zeppelin ships with only a small number of chart types and begs for better implementations of beautiful data. A pie chart and a bar chart are not going to cut the mustard when demonstrating value up the chain. Cue D3 (https://d3js.org/) and its infinite flexibility in displaying beautiful data.

Working on the cluster from one of our research sensors at a very large tech university, we've written a function to parse top-level domains (the .com, .org, .net portion of a URL). Using the java.net.URL class, we extract the TLD and search for suspicious TLDs in HTTP request headers. Here is where we apply our TLD UDF (a Spark user-defined function) to the dataset.
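The original notebook code is shown as a screenshot, but as a rough illustration of the idea (a Python sketch, not JASK's actual Spark UDF; the function name and registration call are assumptions), the core TLD-parsing logic looks something like this:

```python
from urllib.parse import urlparse

def extract_tld(uri):
    """Return the top-level domain of a URI, e.g. 'com' for http://example.com/."""
    host = urlparse(uri).hostname or ""
    parts = host.rsplit(".", 1)
    return parts[-1].lower() if len(parts) > 1 else ""

# In a PySpark notebook this function could be registered as a UDF, e.g.:
#   spark.udf.register("tld", extract_tld)
# and then called from Spark SQL as tld(request.uri).
```

From there, a WHERE clause can match the extracted TLD against a watchlist of suspicious TLDs.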



This query results in your standard big data row/table type of result. (Something an analyst might consume)



Now it’s time to start the Beautiful Data transformation! (Something the C-Suite can consume)

Here we print HTML and JavaScript within a Zeppelin notebook against JSON data output. Instead of staring at rows and columns of big data, beautiful data translates up the management stack and helps tell a clearer story of the threat hunter's findings.
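As a hedged sketch of that transformation (the row data and element IDs below are made up for illustration), the query output can be serialized to JSON and embedded in `%html` output for a D3 snippet to consume:

```python
import json

# Hypothetical (tld, count) rows, as returned by a TLD frequency query.
rows = [("com", 9120), ("net", 640), ("info", 87), ("top", 12)]
data = json.dumps([{"tld": t, "count": c} for t, c in rows])

# Zeppelin renders output beginning with %html as markup, so the JSON
# payload can be handed straight to an inline D3 script.
print("%html")
print("<div id='tld-chart'></div>")
print("<script>var data = " + data + ";"
      " /* d3.select('#tld-chart') ... build a bar chart from data */"
      "</script>")
```

The D3 code itself is then free to render any chart type, which is exactly the flexibility the built-in Zeppelin charts lack.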

The write-once, use-forever concept works wonderfully with Zeppelin + D3. In this example we graphed TLDs, but we could easily represent a different threat hunting dataset with this graphing method. Graphing makes it easy for everyone to see the most and least frequently visited TLDs, and that's the job of beautiful data. We've since applied the same TLD notebook to all of our customers' clusters so they can experience their own beautiful data.

Why We Picked Tensorflow for Cybersecurity


When I started in security analytics several years ago, the choice of tool and platform was typically dictated for you, usually by investments the company had already made. These days, scientists have the opposite problem: a dizzying array of tools in a variety of licensing modes. The frustrations of limited toolsets have been replaced by the anxiety of choice. As wonderful as unlimited options may seem, in reality we must limit our options in order to be successful. Ideally, an organization can converge on a single choice: not a perfect one, but one that maximizes benefit while decreasing the burden of maintenance.

At JASK, we have chosen a toolset that we think does that: Google Tensorflow.  At a high level these were the reasons:

  • Data science needs a toolset that can take advantage of CPUs, GPUs, or a mix of both.
  • A product for model building must recognize that the best language for modeling is not the best language for algorithms.
  • The experiences of local development and cluster development should be the same.

We need more cowbell.

It seems intuitive to use as much processing power as a piece of hardware offers; unfortunately, we rarely have this option. Most notebooks and workstations have a combined GPU/CPU on board (not always NVIDIA), and high-performing GPUs are a special option on most servers. On the other hand, while a GPU is fantastic at certain problems (matrix multiplication, for example), no class on GPU programming would tell you to do everything on a GPU. If you did hear this in a class, I recommend supplementing it with a course like Heterogeneous Parallel Programming. Tensorflow meets this requirement: I can develop on a laptop with no GPUs, then run the same code on a cloud instance with an array of GPUs installed.

A statistician and a mathematician walk into a bar …

Back at university, Computational Finance and Applied Mathematics shared some faculty and even attended the same graduation ceremony. Yet all their coursework was in R and ours was in Matlab, which I think is the most concise illustration of model building versus algorithm building in terms of software tools. Here's another: some believe in having minimal knowledge of each algorithm's inner workings and a wide view of all the possibilities and available tools, while others believe in understanding fewer algorithms, but deeply enough to program them yourself. I now have a theory for the likely reason behind this: your position on that spectrum is a function of how much hate and fear you have for C and C++ programming. To unite these examples: the quants and the applied mathematicians both knew Python, and to take advantage of decades of numerical optimization you have to do it in C (or, let's face it, Fortran). ML solutions must be built on something that can bridge these two worlds; Tensorflow's Python model code, compiled down to C, builds that bridge.

Anyone know a pop culture reference about parallel programming? 

As much as I would like every data scientist in the world to have their own Hadoop cluster, we know that's not going to happen. Also, in line with Moore's law, today's laptop surpasses the mainframe I helped my father load punch cards into when I was little. Doing your development on clusters is expensive, and debugging and testing become problematic as well. I have found that I am more willing to give up some application performance than to give up easy debugging and testing. With some education, data scientists can be persuaded to do their development with "small data" and to treat cluster parallelism and performance as a separate step. The ability to develop, test, and run on a local machine and then treat parallelization as a configuration step is a very nice thing about Tensorflow.

Does Tensorflow have everything we need? While baked-in visualization and a large user community are very beneficial, I would trade them in a heartbeat for a tool that ran GPUs from different vendors. And while Tensorflow was our choice, there are other good options to evaluate for yourself. Your mileage may vary; when deciding what's the best tool for you, I recommend also looking at Theano, DSSTNE, and sklearn to see if one of them is a better fit.

But as a team, you have to start somewhere, and my experience has shown that "somewhere" should be reasonably close to what production will look like, with enough capability that you are not greatly limited or required to maintain 50 different software packages for 50 problems.



Threat Hunting with your hands tied - This is Big Data Part II


Threat hunting isn't only about finding compromised assets; it's also about performing the predictive function of finding the holes a malicious attacker might take advantage of. As I mentioned last week, your customers are your best hunters, accessing your website in a million different ways, with a thousand different web browsers and hundreds of different types of devices. That's before counting the automated mass vulnerability scanners, such as Shodan, or research projects like MassScan, that are probing your applications as well. Today I'll share some of my queries, and I hope you share some of your most recent hunting exercises and queries with me.

At JASK we utilize Hadoop and Zeppelin notebooks. This allows us to write functions in Spark and query our data using Spark SQL syntax. It also allows us to export notebooks as JSON to share with the security community, and to work with our customers and the threat hunting community to build even more powerful notebooks and applied research. Now, onto the data.

Searching for DNS non-authoritative answers for customer domains:

The results showed a large number of hosts querying the internal DNS server for customer.com.customer.com (for example, jask.com.jask.com). The internal DNS server had no record for this, so the query would be forwarded to an external DNS server. This looked strange, and we realized this misconfiguration would point all users to the CMS licensing manager page, since this particular domain was not registered under their license. I would categorize this as information disclosure: it disclosed the CMS server version and dropped everyone, internal and external users alike, at the admin login page of the CMS. It also turned out they were running a vulnerable CMS version. Were they exploited yet? We had been in this POC for a few weeks and could query our data to determine whether anyone had accessed the CMS admin page while we were in place. We were also able to close the loop and write a rule to produce a signal for logins to the admin page. Oftentimes the business will decide this is not a risk, and we simply keep it in our hunting notebook.

The zeppelin paragraph:

SELECT src_ip.address
FROM dns
WHERE authoritative != true AND query LIKE "%.jask.com"
GROUP BY src_ip.address
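The same customer.com.customer.com pattern can also be spotted programmatically; here is a hedged Python sketch (the function name and inputs are mine, not from the original notebook):

```python
def has_duplicated_zone(query, zone):
    """Flag DNS queries like 'jask.com.jask.com', where a resolver's
    search domain has been appended to an already-qualified name."""
    suffix = "." + zone
    if not query.endswith(suffix):
        return False
    prefix = query[: -len(suffix)]
    # The remainder is either the zone itself or another name ending in it.
    return prefix == zone or prefix.endswith(suffix)
```

A check like this can run over exported DNS query logs when a notebook isn't handy.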

Building on the CMS information disclosure story mentioned earlier, here's the query we used to perform a historical check and determine whether anyone had accessed the vulnerable CMS.

SELECT src_ip.address
FROM http
WHERE request.uri LIKE "%CMSSiteManager%"
AND src_ip.address NOT LIKE "192.168.%"

Non-Standard software - User-Agents:

Most of the customers I've worked with function like the Wild West: BYOD and no managed software or hard and fast policies. Every now and again you get an easy one, where the customer maintains an approved software list and possibly even an approved web browser. This makes for easy anomaly hunting, or "never have I seen X" hunting. If we see anything that does not match the customer's approved user-agent, we have a finding worth chasing. Below is a basic sample Zeppelin paragraph; usually you'll add more to the query, such as an internal subnet to hunt or a regex of acceptable user-agents, but I'll leave that to your imagination and your specific hunting exercise. Here we look for any user-agent other than the approved IE 11 string. It's a simple one, but it gets your mind thinking.

SELECT src_ip.address, dst_ip.address
FROM http
WHERE request.headers['USER-AGENT'] != "Mozilla/5.0 (compatible; IE 11.0; Win32; Trident/7.0)"

Maybe you just want to see what your top 10 most popular user-agents are?

SELECT request.headers['USER-AGENT'], COUNT(*)
FROM http
GROUP BY request.headers['USER-AGENT']
ORDER BY COUNT(*) DESC
LIMIT 10

Maybe you just want the distinct user-agents in your network? This query has found anti-virus agents fetching update lists and validating their license keys through a base64-encoded User-Agent string. Lame…
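To chase those base64-encoded User-Agent strings, a small helper like this can separate decodable values from ordinary browser strings (an illustrative sketch, not the original notebook code):

```python
import base64
import binascii

def try_b64_decode(value):
    """Return the decoded text if value is valid Base64, else None."""
    try:
        decoded = base64.b64decode(value, validate=True)
        return decoded.decode("utf-8", errors="replace")
    except (binascii.Error, ValueError):
        return None
```

Normal browser user-agents contain spaces and punctuation that fail strict Base64 validation, so anything that decodes cleanly is worth a second look.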


None of the above queries is all that efficient, and the more tight-lipped the network, the more clarity they can provide. Nesting queries can help clean the results and can mean the difference between a threat hunter analyzing 100 results or thousands.

Wasting your time searching for ad-trackers?

I'm not aware of what can be done here short of our government stepping in to protect our privacy, and this hasn't borne me much fruit in a hunt. It has, however, found people accessing inappropriate content in the workplace, even while the organization had invested in a web proxy and endpoint software to prevent adult content. We could use this to validate the effectiveness of those automated content-blocking tools and web proxies. Ad-trackers give up a lot of information about the quality of the website being accessed, and you just might find this query bearing fruit in finding users browsing websites in "poor" taste for the workplace. I find the more deceptive the ad-tracker, usually the dirtier the website. Here's one of the most common ad-trackers I've seen recently.

SELECT *
FROM http
WHERE request.headers['GET'] LIKE "%beacon.krxd.net%"

Searching for plain text passwords floating around.

This one can be a bit noisy, so make sure to tighten it up after you scrub your first round of results with a few “not like” statements. We’ve found poor business applications with hardcoded passwords crossing the network boundary and floating around internally.

select src_ip.address
from http
where request.uri like "%password%"

Searching for plain-text Protocols:

We all promise plaintext protocols are not allowed on the network, but we always find them. How about we take a look at the types of FTP activity happening and the exact commands that were run? This is one place where network data beats logs for hunting: if you don't control the FTP server, do you think the FTP server is going to send you its logs? This type of hunting MUST be done with network data. Log data is a ho-hum source for hunting; maybe you have it, maybe you don't, and you never know which servers are actually logging. Sometimes the servers running are not yours at all, but a service a user threw up to get their job done quickly. That was the case in one of our most recent hunting exercises, which found a hastily stood-up FTP server on the internal network.

SELECT src_ip.address
FROM ftp

Maybe you are searching for anyone using those pesky Dell eDellRoot or Lenovo Superfish root certificates? This is just a dabble into the power of hunting based on TLS certificates, the cipher in use, and more. I've yet to find anything in a customer network related to weak ciphers or export-grade encryption, and that's a good sign. TLS parameters are easy to hunt for and you should do it. It's not always about what your certificates look like, but about the certificates of the sites your users interact with. This might be the case with encrypted malware and TLS-encrypted botnets using self-signed or misconfigured certificates. Hackers make mistakes and it's your job to catch their mistakes. They are doing a good job of catching ours.

select *
from tls
where subject like "%edell%"

The story goes on forever. Are you focused on the perimeter and want to see any connections that were established from external to internal? We remove RFC 1918 space in this query. As we graduate our knowledge in Spark, we begin to define variables and utilize functions, but for this article no variables are used; we simply code the customer's RFC 1918 private address ranges into the query.

SELECT src_ip.address, dst_ip.address, dst_port, conn_state
FROM flows
WHERE conn_state = "S1" AND dst_ip.address LIKE "172.%" AND src_ip.address NOT LIKE "172.%" AND src_ip.address NOT LIKE "192.168.%" AND month = month(current_timestamp())
GROUP BY src_ip.address, dst_ip.address, dst_port, conn_state
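Rather than hard-coding each customer's private prefixes into the query, the RFC 1918 membership test can be generalized. A minimal Python sketch using the standard library (an illustration, not how the Spark query above does it):

```python
import ipaddress

# The three RFC 1918 private ranges.
RFC1918 = [ipaddress.ip_network(n) for n in
           ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")]

def is_rfc1918(addr):
    """True if addr falls inside one of the RFC 1918 private ranges."""
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in RFC1918)
```

Note that a LIKE "172.%" filter also matches public 172.x space outside 172.16.0.0/12; an explicit range check like this avoids that overmatch.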

Still loving DNS and want to see your top 10 DNS queries? Your domain will likely be the top hit; go ahead and set it as a "not like" and keep paring down those statements for a personal fit. Remember, this is a write-once, run-many-times hunt. Investing the time to write good queries the first time results in more efficient and quicker hunting exercises in the future.

SELECT query, COUNT(query)
FROM dns
WHERE query != '' AND query NOT LIKE '%jask.com'
GROUP BY query
ORDER BY COUNT(query) DESC
LIMIT 10

Have any ugly buggers trying to perform DNS exfiltration? Try searching for DNS queries of unusual length. This is a pretty weak check, and almost every hit ends up being Spotify's long DNS queries for playlists.

SELECT query
FROM dns
WHERE LENGTH(query) >= 100 AND query NOT LIKE "%.er.spotify.com"
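The length threshold and allowlist above can be sketched outside SQL too (illustrative Python; the Spotify suffix mirrors the query, everything else is an assumption):

```python
def suspicious_dns_query(query, min_len=100, allowlist=(".er.spotify.com",)):
    """Flag unusually long DNS queries that aren't on a known-good allowlist,
    a crude first pass at spotting DNS exfiltration."""
    return len(query) >= min_len and not query.endswith(allowlist)
```

As you collect more benign long-name sources (CDNs, telemetry beacons), they can simply be appended to the allowlist tuple.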

Weak Kerberos Ciphers?

RC4-HMAC and DES are seen from Windows XP up through Windows Server 2003. Most environments should be moving away from them for obvious weak-cipher reasons. This query is great for validating that strong ciphers are used throughout an environment and for calculating the risk associated with where these weak ciphers occur in your network.

SELECT *
FROM kerberos
WHERE cipher LIKE "%rc4%" OR cipher LIKE "%des%"

Finally, let us not forget the world of executables. Think of the hundreds of thousands of dollars spent on full packet capture devices for the sole business purpose of extracting executables. Save yourself:

select src_ip.address, dst_ip.address, hash.sha256, mime_type
from file
group by src_ip.address, dst_ip.address, hash.sha256, mime_type
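Once file events carry sha256 values, the same digests can be recomputed from any carved sample for threat-intel lookups. A minimal stdlib sketch (the function name is mine, not from the notebook):

```python
import hashlib

def sha256_of(path, chunk_size=65536):
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read in chunks so large executables don't need to fit in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Matching these digests against the hash.sha256 column closes the loop between network capture and on-disk samples.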

That's a small sample of the hundreds of queries, paragraphs, and notebooks we've built at JASK for our customers to jump right into hunting in big data. We prefer to organize these queries into focused notebooks, such as DNS security, HTTP, and TLS notebooks, and run them at the notebook level rather than the paragraph level, adding tremendous value and efficiency to a threat analytics program.

What to do with the results and wrapping up the Hunting Exercise.

Results are nothing if you can't wrap them into the business process. When the hunting exercise is complete, take your query and turn it into signal intelligence to drive artificial intelligence. JASK has a rule engine for exactly this purpose: teach JASK a new skill and the AI becomes smarter. No security detection technology will catch everything, but when humans, customers, data science, and the security community are able to continually improve detection through hunting exercises and close the loop, we are one step closer to defending the business and turning hunting exercises into a repeatable process.

Happy Hunting!



Telling the Security Story

Data analytics and machine learning can be very empowering for security, but don’t lose sight of your true goal when using them.

Whether you work as an IT auditor, a security investigator, or a threat analyst, there is a common need: you have to "tell the story" of a risk, incident, or threat to make change happen. The story must have impact to motivate action, but how many security practitioners feel they do that consistently? Is it the tools, the training, or both? There is a shared responsibility in telling the story.

Telling the security story is no different from telling any other story. People must be able to follow the order of events in the narrative, and since this is not fantasy, the plot, characters, and detail must remain credible at each step. Beyond the teller's own credibility, the audience has to take away some insight of their own; if not, the writing is stereo instructions, not a story.

Data analytics and machine learning give people powerful tools to help tell these stories. They can make the story more concise, provide unexpected plot points to include, and definitely increase the amount of insight the audience takes away. However, it is far better to use them to enhance the story than to rely on them alone to motivate the audience.

There are other useful practices that help you tell the security story well. First, build the story out of small, objective observations that can be tied together through the narrative. Next, treat it like a conversation, not a one-directional sales pitch, allowing time to solicit questions and input. Don't get lost in background detail at the beginning; make sure each point of detail correlates to another pivotal point in the story. Lastly, don't skip to the end: take the audience along for the journey or they won't be there when they are needed.

How do machine learning and data analytics make the difference? They allow the focus to move from hundreds of lines of data to just a few, and machine learning's output gives an objective voice to the data. Above all, having a centralized theme, with occasional pivots into deeper detail (raw logs, for example), is a powerful way to tell the story successfully.

At JASK, we have created a product that uses the power of machine learning and data analytics to tell the best story in the cyber security world. As the Chief Data Scientist and Director of Products, my experience and knowledge get translated into tools for security practitioners worldwide.

Threat Hunting with your hands tied - This is Big Data Part I

The Stage:

When walking into a fine china shop, you can look, but do not touch! This concept applies in a customer proof of concept: you can't influence the infrastructure or applications, and you can't review the website or encourage an application to disclose its version or variables to expose its vulnerabilities. It's the mother of all challenges, one I live with every day at JASK. Welcome to threat hunting with big data science, where the rules are clear: DON'T TOUCH.

The Measurement:

In the world of AI-driven cybersecurity, it takes time for the technology to learn the network and listen for threat signals reaching a noise level worthy of human interaction. Just as a large city like San Francisco would not rank its safety by the number of tickets issued, AI-driven cybersecurity cannot treat the number of alerts generated as a success metric. The number of events generated is not a measure of health any efficiently running SOC should accept. So while the AI takes the time it needs to learn the network, what is left for the SOC personnel and sales engineer to do?

Thankfully, that's big data: the gold we are panning for sits within, and the coal that keeps the fire burning is continually produced. In a Hadoop- and Spark-backed platform, the questions come as fast and fluid as the answers. It's threat hunting with your hands tied: big data science meets signals intelligence with network data. The underpinnings of Spark and Hadoop build a base for an AI-driven platform and a big data hunting ground. The data is exposed through Zeppelin notebooks, making it the perfect playground for threat hunting, and this is the moment my job gets interesting. The blinders come off and we press 'Play' on the notebooks.

The Goal:

"Everyone is compromised," right? That is what has been preached for more than a decade and what we are still told today. With this mindset, you would expect that in a POC you would find something bad, something compelling enough to drive a purchase. Unfortunately for my bank account, the reality is that while everyone is compromised (and it's relatively easy to locate a compromise), the real question is how large an impact it will have on the organization. The "everyone is compromised" statement mostly reduces to trackers and adware 99.99% (four-nines) of the time, and the monetization of employees via adware isn't something a CISO prioritizes as a high risk to the business. You have to dig deeper to make the payday, and that's when the real hunting begins. I would modify the phrase to "Everyone is critically compromised at some point in time." The job of threat hunting isn't just to detect a threat, but to analyze and predict the threats the company classifies as high risk.

The Hunt:

Threat hunting is about letting the network tell us where to look. When looking at network data, we see DNS authoritative answers for non-authoritative domains and top DNS queries for non-internal assets. We verify strong TLS ciphers are being used throughout the enterprise, drill down with a focus on web server response codes, request headers, response headers, suspicious user-agents on internal assets, and analyze the network data for how a business' customers interact with the websites and applications both internal and external. Do we see fast flux domains? Do we see rapid queries? Do we see suspicious executables (those hidden within zip files) or file transfer methods? Do we see an excess of SMB, RDP, or authentication protocol traffic? The questions we are able to ask Big Data are limitless and the "Big Data Lips Don't Lie".

The Discoveries:

These big data queries perform the predictive function of showing how a customer's internal, external, good, and bad users interact with the business. We don't have the ability to touch an internal asset or application and influence the results; however, every business has customers. Whether employee or external user, these people are, hands-on, performing the pentest. You may hire a "professional" pentest once or twice a year, but the reality is we can never predict with 100% certainty how customers will interact with the applications. Where is the company accidentally exposing itself, and how do you determine how at risk your company is?

Part II: The Results (Coming Soon)