Love The Vendor That Loves You Back

The sales machine is a complex beast, and many people misinterpret who a good sales team is ultimately meant to serve. When salespeople want you as a customer, their goal is to bring you into the fold and partner with your business. In the very competitive career world of sales, these guys and gals find more satisfaction than you may think in helping people and watching every business they touch become more efficient (this is what drives the money and bragging rights they often use as a measure of who is the best). It's also easy to argue that salespeople are the busiest people you will ever meet - look at all of the emails, phone calls, and surprise visits they make to your office. If there's anything I can attest to throughout my career as both a customer of a sales team and an engineer supporting one, it is the sheer fortitude, determination, and dogged pursuit that a good salesperson has.

Allow me to show a different perspective on sales interactions and help you realize all there is to gain by learning to manage your vendors - specifically your salesperson. In the cyber security startup ecosystem, you will experience the most driven sales and engineering teams in existence as they fight for your business and their companies’ success. This dedication is begging to be leveraged by prospects and customers who oftentimes fail to recognize the opportunity, and perhaps choose to belittle the interaction.

An Analyst has a number of tedious tasks and at least as many vendors who claim to assist with them. Sticking to the most relevant task, I could count 10-20 security startups who claim a “reduction in alert fatigue.” This begs the question: can they actually make a difference? Say you decide to engage with a vendor of this nature. Maybe you think they are selling dreams, or worse yet, snake oil. Then you decide to take a combative or defensive approach, either ignoring them entirely or - in the other extreme - setting a course to disprove them.

Let's pause here a moment and consider that perhaps you are going about this all wrong.

What if you were to change the approach from radio silence or discrediting them bit by bit, and instead turn the conversation into a job interview? Rather than setting the conversation up for failure in binary fashion with finite outcomes, why not try more open-ended questions?

Here is a list of five questions I recommend you ask your sales contact to drive a more productive outcome:

  • How many alerts can your product review for me in a day?
  • What types of attacks can your system detect?
  • How long will it take to tune your product to achieve the advertised results?
  • How strongly do you stand behind your claims?
  • Are you willing to put your team on your tool to tune and review alerts until your product lives up to its claims?

This last question may be the most important. It is a call to action for the vendor to put their money where their mouth is. If you've partnered with a solid vendor, they should be willing to dedicate resources beyond their technology to ensure the value they claim is realized and your problems are solved. They must be willing to invest in the relationship.

If you continue to move forward after asking these questions, you've not only expanded your capabilities in terms of technology, but you've added virtual headcount to your team at no additional cost. Believe it or not, there is also benefit to the vendor beyond the sale price. A willingness to engage at this deeper level is a tremendous learning resource for any company. Particularly when it comes to hot new technologies like AI, improvement can't happen in a vacuum, and there is no silver bullet that can solve all of today's and tomorrow's problems. It's important to invest in a team with an interest in your success, and whose success you're invested in, in return. Solving the immediate problem of today often has the side effect of shining the spotlight on the new problem of tomorrow, meaning you need a vendor committed to the long game. Your sales guy may not be your quarterback in this endeavor, but they could at least be an all-star receiver.

The moral of the story - don't be so quick to dismiss that cold call or to rip apart a new technology. If you take an objective approach and interview for the good salesperson who is OK with being held accountable to problem solving and results, you may find yourself in the company of people who are enjoyable to work with, willing to invest in you and your business, and able to give you a lot more than the face value of your new technology purchase.

Visit and get a feel for what I’m talking about today.

Defense in depth: The Equation Group Leak and DoublePulsar.

by: Rod Soto and Daniel Scarberry

Unless you've been living under a rock, you are probably familiar with the recent Shadow Brokers data dump of the Equation Group tools. That release included a precision SMB backdoor called DoublePulsar. This backdoor is implanted by exploiting the recently patched Windows vulnerability CVE-2017-0143.

For detection, we are going to first focus on the backdoor portion of the implant, hunting for traces left behind on the network. JASK customers have access to 90 days of full network meta-data to reach back in time and historically analyze or hunt for the first entry point. This allows our customers to quickly determine if they were one of the unlucky ones to be compromised by the newly leaked exploit and implant.

So let's get straight to it. DoublePulsar is an SMB-injected backdoor, which means it is time to focus on the SMB protocol. First of all, you should not have SMB open to the public internet! Why people still do this is beyond us… That being said, SMB is a great protocol for threat hunting, from the SMB attack used in the Sony hack in 2014 to some of the older worms. It's one of those protocols that just seems ripe for behavioral-based detection. Baselining the number of requests, who's making the requests, and what SMB commands are being used seems like a good place for anomaly detection and searching for suspicious behaviors.

To kick things off, let’s replay the attack through a JASK sensor and look at the SMB commands field in Trident Investigation:

Based on SMB commands, two fields stand out for analysis:

  • Commands.status
  • Command.sub_command

Feeding a DoublePulsar ping to a clean, non-exploited, non-infected host results in the following SMB meta-data:

  • Commands.status="NOT_IMPLEMENTED"
  • Command.sub_command="SESSION_SETUP"

As seen below in notebooks:

In a Wireshark analysis, this clean value shows up as a Multiplex ID of 0x41.

Feeding a DoublePulsar ping to a dirty (exploited, infected) host results in the following SMB meta-data:

  • Commands.status="NOT_IMPLEMENTED"
  • Command.sub_command="null"

As seen below:

It seems the command.sub_command ends up null on an exploited machine. In Wireshark, this infected host responds with a Multiplex ID of 0x65.
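Based on that distinction, a hunt paragraph shaped roughly like the one below could flag hosts answering a DoublePulsar ping. This is a hedged sketch only: the smb table name and exact field paths are assumptions based on the meta-data shown above, so adjust them to your own schema.

SELECT src_ip.address, dst_ip.address, commands.status, command.sub_command
FROM smb
WHERE commands.status = "NOT_IMPLEMENTED"
  AND (command.sub_command IS NULL OR command.sub_command = "null")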

One more interesting piece of data is that a first-time exploited host ends up with a Multiplex ID of 0x52.

If you follow the exploitation in sequence, you'll see that an initial DoublePulsar ping is sent to check whether the host has already been compromised (showing the previously discussed 0x41 for a non-exploited host), meaning it's clear to run the exploit against.

Continuing our analysis, we noticed that during the CVE-2017-0143 "EternalBlue" exploitation phase we also see an SMB ECHO command. Checking the frequency of SMB commands, we observed that SMB ECHO is a rarely used command, making it perfect for behavioral-based detection of rarely seen SMB commands, or even a "first time seen" type of anomaly.
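As a sketch of that frequency check (again, the smb table and the command field name are assumptions rather than exact schema), a simple stacking query does the job:

SELECT commands.command, COUNT(*) AS hits
FROM smb
GROUP BY commands.command
ORDER BY hits ASC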

Expanding from that analysis, we realize there's an entire set of SMB commands that are deprecated or unused and should be treated as suspicious behavior within the SMB protocol. For that, some behind-the-scenes SMB protocol reading needs to be done:

After some light protocol review and research, we ultimately identified a set of SMB commands that are rarely used. By flagging these commands, we create an anomalous "Signal" in JASK Trident anytime one of these rarely used SMB commands is seen from a particular host for the first time. The goal with that level of signal production is to further protect users and notify threat hunters of any zero-day attacks using rarely seen "NOT IMPLEMENTED" commands. We will feed these signals to our secondary supervised learning model to ultimately end up with high-confidence alerts.

The Shadow Brokers leak of NSA tools is already being ported into exploit kits and frameworks for use in malicious campaigns. These exploit kits enable malicious actors, including those of a lesser technical level, to target and compromise victims more effectively, finding vulnerable hosts with public mass-scan tools. There is no hiding!

While the goal of this research is purely technical and defensive, it's a good example of the work we often do to help our customers build an agile defense against emerging vulnerabilities and exploits. We are moving quickly to historically check whether our customers were compromised, and to push new algorithms that protect them not only from this Shadow Brokers release, but from further SMB attacks as the world continues to move forward as it always does.

Threat Hunting Part 3: Going Hunting with Machine Learning

Due to being busy with proof of concepts at the end of the quarter, I’ve been on the prowl for lazy hunting ideas. Every security person’s dream is to have interesting data come to them, but is this possible? Apache Spark's MLlib seemed like a good place to start the hunt.

I wanted to leverage Apache Spark's MLlib combined with inbound and outbound data to bubble up anomalous traffic talking to suspicious countries.* At JASK we leverage machine learning to produce behavioral-based signals, and sometimes I decide to go hunting based on our raw data stacking instead of the output from something like Spark + MLlib. These notebooks are a great way of pitting one behavioral-based model against another approach, checking false positive rates, and figuring out if your model is ready for prime time.

*Disclaimer - I'm a Security Engineer, not one of the ML experts on the team, but I know where to find help!

Here’s how to do it:

Step 1: Import Apache’s Machine Learning Libraries with KMeans
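If you want to follow along, the imports look something like this (the RDD-based MLlib API, which is what was available at the time):

import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
import org.apache.spark.mllib.linalg.Vectors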

Step 2: Formulate what data you want to run KMeans against.

Here I’m querying the sum of the outbound bytes.
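A sketch of that paragraph is below; the flows table appears elsewhere in our notebooks, but the out_bytes column name here is an assumption, so adjust it to your own schema:

val num_en = sqlContext.sql("""
  SELECT CAST(SUM(out_bytes) AS DOUBLE) AS total_out
  FROM flows
  GROUP BY src_ip.address
""")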

Step 3: Convert the dataset to type RDD[Array[Double]].

The resulting query above needs to be of type double for KMeans, so we map it to double. The reason for this is that num_en is a SchemaRDD. When you collect() on it, you get an Array[org.apache.spark.sql.Row]. Thus, num_en.collect()(0) gives you the first Row of the Array. The technical reason behind this is that a dense vector is backed by a double array representing its entry values, while a sparse vector is backed by two parallel arrays: indices and values. For example, a vector (1.0, 0.0, 3.0) can be represented in dense format as [1.0, 0.0, 3.0] or in sparse format as (3, [0, 2], [1.0, 3.0]), where 3 is the size of the vector.
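A minimal sketch of that conversion, turning each Row into a one-dimensional dense vector for KMeans:

// On a DataFrame (Spark 1.3+) call .rdd first; an older SchemaRDD can be mapped directly.
val parsedData = num_en.rdd
  .map(row => Vectors.dense(row.getDouble(0)))
  .cache()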

Step 4: Define the number of classes.

Step 5: Define the number of iterations.

Step 6: Evaluate clustering by computing the Within Set Sum of Squared Errors (WSSSE).

This is where we evaluate the model and determine if it accurately represents the data.  I’m merely implementing this to get some anomalous traffic back as a test, and I am not a KMeans expert.

The key is to minimize the Euclidean distance among the points within each group. The quadratic error is the Within Set Sum of Squared Errors (WSSSE).
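Pulling steps 4 through 6 together, the training and evaluation boil down to a few lines; the cluster count and iteration count here are just starting values to tune, not recommendations:

val numClusters = 2        // Step 4: number of classes (k)
val numIterations = 20     // Step 5: number of iterations
val clusters = KMeans.train(parsedData, numClusters, numIterations)

// Step 6: Within Set Sum of Squared Errors - lower generally means tighter clusters
val WSSSE = clusters.computeCost(parsedData)
println(s"Within Set Sum of Squared Errors = $WSSSE")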

Step 7: Show the results.
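For example, printing the cluster centers and a small sample of points with their assigned cluster:

clusters.clusterCenters.zipWithIndex.foreach { case (center, i) =>
  println(s"Cluster $i center: $center")
}
parsedData.map(v => (clusters.predict(v), v)).take(10).foreach(println)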

Step 8: Save and load the model.
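Saving and reloading the model is one line each (the HDFS path here is only an example):

clusters.save(sc, "hdfs:///models/kmeans-outbound-bytes")
val sameModel = KMeansModel.load(sc, "hdfs:///models/kmeans-outbound-bytes")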

Step 9: Have your coworker look over your shoulder and tell you that Spark 2.0 has deprecated the RDD-based MLlib API!

BAAAAH!!! Defeated…except I wrote and implemented this a couple of months ago and I'm just now getting to the point of writing the blog post around it (better late than never!). The above still works. As of Spark 2.0, the RDD-based MLlib API is in maintenance mode in favor of the DataFrame-based spark.ml API, but it is still available for use; essentially, no further feature development is planned for MLlib. I'm not gonna let a good notebook go to waste due to an ML library being deprecated. It may be true that they are moving on from MLlib, but it doesn't change the fact that hunting based on anomalies can be the place to start, and K-Means is a helpful way to start that investigation.

Since Apache’s MLlib sounds like it’s going to be deprecated in the near future, I’m going to shift our goal of querying anomalies from MLlib to leveraging the machine learning anomaly detection within JASK Trident. This way I’ll know that by the time our user community reads this blog post, the underlying detection method will still be around.

To kick off this series of notebook paragraphs, I have a "signal" table. If you aren't a Trident customer, you'll have to utilize your own anomaly detection, such as the example implementation of Apache MLlib shown above. If you are a Trident customer, you are in luck, because we expose our anomaly detection to users via the signals table.

Step 1: Here we are querying the Trident signal table for anomaly detections. I'm storing the results of the query into an array that I will later use to automate the analysis.
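The paragraph looks roughly like the sketch below. The signal table is Trident-specific, and the column name and filter used here are assumptions, so treat this as a shape rather than a copy-paste query:

val anomalyRows = sqlContext.sql("""
  SELECT ip.address
  FROM signal
  WHERE name LIKE "%anomaly%"
""").collect()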

The return type of the above query is a collection of arrays. In order to work with just the ip.address and feed a single IP at a time into our secondary analysis, we need to grab each of the ip.addresses from the result.

Step 2: To solve the array-of-arrays problem, we map the first item of each array into an Array of strings with the following easy paragraph.
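That mapping is a one-liner (continuing from the anomalyRows sketch above):

val ipAddresses: Array[String] = anomalyRows.map(_.getString(0))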

Below you can see our result is now Array[String]. We are getting closer.

Step 3: Now that I have a clean list of suspicious IP addresses, I can throw them through any type of secondary analysis and find out what these IPs were talking about. What destination ports were they talking over? What websites were they sending/receiving data from? Heck, now that I have a suspicious list of IPs, I could perform an auto-analysis of them and write each one out to my firewall as a block rule. All of those questions and actions can be thrown into a paragraph to understand WHY my anomalies were being generated.
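As a hedged sketch of one such follow-up paragraph (field names follow the conventions used elsewhere in our notebooks, but verify them against your own schema), here is a loop asking what each anomalous IP requested over port 80:

ipAddresses.foreach { ip =>
  sqlContext.sql(s"""
    SELECT dst_ip.address, request.uri
    FROM http
    WHERE src_ip.address = '$ip'
      AND dst_port = 80
  """).show()
}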

Here's a quick look at the countries involved in my anomalies:

My first question will be simple: How much traffic was being sent or received between the two hosts?

My second question will be based on the result that a set of anomalies were around destination port 80. What website URLs were being requested?

What I found from the above query was no results for one of the hosts!!

That led me to ask more about how many bytes were transferred:

Other paragraphs could ask similar questions based on the type of anomaly that was generated. What country was the traffic destined for? What HOST was requested? Was this related to data exfiltration? What were the DNS queries? What were the DNS answers? I now have a notebook for auto-analysis to determine whether all those anomalies on your network were real threats or just users binge-watching World Cup skiing (which happens to be what I am currently watching). Once you are comfortable with your auto-analysis, you are free to export it to a file or write it to HDFS and use it elsewhere, closing the feedback loop back inside the product for any future detections.



On the Hunt Part 2: Identifying Spear-Phishing Recon Activity - Collection of User Details with Ads for Spear Phishing Campaigns

A few weeks ago, I published a Base64 decoding article. The findings from it ranged from process ID numbers and application/version detection to the blatant collection of email addresses. With that in mind, today I'm going to focus on ads. Not adware, not malvertising, just ads. Ads are a massive security hole in our networks and the invasive species of our personal lives. I'm focusing on the operational efficacy ads provide during the Reconnaissance phase of a strong spear-phishing attack - inspired by the Grizzly Steppe news.

I see targeted advertising happening day in and day out, but how much personally identifiable information is being collected about users? How much of the internet has become consumed by ads? How can we tell heads from tails, or good ads from bad? I'm casting my net wide to catch what spear-phishing attackers might later use as shrimp (bait). With my morning coffee in hand and a comfy seat on the office couch, here is the notebook we are going to start with.

Base64 Decoding Emails

This is a slightly modified version of the Base64 Decoding blog post from last week, with a focus on decoded strings that have emails in them (by adding a simple search for the "@" symbol in the decoded URI string).
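A sketch of that modification, reusing the getBase64 and decodeBase64 UDFs described in that post (the exact filter is mine, not a copy of the original paragraph):

SELECT src_ip.address, dst_ip.address, decodeBase64(getBase64(request.uri)) AS decoded
FROM http
WHERE decodeBase64(getBase64(request.uri)) LIKE "%@%"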

I would love to show you the results, but in the interest of protecting users' emails, I'm going to ask you to trust me: there are hits. I'm not saying buckets of emails are leaving the network, but a handful of the most clearly identifiable pieces of information - full user email addresses - are being collected by marketing and ad agencies around the world (KR, SG, US).

What information can you glean from an email? I'm easily able to identify where all of these people work based on the email domain. With the Gmail, Yahoo, and other webmail addresses, I'm also able to identify login portals to imitate when I aim my exploit and go spear phishing. With emails in hand, we've proven the first step: emails are being collected by ads. Remember, this is being done at the ad agency level to supply their customers with perfectly targeted ads and to learn as much about the customer as possible. Only good things come from targeted ads :)

Based on the data being collected by the ad agency, I as an attacker/customer can request the ad agency target everyone that works at Company X with my exploit. Job complete - why am I even writing this post? Why do attackers even waste time with recon? Ad agencies do real-time human tracking as a core competency and business. They track and categorize people across mobile, TV, and PC. They are really good at what they do.

Maybe I'm being paranoid. I need more facts to back up my previous thoughts about innocent ad companies tracking everyone. Maybe users are traversing the dark web or risky websites that would host malicious ads and I'm making assumptions. Maybe it's one country targeting users, like Russia? How hard is it to gain attribution? Let's go to our data:

Top 10 Non-US Traffic Destinations:

You can clearly see a large portion of traffic is going to KR (South Korea), and you might praise yourself: "Aha, the South Koreans are after me! It really was my users going to risky websites." Don't - ask the data, please. Let's pivot to the notebook for suspicious countries and analyze HTTP requests. For this we are going to leverage our URL parsing UDF and parse out the domain name for quick and easy viewing of what all this KR traffic is about.

What is the traffic to South Korea?

Traffic to South Korea Query:
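The original query screenshot isn't reproduced here, but its shape is roughly the sketch below; the URL-parsing UDF name and the country field are assumptions, so adjust them to your own deployment:

SELECT parseDomain(request.uri) AS domain, COUNT(*) AS hits
FROM http
WHERE dst_ip.country = "KR"
GROUP BY parseDomain(request.uri)
ORDER BY hits DESC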

100% of my South Korea traffic is going to Yahoo.

The data shows users weren't going to risky websites when they went to South Korea (they went to Yahoo). What the data supports is that the internet is a global service and ads are hosted from around the world. Putting this together, you start to see why many of the IOCs in the Grizzly Steppe release are spread all over the world and sit on trusted sources. It's impossible to gain attribution to an attack source based wholly on the GEO source or destination of network traffic. Notice Russia is nowhere to be seen, and I actually haven't seen as much Russian traffic lately as I had in previous years.

As an attacker, I could leverage an ad or marketing agency to pinpoint exactly who I wanted to target. The ad is not malware, it's not malicious (though I'll debate whether an ad tracking me from my phone, to my PC, to my TV is malicious), and unfortunately it's not illegal. It's highly efficient reconnaissance, and attackers will take advantage of this service.

Why port scan, URL crawl, or use Recon-ng when I can pay an ad network to supply me with everything I need? An attacker can sit quietly on the sidelines prepping his exploit and hiring out the recon to ad agencies. This makes the discovery of any Reconnaissance phase difficult; by leveraging ad agencies, an attacker can now jump straight to the Delivery phase of the Kill Chain.

Expanding on the idea of a global internet with ads hosted around the world: what is the main source and subject of international traffic? What is Yahoo delivering to me from South Korea? I live in San Francisco, so why does so much of my traffic get processed and delivered by another country? What is coming from these international locations? To help answer these questions:

Top 10 Non-US HTTP Domains

Ads, ads, ads, and more ads. Smartadserver, lijit, stickyadstv, adsrvr, bluekai, google-analytics - all of it ads. This new internet is depressing me. It seems the majority of international traffic is ad networks.

In business, these collected emails, user IDs, and application version detections are used to display "relevant" ads for things I'm never going to buy (but thanks for your effort). In an attack scenario, that same data will be used to determine what sites Daniel visits regularly, where he has other accounts, and general awareness of his lifestyle. With that information in hand, my spear-phishing campaign begins to look more like spear fishing in a stocked pond.

Let me step back to Grizzly Steppe and shared hosting. Many of the IOCs in the Grizzly Steppe report were on domains like Yahoo, BlueOcean, and other multi-tenancy platforms. Does our data tell us a story about this?

Top 10 Destination Organizations:

The Top 10 Destination Organizations are ALL major platforms for advertising, lead generation, and marketing. AppNexus is hosting adnxs, Google is googlesyndication, Amazon is hosting springserve, Akamai is fronting taboola. Thinking about it, this is not a surprise. Every webpage has a dozen ads, so the good, clean internet traffic gets washed out by the ad traffic. Come to think of it, I could ask my data for the average number of connections per webpage request - maybe that's a good indicator of risky websites? Ah, I'm an idea machine that never stops producing, and I'm going to work on that, but at some point this blog post must end, because it's Friday and my coffee is now cold, and not in a good cold-brew type of way.

What is possible with targeted ads?

Outside of Grizzly Steppe and the DNC attack, let's bring this closer to home with a real-world example. If you work at any publicly traded company, it's predictable that employees will visit financial sites to check the company's stock ticker and see how the shares are doing. Sounds reasonable.

I'm going to hire the ad agency to target my ad with a few criteria. Target users that currently have a vulnerable version of an application running (detected by the ad agency's PID and application version detection). Target only users that work at the target organization - the ad agency knows who I work for, because it has tracked the websites I've visited for months and years as it hosts ads on most of the internet (think Facebook ads), and also based on the src_ip.address GEO organization. Further pinpoint this ad directly to a specific user's email address, captured through a previous ad campaign. Now tie all of these meta-data pieces together and fire off my ad. I could also reverse the ad campaign for a single targeted email and request the email of the Sales VP who works at JASK. The ad agency has a collection of emails at the target organization and has auto-enhanced these emails with each person's title and importance within the organization from a site such as LinkedIn.

How about an email that reads similar to this one? "This month's ESPP paperwork needs to be electronically signed. Please log in to the link provided or open the attachment, sign, and respond in order to approve this quarter's shares. This must be completed by Friday as we did not receive your response to our previous email."

We target this email at users within the target organization with a known application weakness (from the ad agency's collection of running processes). Sip some coffee… get a little anxious… sip some more coffee… profit.

-Poor internet

**My apologies for the wordy blog post this week. My mind was running and my fast fingers wouldn’t stop typing.

From Targeted Attack to Rapid Detection


Yesterday I was hit with a targeted phishing email that was incredibly good. The email was terse and gave a 7-hour window within which I needed to open the attachment and verify the invoice. The attachment was named after me and even came from a valid business domain. Simple yet effective, and no broken English. It looked good, minus one thing… nobody ever wins free money, and if you want me to send you money, I'm sure you will call me rather than password-protect my invoice.

I've been extremely cautious with email since the recent Gmail phishing data-URI technique, so this email landed while I was alert. What to do? Turn this email into a handful of signals and feature vectors for JASK. Let's get to it.

Step 1: I searched the web for a match of the file hash and nothing came up. Not surprising. Still, evidence is evidence, and I put this into JASK as a piece of threat intel.

Step 2: Using oledump, I checked out what might be inside.

No macros were showing up in the file. Maybe that's because it's password protected? I'm no Word file or oledump expert, and my goal is to quickly transform this into actionable intelligence. I tried unzipping as well and received what looks like a corrupted file message:


Since EncryptedPackage seemed like an interesting string in the file, I decided to start focusing on detecting encrypted Word Doc files in JASK with a Yara signature.

I settled on the EncryptionTransform piece of the file and pulled out its hex equivalent:


My goal would be to detect Word files with EncryptionTransform in them. Maybe I’ll get a lot of false positives, but I’ll let JASK handle the decision-making process.

Step 3: Write the Yara signature for detecting encrypted Word files:

I originally predicted this file would have a malicious macro in it, and I wanted to find Word files with macros in them. A Yara signature for this was already floating around on the web and can be read about in the link provided, which saved me some work on writing that signature.

Step 4: One last piece of evidence was the SMTP "from" header. I figured why not - the more evidence I can pile into JASK the better. First, I would prototype something in notebooks and see what my SMTP meta-data fields looked like, searching the headers.mailfrom field for my phishing attempt's sender. To protect the compromised business and users' email, and to prevent more spam or targeted phishing attempts, I've replaced some IPs and email addresses with my own for this screenshot.
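The prototype paragraph was roughly shaped like the sketch below; the smtp table name is an assumption based on how other protocol tables are named, and the sender address is only a placeholder:

SELECT src_ip.address, dst_ip.address, headers.mailfrom
FROM smtp
WHERE headers.mailfrom LIKE "%sender-address-goes-here%"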

The results of this modified query searching for the from address show that six other hosts on our network received a number of emails from this specific sender. A possible sign of emails that hit the spam filter or other security devices - and maybe just my slick, well-crafted email made it through.

Once I have my query completed, I can quickly turn this headers.from address into a pattern and give it an initial weight and kill-chain attribute for JASK to use.

I'm done. I've added a handful of feature vectors for identifying this phishing attempt. I cast my net wide to create signals matching macros and encrypted Word documents, plus some specific signals to match the file hashes and the sender related to this specific attack.

Lesson of the day? Don’t sleep on intelligence. If your users are going to get phished, you need to rapidly turn as many features of that attempt into actionable intelligence to have an early warning next time.

On the Hunt - Threat Hunting with Base64 Decoder

Every now and again you hit a day where you just feel like scrolling. One of those lazy, rainy days just before the holidays. Today is one of those days and that's where my less efficient threat hunting ideas come from. Today I'm playing with extracting Base64 strings from HTTP URI's, HTTP Cookies, and just about anywhere I can find Base64 strings in a network feed. Let’s get to it!

The first thing we need is to write a Base64 extraction function; I need some coffee this morning and one massive brain push for this trick. The goal is to search for any strings that look like they could be Base64. Accept this regex as elementary and not the "best of the best" for Base64 string detection; it's our quick start to prove a hypothesis that something is hiding in our network via Base64.


Breaking our Regex down


This is the lookback for an equal sign that represents the start of a Base64 string. The reason it's a lookback is that the decodeBase64 function needs a 4-byte string, and the = sign doesn't need to be extracted as part of the full string.


This is matching any sequence of letters and numbers occurring any number of times. This is likely where the most improvement can be made in my regex. Maybe on a sunnier day.


This matches two equal signs to show the end of a Base64 string.

Now that we have a regex, let’s test it and find lots of great matches for Base64 encoded strings.

Part One of our task is complete. We've built a Base64 string detector. Apply this function to a network data stream and now we are matching and displaying HTTP URIs with Base64 inside of them.

Part Two is extracting these Base64 strings. It's one thing to simply find them; the real trick for me was extracting only the Base64 string within the URI. The flexibility of JASK is perfect for this task, and we can utilize Spark to write an extraction function. Let's get to it! We define the variable pattern as our previous regex and build a function to extract the Base64 string that matches this pattern.
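A minimal sketch of that paragraph is below. The exact regex from the screenshots isn't reproduced here, so a generic Base64 candidate pattern stands in for it:

// Generic Base64 candidate pattern (a stand-in for the regex discussed above)
val pattern = """[A-Za-z0-9+/]{8,}={0,2}""".r

// Pull the first Base64-looking substring out of a URI (empty string if none)
def getBase64(uri: String): String =
  pattern.findFirstIn(uri).getOrElse("")

// Register it for use in spark-sql paragraphs
sqlContext.udf.register("getBase64", getBase64 _)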

Now we are cooking with bacon! We have our getBase64 function for extracting Base64 strings registered as a UDF to use anywhere in our notebooks. Now we need a Base64 decoding function. I'm lazy today and it's raining, so let me see if there isn't already a function for this. Got it! I'm going to import scalaj.http.Base64 and call it my lucky day. Remember, we are being lazy hunters today - time to register this as a UDF.

Job done! Now I can call my getBase64 extraction function first and feed the results to our decodeBase64 function, and it will return the Base64-decoded string. That's it! Now let's do this at MASSIVE SCALE!
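Chained together, that ends up looking something like the sketch below (using java.util.Base64 here as a stand-in for the scalaj import mentioned above):

import java.util.Base64
import scala.util.Try

// Decode a Base64 string, returning an empty string for anything that won't decode
def decodeBase64(s: String): String =
  Try(new String(Base64.getDecoder.decode(s), "UTF-8")).getOrElse("")

sqlContext.udf.register("decodeBase64", decodeBase64 _)

// Apply both UDFs across the HTTP URI data
sqlContext.sql("""
  SELECT src_ip.address, decodeBase64(getBase64(request.uri)) AS decoded
  FROM http
  WHERE getBase64(request.uri) != ""
""").show()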

The results are fun. We've found process tracking, device fingerprinting, and plenty of ads pulling the email addresses of logged-in users - an interesting (disgusting) way of user-ID tracking. I also applied our function to the HTTP Cookie data and found a different set of fun findings, more interesting than you would expect, but I'm going to keep that between JASK and our affected customer.

Here’s a quick screenshot of raw results from the last day:

Rainy-day threat hunting: testing a hypothesis and having fun. Lots of scrolling through results, which is exactly what I felt like doing on this lazy, rainy day in San Francisco. I've committed this Base64 decoding notebook to our JASK clusters for customers to take advantage of, so please come join us! I also converted some of what we've found into signals for our AI to learn from. The holidays are almost over and I'm ready to go back to work!


From Big Data to Beautiful Data: Bridging the gap from Threat Hunter to C-Suite graphs with Zeppelin notebooks and D3

In my previous posts we worked through a number of threat hunting queries and data mining ideas. We left off with how to demonstrate and translate value to the C-Suite, which has led me into the realm of presenting data in beautiful ways. At JASK, customers access big data with Zeppelin notebooks, but Zeppelin begs for better implementations of beautiful data, providing only a small number of graphing types. A pie chart and a bar chart are not going to cut the mustard when demonstrating value up the chain. Cue D3 and its infinite flexibility in displaying beautiful data.

Working on the cluster from one of our research sensors at a very large tech university, we've written a function to parse Top Level Domains (the .com, .org, .net portion of a URL). Using the function, we query our data for the TLD and search for suspicious TLDs in HTTP request headers. Here is the code where we apply our TLD UDF (Spark) definition to the dataset.
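A rough sketch of that code is below; the HOST header field name and the naive last-label TLD parse are simplifying assumptions of mine rather than the exact function used on the sensor:

// Naive TLD parser: grab the last label of the host name
def parseTLD(host: String): String =
  if (host == null || host.isEmpty) "" else host.split('.').last

sqlContext.udf.register("parseTLD", parseTLD _)

val tldCounts = sqlContext.sql("""
  SELECT parseTLD(request.headers['HOST']) AS tld, COUNT(*) AS hits
  FROM http
  GROUP BY parseTLD(request.headers['HOST'])
  ORDER BY hits DESC
""")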



This query results in your standard big data row/table type of result. (Something an analyst might consume)



Now it’s time to start the Beautiful Data transformation! (Something the C-Suite can consume)

Here we are printing HTML and JavaScript within a Zeppelin notebook against JSON data output. Instead of staring at rows and columns of big data, beautiful data translates up the management stack and helps tell a clearer story of the threat hunter's findings.
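As a rough sketch of what such a paragraph can look like (building on the hypothetical tldCounts sketch above, and keeping the D3 rendering deliberately minimal):

val rows = tldCounts.collect()
val json = rows
  .map(r => s"""{"tld":"${r.getString(0)}","count":${r.getLong(1)}}""")
  .mkString("[", ",", "]")

// Zeppelin renders interpreter output prefixed with %html as HTML
println(s"""%html
<div id="tld-chart"></div>
<script src="https://d3js.org/d3.v3.min.js"></script>
<script>
var data = $json;
d3.select("#tld-chart").selectAll("div")
  .data(data).enter().append("div")
  .style("background", "steelblue")
  .style("color", "white")
  .style("margin", "2px")
  .style("width", function(d) { return Math.max(40, d.count) + "px"; })
  .text(function(d) { return d.tld + ": " + d.count; });
</script>""")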

The write-once, use-forever concept works wonderfully with Zeppelin + D3. In this example we graphed TLDs, but we could easily represent a different threat hunting dataset with this graphing method. Graphing makes it easy for everyone to see the most and least frequently visited TLDs, and that's the job of beautiful data. Once again we've applied the same TLD notebook to all of our customers' clusters so they can experience their own beautiful data.

Threat Hunting with your hands tied - This is Big Data Part II


Threat hunting isn't only about finding compromised assets; it's also the predictive function of finding the holes a malicious attacker might take advantage of. As I mentioned last week, your customers are your best hunters, accessing your website in a million different ways, with a thousand different web browsers and hundreds of different types of devices. That doesn't even include the automated mass vulnerability scanners, such as Shodan, or research projects like MassScan that are scrubbing your applications as well. Today I'll share some of my queries, and I hope you share some of your most recent hunting exercises and queries with me.

At JASK we utilize Hadoop and Zeppelin notebooks. This allows us to write functions in Spark and query our data using spark-sql syntax. It also allows us to export notebooks as JSON to share with the security community, and to work with our customers and the threat hunting community to build even more powerful notebooks and applied research. Now on to the data.

Searching for DNS non-authoritative answers for customer domains:

The results showed a large number of hosts querying the internal DNS server for one of the customer's own domains. The internal DNS server did not have a record for it, so the query would then be forwarded to an external DNS server. This looked strange, and we realized the misconfiguration would point all users to the CMS licensing manager page, since this particular domain was not registered under their license. I would categorize this as information disclosure: it revealed the CMS server version and dropped everyone (both internal and external users) at the admin login page of the CMS. It turned out they were running a vulnerable CMS version as well. Were they exploited yet? We had been in this POC for a few weeks and could query our data to determine whether anyone had accessed the CMS admin page while we were in place. We were also able to close the loop and write a rule to produce a signal for logins to the admin page. Oftentimes the business will decide this is not a risk, and we simply keep it in our hunting notebook.

The zeppelin paragraph:

SELECT src_ip.address
FROM dns
WHERE authoritative != true
  AND query LIKE ""      -- customer domain goes here
GROUP BY src_ip.address

Building on the CMS information disclosure story mentioned earlier, here's the query we used to perform a historical check and determine whether anyone had accessed the vulnerable CMS.

SELECT src_ip.address, request.uri
FROM http
WHERE (request.uri LIKE "%CMSSiteManager%"
   OR request.uri LIKE "")          -- second URI pattern not recoverable from the original
  AND src_ip.address NOT LIKE "192.168.%"

Non-Standard software - User-Agents:

Most of the customers I've worked with function like the wild west: BYOD, no managed software, and no hard-and-fast policies. Every now and again you get an easy one where the customer maintains an approved software list and possibly even an approved web browser. This makes for easy anomaly hunting or "never have I seen X" type hunting. If we see anything that does not match the customer's "approved" user-agent, we have a finding worth chasing. Below is a sample query; usually you'll add more to it, such as an internal subnet to hunt or a regex of acceptable user-agents. This basic Zeppelin paragraph looks for anything that is not the approved IE 11 user-agent - it's fairly simple, but it should get your mind thinking. I will leave the rest to your own imagination and your specific hunting exercise.

SELECT src_ip.address, dst_ip.address, request.headers['USER-AGENT']
FROM http
WHERE request.headers['USER-AGENT'] != "Mozilla/5.0 (compatible; IE 11.0; Win32; Trident/7.0)"

Maybe you just want to see what your TOP 10 Most popular User-agents are?

SELECT request.headers['USER-AGENT'], COUNT(*) AS hits
FROM http
GROUP BY request.headers['USER-AGENT']
ORDER BY hits DESC
LIMIT 10

Maybe you just want the distinct user-agents in your network? This query has found me anti-virus agents fetching update lists and validating the license key through a Base64-encoded user-agent string. Lame…
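That paragraph can be as simple as the following sketch:

SELECT DISTINCT request.headers['USER-AGENT']
FROM http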


None of the above queries are all that efficient, and how much clarity they provide depends on how tight-lipped the network is. Nesting queries can help clean the results and mean the difference between having a threat hunter analyze 100 results or thousands.

Wasting your time searching for ad-trackers?

I'm not aware of what can be done here short of our government stepping in to protect our privacy, and this hasn't borne me much fruit in a hunt. It has, however, found me people accessing inappropriate content in the workplace, even while the organization had invested in a web proxy and endpoint software to prevent adult content. We could use this to validate the effectiveness of those automated content blocking tools and web proxies. Ad-trackers give up a lot of information about the quality of the website you are accessing, and you just might find this query bearing fruit for finding users browsing websites in "poor" taste for the workplace. I find the more deceptive the ad-tracker, usually the dirtier the website. Here's one of the most common ad-trackers I've seen recently.

SELECT *
FROM http
WHERE request.headers['GET'] LIKE ""      -- ad-tracker domain goes here

Searching for plain text passwords floating around.

This one can be a bit noisy, so make sure to tighten it up with a few "not like" statements after you scrub your first round of results. We've found poorly written business applications with hardcoded passwords crossing the network boundary and floating around internally.

SELECT src_ip.address, request.uri
FROM http
WHERE request.uri LIKE "%password%"

Searching for plain-text Protocols:

We all promise plaintext protocols are not allowed on the network, but we always find them. How about we take a look at the types of FTP activity happening and the exact commands that were run? This is one more point in favor of network data over logs for hunting: if you don't control the FTP server, do you think the FTP server is going to send you the logs? This is the type of hunting that MUST be done with network data. Log data is a ho-hum source for hunting - maybe you have it, maybe you don't. You never know if you are getting the true results with logs, because you never know which servers are logging. Sometimes the servers running are not yours, but a service a user throws up to get their job done quickly. That was the case with one of our most recent hunting exercises, which found a quickly stood-up FTP server on the internal network.

SELECT src_ip.address, dst_ip.address, command
FROM ftp

Maybe you are searching for anyone using those pesky Dell eDellRoot or Superfish-style root certificates? This is just a dabble into the power of hunting based on TLS certificates, the cipher being used, and more. I've yet to find anything in a customer network related to weak ciphers or export-grade encryption, and that's a good sign. TLS parameters are easy to hunt for and you should do it. It's not always about what your certificates look like, but about the certificates of the sites your users are interacting with. This can be the case with encrypted malware and TLS-encrypted botnets using self-signed or misconfigured certificates. Hackers make mistakes, and it's your job to catch their mistakes - they are doing a good job of catching ours.

SELECT *
FROM tls
WHERE subject LIKE "%edell%"

The story goes on forever. Are you focused on the perimeter and want to see any connections that were established from external to internal? We filter on RFC 1918 space in this query. As we graduate our knowledge in Spark we begin to define variables and utilize functions, but for this article you'll see no variables are used - we simply hard-code the customer's RFC 1918 private address ranges into the query.

SELECT src_ip.address, dst_ip.address, dst_port, conn_state, COUNT(*)
FROM flows
WHERE conn_state = "S1"
  AND dst_ip.address LIKE "172.%"
  AND src_ip.address NOT LIKE "172.%"
  AND src_ip.address NOT LIKE "192.168.%"
  AND month = month(current_timestamp())
GROUP BY src_ip.address, dst_ip.address, dst_port, conn_state

Still loving DNS and want to see your top 10 DNS queries? Your domain will likely be the top hit, so go ahead and set it as a "not like" and keep paring down those not-like statements for a personal fit. Remember, this is a write-once, run-many-times hunt. Investing your time in writing good queries the first time will make future hunting exercises quicker and more efficient.

SELECT query, COUNT(query) AS hits
FROM dns
WHERE query != ''
  AND query NOT LIKE ''        -- your own domain goes here
GROUP BY query
ORDER BY COUNT(query) DESC
LIMIT 10

Have any ugly buggers trying to perform DNS exfiltration? Try searching for DNS queries of long length. This is a pretty weak one, and almost every hit ends up being Spotify's long DNS queries for playlists.

SELECT query
FROM dns
WHERE LENGTH(query) >= 100
  AND query NOT LIKE ""        -- filter out known-noisy domains

Weak Kerberos Ciphers?

RC4-HMAC and DES are seen on Windows XP and on servers up through Windows 2003. Most environments should be moving away from them for obvious weak-cipher reasons. This query is great for validating that strong ciphers are used throughout an environment and for calculating the risk associated with wherever these weak ciphers are occurring in your network.

SELECT *
FROM kerberos
WHERE cipher LIKE "%rc4%"
   OR cipher LIKE "%des%"

Finally, let us not forget the world of executables - those hundreds of thousands of dollars spent on full packet capture devices for the sole business purpose of extracting executables. Save yourself:

SELECT src_ip.address, dst_ip.address, hash.sha256, mime_type
FROM file
GROUP BY src_ip.address, dst_ip.address, hash.sha256, mime_type

That's a small sample of the hundreds of queries, paragraphs, and notebooks we've built at JASK for our customers to jump right into hunting in big data. We prefer to organize these queries into focused notebooks, such as DNS security, HTTP, and TLS notebooks, and run them at the notebook level rather than the paragraph level, adding tremendous value and efficiency to a threat analytics program.

What to do with the results and wrapping up the Hunting Exercise.

Results are nothing if you can't wrap them into the business process. When the hunting exercise is complete, take your query and turn it into signal intelligence to drive the artificial intelligence. In JASK we have a rule engine for exactly this purpose: teach JASK a new skill and the AI becomes smarter. No security detection technology will catch everything, but when humans, customers, data science, and the security community are able to continually improve detection through hunting exercises and close the loop, we are one step closer to defending the business and turning hunting exercises into a repeatable process.

Happy Hunting!



Threat Hunting with your hands tied - This is Big Data Part I

The Stage:

When walking into a fine china shop, you can look, but do not touch! The same concept applies in a customer proof of concept: you can't influence the infrastructure or applications, and you can't review the website or encourage an application to disclose its version or variables to expose its vulnerabilities. It's the mother of all challenges, one I live with every day working at JASK. Welcome to threat hunting with big data science, where the rules are clear - DON'T TOUCH.

The Measurement:

In the world of AI-driven cyber security, it takes time for technology to learn the network and listen for threat signals to reach a noise level worthy of human interaction. Just as a large city such as San Francisco, CA would not rank its safety on the number of tickets issued, AI-driven cyber security cannot use the number of alerts generated as a success metric. The number of events generated is not a metric any efficiently running SOC should accept as a measure of its health. So while the AI takes the time it needs to learn the network, what is left for the SOC personnel and the SE to do?

Thankfully, that's big data; the gold we are panning for sits within it, and the coal that keeps the fire burning is continually produced. In a Hadoop- and Spark-backed platform, the questions come as fast and fluid as the answers. It's threat hunting with your hands tied: big data science meets signals intelligence with network data. The underpinnings of Spark and Hadoop build a base for an AI-driven platform and a big data hunting ground. The data is exposed through Zeppelin notebooks, making it the perfect playground for threat hunting, and this is the moment my job gets interesting. The blinders come off and we press 'Play' on the notebooks.

The Goal:

"Everyone is compromised" right? That is what has been preached more than a decade and what we are still told today. With this mindset, you would expect that in a POC you would find something bad, compelling you to purchase bad. Unfortunately for my bank account, the reality is that while everyone is compromised (and it's relatively easy to locate a compromise), how large of an impact will it have on the organization? It seems the “Everyone is compromised” statement mostly addresses trackers and adware 99.99% (four-nines) of the time. The monetization of employees via adware isn't something a CISO prioritizes as a high-risk to the business. You have to dig deeper to make the payday and that's when the real hunting begins. I would likely modify the phrase "Everyone is compromised" to “Everyone is critically compromised at some point in time.” The job of threat hunting isn’t to just detect a threat, but to analyze and predict threats the company classifies as high risk.

The Hunt:

Threat hunting is about letting the network tell us where to look. When looking at network data, we see DNS authoritative answers for non-authoritative domains and top DNS queries for non-internal assets. We verify strong TLS ciphers are being used throughout the enterprise, drill down with a focus on web server response codes, request headers, response headers, suspicious user-agents on internal assets, and analyze the network data for how a business' customers interact with the websites and applications both internal and external. Do we see fast flux domains? Do we see rapid queries? Do we see suspicious executables (those hidden within zip files) or file transfer methods? Do we see an excess of SMB, RDP, or authentication protocol traffic? The questions we are able to ask Big Data are limitless and the "Big Data Lips Don't Lie".

The Discoveries:

These big data queries perform the predictive function of viewing how a customer's internal, external, good, and bad users interact with the business. We don't have the ability to touch an internal asset or application and influence the results; however, every business has customers. Whether it is an employee or an external user, these people are hands-on performing the pentest. You may hire a "professional" pentest once or twice a year, but the reality is we can never predict with 100% certainty how customers will interact with the applications. Where is the company accidentally exposing itself, and how do you determine how at-risk your company is?

Part II: The Results (Coming Soon)


Why are we using logs to do the network's job?!


Why cook eggs on a glass stove instead of using the non-stick pans in the cupboard? Sure, it'll cook the eggs, but it is not the proper tool for the job. So why is the SOC using endpoint logs to gain the visibility the network provides? Clearly someone forgot about what's in the kitchen. Why has the SOC spent the last decade forcing the SIEM to do the job of network tools? To get technical, why are my Linux gurus using auditd to monitor sockets? (Talk about not using the right tools!)

The formal Kill Chain model as described by Lockheed Martin consists of 7 stages. Different vendors butcher each of the stages to their benefit, but let's start with the Kill Chain as Lockheed described it (and ultimately holds the copyright for): Recon, Weaponization, Delivery, Exploitation, Installation, Command & Control, and finally Stage 7 - Actions on Objectives. Analyzing the seven stages, we find that only TWO (2!) stages do not traverse the network.

Don’t believe me? Well here is your proof:

Stage 1 – Active Reconnaissance. This is when a remote attacker MUST cross the internet. It's as simple as that: this isn't a log event, this is network communication. If this were monitored the way many organizations attempt to leverage their SIEM and log environment using endpoints, that knowledge would be replicated by every host involved in the reconnaissance event. Why deploy agents and monitor logs on an entire /21 to capture a port scan when a network-based sensor could monitor at one point in the network and see the entire Reconnaissance phase?

Stage 2 – Weaponization. This is the weakness of every solution, endpoint-based or network. It's also the stage that most vendors cut out of their solution's messaging, because it's what happens in the attacker's basement. It's the stage where the cyber-criminal builds an exploit based on evidence found in the Recon phase before sending it to your devices. Moving quickly to

Stage 3 – Delivery. Guess what? In order to deliver a package to grandma, the UPS truck has to drive on the highway. Similarly, the cyber-criminal's exploit must cross the information super-highway, also known as the network. So why in the world is the SOC monitoring endpoint logs to gain second-hand information that is the network's first-hand knowledge? I'm befuddled by the complexity the SIEM vendors have bestowed upon our poor SOCs, aren't you?

Stage 4 – Exploitation. Show me some endpoint love! Finally, we find a proper location for endpoint monitoring. When the magic package lands on the endpoint and is executed, there is no better place to monitor the outcome than the endpoint itself. Thank the syslog-ng lord for endpoint forensics and logs. Are you ready for another perfect task for endpoint monitoring?

Stage 5 – Installation. When it's time to install malware, it's time to touch the endpoint again. That's two stages out of five so far where logs are actually the correct tool for the job.

Stage 6 – Command and Control. Getting back to the network: when an attacker in Guangdong, China wants to control his botnet in San Francisco, California, there's a solid guarantee it's going to be over the internet - unless, that is, the attacker plans on taking a cargo ship to the Port of Oakland, has a BART pass to get to my office, walks up the stairs to my computer, and left-clicks my mouse. Monitoring for Command and Control with endpoint logs? Are you kidding? Are you really going to get on that cargo ship? It's like eating spaghetti with a spoon. Sure, it gets some noodles into your mouth, but most slip back into the data lake of logs.

Stage 7 – Actions on Objectives. Finally, the network comes back to light! Guess what? Unless your cyber-criminal once again plans on taking PTO to board that cargo ship and visit you to steal your documents, data exfiltration almost certainly will cross the network. For what bloody reason are we monitoring sockets with auditd for this? My brain hurts watching 90% of SOCs around the world leverage logs in the SIEM to detect everything and then wonder why everything is failing!

We get the point: more logs isn't going to cover the gaps that network sensors were built from birth to cover. Now please, stop using logs to do the network's job. [End soapbox.]