From Big Data to Beautiful Data: Bridging the gap from Threat Hunter to C-Suite with Zeppelin notebooks and D3

In my previous posts we worked through a number of Threat Hunting queries and data mining ideas, and we left off with how to demonstrate and translate value to the C-Suite. That challenge has led me into the realm of presenting data in beautiful ways. At JASK, customers access big data with Zeppelin notebooks, but Zeppelin begs for better implementations of beautiful data, shipping with only a small number of chart types. A pie chart and a bar chart are not going to cut the mustard when demonstrating value up the chain. Cue D3 and its near-infinite flexibility in displaying beautiful data.

Working on the cluster from one of our research sensors at a very large tech university, we've written a function to parse top-level domains (the .com, .org, .net portion of a URL). Using the function, we query our data for the TLD and search for suspicious TLDs in HTTP request headers. Here is the code where we apply our TLD UDF (Spark user-defined function) to the dataset.
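
Something along these lines works as a %pyspark paragraph. This is a minimal sketch rather than our production code: the tld function name, the http table, and the HOST header field are assumptions carried through the rest of this post, and a production parser would check against the Public Suffix List instead of splitting on dots.

from pyspark.sql.types import StringType

def extract_tld(host):
    # Naive TLD parse: keep the label after the last dot (e.g. "com", "org").
    if host is None or "." not in host:
        return None
    return host.rsplit(".", 1)[-1].lower()

# Register the UDF so it can be called from Spark SQL paragraphs.
sqlContext.registerFunction("tld", extract_tld, StringType())

# Apply the UDF across HTTP request headers and count hits per TLD.
tld_counts = sqlContext.sql("""
    SELECT tld(request.headers['HOST']) AS tld, COUNT(*) AS hits
    FROM http
    GROUP BY tld(request.headers['HOST'])
    ORDER BY hits DESC
""")
tld_counts.show()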

This query results in your standard big-data row/table output. (Something an analyst might consume)

Now it’s time to start the Beautiful Data transformation! (Something the C-Suite can consume)

Here we print HTML and JavaScript within a Zeppelin notebook against the JSON output of the query. Instead of staring at rows and columns of big data, beautiful data translates up the management stack and helps tell a clearer story of the threat hunter's findings.
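
The paragraph looks something like the sketch below. Zeppelin renders a paragraph as HTML when its output starts with %html, which is the hook that lets D3 take over; the tld_counts DataFrame and the element id are carried over from the sketch above and are illustrative.

import json

# Hand the query results to the browser as JSON.
rows = tld_counts.collect()
data = json.dumps([{"tld": r[0], "hits": r[1]} for r in rows])

# Printing %html as the first token makes Zeppelin render this as a page, not text.
print("""%html
<div id="tld-chart"></div>
<script src="https://d3js.org/d3.v3.min.js"></script>
<script>
var data = """ + data + """;
var width = 700, barHeight = 20;
var x = d3.scale.linear()
    .domain([0, d3.max(data, function(d) { return d.hits; })])
    .range([0, width]);
var svg = d3.select("#tld-chart").append("svg")
    .attr("width", width)
    .attr("height", barHeight * data.length);
var bar = svg.selectAll("g").data(data).enter().append("g")
    .attr("transform", function(d, i) { return "translate(0," + i * barHeight + ")"; });
bar.append("rect")
    .attr("width", function(d) { return x(d.hits); })
    .attr("height", barHeight - 1);
bar.append("text")
    .attr("x", 3).attr("y", barHeight / 2).attr("dy", ".35em")
    .text(function(d) { return d.tld + " (" + d.hits + ")"; });
</script>""")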

The write-once, use-forever concept works wonderfully with Zeppelin + D3. In this example we graphed TLDs, but we could just as easily represent a different Threat Hunting dataset with this graphing method. Graphing makes it easy for everyone to see the most and least frequently visited TLDs, and that's the job of beautiful data. Better still, we've applied the same TLD notebook to all of our customers' clusters so each can experience their own Beautiful Data.

Gigamon brings deep packet inspection to Amazon cloud

Gigamon Inc. is bringing on-premise-like network visibility to the Amazon Web Services cloud with a data-in-motion visibility platform that enables information technology organizations to conduct deep packet analysis on cloud workloads.

Source: Silicon Angle

Read Full Article Here


Why We Picked TensorFlow for Cybersecurity


When I started in security analytics several years ago, the choice of tool and platform was typically dictated for you, usually by investments the company had already made. These days, scientists have the opposite problem: a dizzying array of tools in a variety of licensing modes. The frustration of limited toolsets has been replaced by the anxiety of choice. As wonderful as unlimited options may seem, in reality we must limit our options in order to be successful. Ideally, an organization can converge on a single choice: not perfect, but one that maximizes benefit while decreasing the maintenance burden.

At JASK, we have chosen a toolset that we think does that: Google TensorFlow. At a high level, these were the reasons:

  • Data science needs a toolset that can take advantage of CPUs, GPUs, or a mix of both.
  • A product for model building must recognize that the best language for modeling is not the best language for algorithms.
  • The experience of local development and cluster development should be the same.

We need more cowbell.

It seems intuitive to use as much processing power as a piece of hardware offers; unfortunately, we rarely have this option. Most notebooks and workstations have a combined GPU/CPU on board (not always NVIDIA), and high-performing GPUs are a special option on most servers. On the other hand, while a GPU is fantastic at certain problems (matrix multiplication, for example), no class on GPU programming would tell you to do everything on a GPU. If you did hear this in a class, I recommend supplementing it with Heterogeneous Parallel Programming. TensorFlow meets this requirement: I can develop on a laptop with no GPUs, then run the same code on a cloud instance with an array of GPUs installed.
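
A tiny sketch of what that buys you, using the TensorFlow 1.x-era session API: with soft placement enabled, the same graph runs whether or not /gpu:0 actually exists on the machine.

import tensorflow as tf

# Request a GPU for the heavy op; soft placement falls back to CPU if none is present.
with tf.device("/gpu:0"):
    a = tf.random_normal([1000, 1000])
    b = tf.random_normal([1000, 1000])
    c = tf.matmul(a, b)  # the kind of work GPUs excel at

config = tf.ConfigProto(allow_soft_placement=True, log_device_placement=True)
with tf.Session(config=config) as sess:
    print(sess.run(tf.reduce_sum(c)))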

A statistician and a mathematician walk into a bar …

Back at university, Computational Finance and Applied Mathematics shared some faculty and even attended the same graduation ceremony. Yet all of their coursework was in R and ours was in Matlab, which I think is the most concise illustration of model building versus algorithm building in terms of software tools. Here's another: some believe in having minimal knowledge of each algorithm's inner workings and a wide view of all the possibilities and available tools, while others believe in understanding fewer algorithms, but deeply enough to program them yourself. I now have a theory for the likely reason behind this: your position on that spectrum is a function of how much hate and fear you have for C and C++ programming. To unite these examples: the quants and the applied mathematicians both knew Python, and to take advantage of decades of numerical optimization you have to do it in C (or, let's face it, Fortran). ML solutions must be built on something that can bridge these two worlds, and TensorFlow builds that bridge: Python describes the model, while execution happens in compiled C++ underneath.

Anyone know a pop culture reference about parallel programming? 

As much as I would like every data scientist in the world to have their own Hadoop cluster, we know that's not going to happen. Then again, in line with Moore's law, today's laptop surpasses the mainframe I helped my father load punch cards into when I was little. Doing your development on clusters is expensive, and debugging and testing become problematic as well. I have found that I am more willing to give up some application performance than to give up easy debugging and testing. With some education, data scientists can be persuaded to do their development with "small data," and we can treat cluster parallelism and performance as a separate step. The ability to develop, test, and run on a local machine and then treat parallelization as a configuration step is a very nice thing about TensorFlow.
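
A sketch of "parallelization as a configuration step," again with the 1.x-era distributed API; the hostnames are illustrative. The model-building code is untouched between laptop and cluster, and only the cluster definition changes.

import tensorflow as tf

# Describe the cluster as configuration, separate from the model code.
cluster = tf.train.ClusterSpec({
    "ps":     ["ps0.example.com:2222"],
    "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
})
server = tf.train.Server(cluster, job_name="worker", task_index=0)

# The same graph developed locally now places variables and ops across the cluster.
with tf.device(tf.train.replica_device_setter(cluster=cluster)):
    weights = tf.Variable(tf.zeros([784, 10]), name="weights")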

Does TensorFlow have everything we need? While baked-in visualization and a large user community are very beneficial, I would trade both in a heartbeat for a tool that ran GPUs from different vendors. And while TensorFlow was our choice, there are other good options to evaluate for yourself; your mileage may vary. When deciding what's the best tool for you, I recommend also looking at Theano, DSSTNE, and scikit-learn to see if one of them is a better fit.

But as a team you have to start somewhere, and my experience has shown that "somewhere" should be reasonably close to what production will look like, with enough capability that you are not greatly limited or required to maintain 50 different software packages for 50 problems.



Can Hackers Be Stopped? The State of Defense in the Private Sector

One week before the recent massive hack attack shut off access to Twitter, PayPal, Airbnb and dozens of other major websites, I was at an off-the-record conference with leaders of some of the country's biggest companies, discussing cyberthreats. Like soldiers in one of the landing crafts approaching the beach on D-Day, the CEOs seemed resigned to their grim fate. A destructive attack was inevitably going to rip through some, if not all, of them. They felt sorry for themselves and one another.

And most weren’t even imagining how bad it’s going to get. IBM CEO Ginni Rometty has said cybercrime is today’s greatest threat to global business, apparently putting it ahead of nuclear war, climate change or an alien invasion.

Source: Newsweek

Read full article here

Threat Hunting with your hands tied - This is Big Data Part II


Threat hunting isn't only about finding compromised assets; it's also about performing the predictive function of finding the holes a malicious attacker might take advantage of. As I mentioned last week, your customers are your best hunters, accessing your website in a million different ways, with a thousand different web browsers and hundreds of different types of devices. And that doesn't count the automated mass vulnerability scanners, such as Shodan, or research projects like MassScan, that are scrubbing your applications as well. Today I'll share some of my queries, and I hope you share some of your most recent hunting exercises and queries with me.

At JASK we utilize Hadoop and Zeppelin notebooks. This allows us to write functions in Spark and query our data using Spark SQL syntax. It also allows us to export notebooks as JSON to share with the security community, and to work with our customers and the threat hunting community to build even more powerful notebooks and applied research. Now, onto the data.

Searching for DNS non-authoritative answers for customer domains:

The results showed a large number of hosts querying the internal DNS server for a domain it had no record for, so the queries were forwarded on to an external DNS server. This looked strange, and we realized the misconfiguration would point all users to the CMS vendor's licensing-manager page, since this particular domain was not registered under the customer's license. I would categorize this as information disclosure: it revealed the CMS server version and dropped everyone, internal and external users alike, onto the admin login page of the CMS. It also turned out they were running a vulnerable CMS version. Were they exploited yet? We had been in this POC for a few weeks and could query our data to determine whether anyone had accessed the CMS admin page while we had been in place. We were also able to close the loop and write a rule to produce a signal for logins to the admin page. Oftentimes the business will decide this is not a risk, and we simply keep it in our hunting notebook.

The Zeppelin paragraph:

SELECT src_ip.address
FROM dns
WHERE authoritative != true
  AND query LIKE ""  -- customer domain elided
GROUP BY src_ip.address

Building on the CMS information disclosure story mentioned earlier, here's the query we used to perform the historical check and determine whether anyone had accessed the vulnerable CMS.

SELECT src_ip.address, request.uri
FROM http
WHERE request.uri LIKE "%CMSSiteManager%"  -- second URI pattern elided
  AND src_ip.address NOT LIKE "192.168.%"

Non-Standard software - User-Agents:

Most of the customers I've worked with function like the wild west: BYOD, with no managed software or hard-and-fast policies. Every now and again you get an easy one, where the customer maintains an approved software list and possibly even an approved web browser. This makes for easy anomaly hunting, or "never have I seen X" hunting. If we see anything that does not match the customer's approved User-Agent, we have a finding worth chasing. Below is a basic sample Zeppelin paragraph; usually you'll add more to the query, such as an internal subnet to hunt or a regex of acceptable User-Agents (see the sketch after the query), and I'll leave the rest to your imagination and your own hunting exercises. Here we are looking for anything other than the approved IE 11 User-Agent. It's a simple one, but it should get your mind thinking.

SELECT src_ip.address, dst_ip.address, request.headers['USER-AGENT']
FROM http
WHERE request.headers['USER-AGENT'] != "Mozilla/5.0 (compatible; IE 11.0; Win32; Trident/7.0)"
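
For reference, here is a sketch of the fuller version alluded to above, scoping the hunt to an internal subnet and whitelisting acceptable User-Agents with a regex. The 10.1.% subnet and the pattern are illustrative, and it is written as a %pyspark paragraph so the result can be reused programmatically.

suspicious_uas = sqlContext.sql("""
    SELECT src_ip.address, dst_ip.address, request.headers['USER-AGENT'] AS ua
    FROM http
    WHERE src_ip.address LIKE "10.1.%"
      AND request.headers['USER-AGENT'] NOT RLIKE "Trident/7\\.0|Chrome/5[0-9]"
""")
suspicious_uas.show(50)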

Maybe you just want to see what your top 10 most popular User-Agents are?

SELECT request.headers['USER-AGENT'], COUNT(*) AS hits
FROM http
GROUP BY request.headers['USER-AGENT']
ORDER BY hits DESC
LIMIT 10

Maybe you just want the distinct User-Agents in your network? This kind of query has found me antivirus agents fetching update lists and validating their license key through a base64-encoded User-Agent string. Lame…
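
The paragraph itself can be as simple as the sketch below (same assumed http schema as the queries above); scan the output for oddballs like base64-looking strings.

distinct_uas = sqlContext.sql(
    "SELECT DISTINCT request.headers['USER-AGENT'] AS ua FROM http")
distinct_uas.show(100)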


None of the above queries is especially efficient, and the more tight-lipped the network, the more clarity they provide. Nesting queries can help clean the results and can mean the difference between a threat hunter analyzing 100 results or thousands.

Wasting your time searching for ad-trackers?

I'm not aware of much that can be done here short of our government stepping in to protect our privacy, and this one hasn't borne me much fruit in a hunt. It has, however, found me people accessing inappropriate content in the workplace, even where the organization had invested in a web proxy and endpoint software to block adult content; we could use this query to validate the effectiveness of those automated content-blocking tools. Ad-trackers give up a lot of information about the quality of the website being accessed, and you just might find this query bearing fruit for finding users browsing websites in "poor" taste for the workplace. I find the more deceptive the ad-tracker, usually the dirtier the website. Here's one of the most common ad-trackers I've seen recently.

SELECT src_ip.address, request.headers['GET']
FROM http
WHERE request.headers['GET'] LIKE ""  -- tracker pattern elided

Searching for plain-text passwords floating around:

This one can be a bit noisy, so make sure to tighten it up after you scrub your first round of results with a few “not like” statements. We’ve found poor business applications with hardcoded passwords crossing the network boundary and floating around internally.

select src_ip.address, dst_ip.address, request.uri
from http
where request.uri like "%password%"

Searching for plain-text Protocols:

We all promise plain-text protocols are not allowed on the network, but we always find them. How about we take a look at the types of FTP activity happening and the exact commands that were run? This is one advantage network data has over logs for hunting: if you don't control the FTP server, do you think it is going to send you its logs? This type of hunting MUST be done with network data. Logs are a ho-hum source for hunting: maybe you have them, maybe you don't, and you never know which servers are actually logging. Sometimes the servers aren't even yours, but a service a user threw up to get their job done quickly. That was the case in one of our most recent hunting exercises, which found a hastily stood-up FTP server on the internal network.

SELECT src_ip.address, dst_ip.address, command, arg  -- command/arg field names assumed (Bro-style ftp log)
FROM ftp

Maybe you are searching for anyone using those pesky Dell eDellRoot or Lenovo Superfish root certificates? This is just a dabble into the power of hunting on TLS certificates, the cipher in use, and more. I've yet to find anything in a customer network related to weak ciphers or export-grade encryption, and that's a good sign. TLS parameters are easy to hunt for, and you should do it. It's not always about what your certificates look like, but about the certificates of the sites your users interact with; think encrypted malware and TLS-encrypted botnets using self-signed or misconfigured certificates. Hackers make mistakes, and it's your job to catch their mistakes. They are doing a good job of catching ours.

select *
from tls
where subject like "%edell%"

The story goes on forever. Are you focused on the perimeter and want to see any connections that were established from external to internal? We remove RFC 1918 space in this query. As our Spark knowledge matures we begin to define variables and utilize functions, but for this article no variables are used; we simply hard-code the customer's RFC 1918 private address ranges into the query.

SELECT src_ip.address, dst_ip.address, dst_port, conn_state
FROM flows
WHERE conn_state = "S1"
  AND dst_ip.address LIKE "172.%"
  AND src_ip.address NOT LIKE "172.%"
  AND src_ip.address NOT LIKE "192.168.%"
  AND month = month(current_timestamp())
GROUP BY src_ip.address, dst_ip.address, dst_port, conn_state

Still loving DNS and want to see your top 10 DNS queries? Your domain will likely be the top hit; go ahead and set it as a "not like" and keep paring down those "not like" statements for a personal fit. Remember, this is a write-once, run-many-times hunt: investing the time to write good queries now makes future hunting exercises quicker and more efficient.

SELECT query, COUNT(query) AS hits
FROM dns
WHERE query != ''
  AND query NOT LIKE ''  -- your own domain, elided
GROUP BY query
ORDER BY hits DESC
LIMIT 10

Have any ugly buggers trying to perform DNS exfiltration? Try searching for DNS queries of unusual length. This is a pretty weak one; almost every hit ends up being Spotify's long DNS queries for playlists.

SELECT query
FROM dns
WHERE LENGTH(query) >= 100
  AND query NOT LIKE ""  -- known-noisy domains elided

Weak Kerberos Ciphers?

RC4-HMAC and DES are seen from Windows XP up through Windows Server 2003, and they are something most environments should be moving away from, for obvious weak-cipher reasons. This query is great for validating that strong ciphers are used throughout an environment and for calculating the risk associated with wherever these weak ciphers appear in your network.

SELECT *
FROM kerberos
WHERE cipher LIKE "%rc4%"
   OR cipher LIKE "%des%"

Finally, let us not forget the world of executables. Hundreds of thousands of dollars are spent on full packet capture devices for the sole business purpose of extracting executables. Save yourself:

select src_ip.address, dst_ip.address, hash.sha256, mime_type
from file
group by src_ip.address, dst_ip.address, hash.sha256, mime_type

That's a small sample of the hundreds of queries, paragraphs, and notebooks we've built at JASK so our customers can jump right into hunting in big data. We prefer to organize these queries into focused notebooks, such as DNS security, HTTP, and TLS notebooks, and run them at the notebook level rather than the paragraph level, adding tremendous value and efficiency to a threat analytics program.

What to do with the results and wrapping up the Hunting Exercise.

Results are nothing if you can't wrap them into the business process. When the hunting exercise is complete, take your query and turn it into signal intelligence to drive the artificial intelligence. JASK has a rule engine for exactly this purpose: teach JASK a new skill and the AI becomes smarter. No security detection technology will catch everything, but when humans, customers, data science, and the security community continually improve detection through hunting exercises and close the loop, we are one step closer to defending the business and turning hunting into a repeatable process.

Happy Hunting!