She Blinded Me with Data Science: Why Machine Learning Varies Broadly for Network Security

By David Thompson | 11 July 2016

One great promise machine learning holds for the network security industry is the ability to detect advanced and unknown attacks, particularly those leading to data breaches. Unfortunately, machine learning is quickly becoming a popular marketing term, and vendors use it in very different ways, which obscures whether real benefits are being introduced. Instead of arguing over the basic jargon, it’s time to better define what machine learning can accomplish and why one would care.

Bold Step Forward or Repeating History?

Machine learning or data science is only as good as its application. Applying it to the wrong problems, or in the same ways security has been approached for the past 20 years, will deliver the same results: excessive false positives and a continued inability to detect critical network attack activity.

At the same time, a good application of machine learning could be a game-changer for security by potentially eliminating the false positive problem and bringing about a new level of fidelity and accuracy that could result in finding an active network attacker quickly before data theft or damage could occur.

How to Evaluate Machine Learning in Security

Rather than embark on an academic treatment of various machine learning techniques, such as regression, classification, neural networks, feature extraction, deep learning, etc., it would be more productive to focus on the practical applications. Basic journalist questions of who, what, where, when and how serve as a great framework for cutting through marketing claims and understanding how a product or service is utilising machine learning.

Who is performing machine learning—the vendor, the product or the user?

Is it the vendor’s research team and data scientists, or is the product itself performing the machine learning? This simple question alone can strongly differentiate product approaches.

In the vast majority of cases, the term machine learning actually describes one of the tools a vendor uses to develop their product or generate threat intelligence. Here the vendor is performing machine learning in their lab rather than the product doing it on premises. AV, URL filtering, and sandbox vendors use this type of machine learning behind the scenes, and have done so for years.

Some products perform machine learning as an integral part of their function, typically for behavioural detection. In this case, the product “learns” the specific environment and uses that information for detection. A user or machine starting to access resources it has never accessed before is a good example. There is no predetermined rule, signature or pattern that can reliably detect this. Only profiling normal behaviour in the particular network and applying that knowledge to detect anomalous behaviour can achieve an accurate detection.
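The idea of profiling normal behaviour and then flagging first-time access can be sketched in a few lines of Python. This is only an illustration under simple assumptions, not how any particular product works: the `AccessProfiler` class, its method names and the sample users and resources are all hypothetical, and a real system would weigh many more signals than a single new access.

```python
from collections import defaultdict


class AccessProfiler:
    """Hypothetical sketch: learn which resources each user normally
    touches, then flag access to anything outside that baseline."""

    def __init__(self):
        # user -> set of resources seen during the learning period
        self.baseline = defaultdict(set)

    def learn(self, user, resource):
        """Record an observed access as part of normal behaviour."""
        self.baseline[user].add(resource)

    def is_anomalous(self, user, resource):
        """True if this user has never been seen accessing this resource."""
        return resource not in self.baseline.get(user, set())


# Learning period: made-up observations from one network.
profiler = AccessProfiler()
for user, resource in [
    ("alice", "wiki"),
    ("alice", "mail"),
    ("bob", "mail"),
    ("bob", "finance-db"),
]:
    profiler.learn(user, resource)

# Detection: alice reaching the finance database is new for her,
# even though the same access is normal for bob.
alice_flag = profiler.is_anomalous("alice", "finance-db")
bob_flag = profiler.is_anomalous("bob", "finance-db")
```

Note that no signature for "bad" appears anywhere in the sketch. The same access is normal for one user and anomalous for another, which is why this kind of detection has to be learned on the network where it runs.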

Machine learning may be used by end-users themselves, such as data scientists. For example, business intelligence tools help the end-user define datasets, run correlations, regressions and clustering algorithms. In this case the end-user directly utilises machine learning for a purpose. The end-user decides which data to process, what parameters to use and how to interpret the results.

What is machine learning being used to accomplish?

Look at what is being analysed. What are the inputs? If the inputs are simply labelled samples of objects such as known malware, files, registry changes, Indicators of Compromise (IOCs), domains, etc., then the result will be an engine for detecting malware. On the other hand, if what is being analysed comes from the production systems of an organisation (network traffic, endpoints, or logs), then the system is likely designed with something other than simple malware detection in mind.

Where is the machine learning being performed?

Analysis in a vendor’s lab indicates that the system isn’t going to be learning much about your network or users. Analysis performed on your site strongly implies the system will be learning and profiling some aspect of your environment. This is the key to learning or knowing what is normal on the network. With a small bit of feedback to the system, this can establish a baseline and provide anomaly-based detection that highlights attack behaviour, rather than simply looking for more objects.

When is the machine learning being performed?

If learning takes place before the product is deployed, then clearly it is not learning anything about the live network where it is deployed. Learning in advance, by definition, must be learning only about objects that are already known to be malicious. Learning that happens after deployment is likely to be focused on the activities that are happening in real-time on your network, as your users go about their normal tasks.

How is the machine learning being performed?

Supervised machine learning techniques are those that take as inputs labelled (known) samples of malware and benign files. These techniques are best used to learn, and thus describe and identify, what they are trained on: to distinguish good from bad. Unsupervised machine learning techniques are those having no predefined set of already known examples. Instead the system must group (“cluster”) and infer knowledge.

The devil is in the details here, because all decent machine learning implementations blend many (if not hundreds) of different models, and can extract learned features from one problem space and sometimes even apply them effectively to an adjacent or similar problem. This is why the “how” is not the best mechanism for understanding whether a vendor’s approach will be beneficial or not. A basic understanding of at least whether the approach leans towards supervised or unsupervised, however, can help clarify whether the answers to the questions already posed in this article indicate a solution focused on the traditional model of security using statically defined threats, or one that is tackling detection from a new angle.
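To make the supervised/unsupervised distinction concrete, here is a minimal Python sketch using only the standard library. It is an assumption-laden toy, not a depiction of any vendor’s method: the supervised half is a nearest-centroid classifier trained on labelled feature vectors, and the unsupervised half is a simple outlier score (distance from the mean of unlabelled observations) rather than a full clustering algorithm. The function names, the two made-up features and all the numbers are hypothetical.

```python
import math
from statistics import mean, pstdev


def train_centroids(samples):
    """Supervised: samples is a list of (feature_vector, label) pairs.
    Returns the average feature vector (centroid) for each label."""
    grouped = {}
    for features, label in samples:
        grouped.setdefault(label, []).append(features)
    return {
        label: [mean(column) for column in zip(*rows)]
        for label, rows in grouped.items()
    }


def classify(centroids, features):
    """Assign the label whose centroid is closest to the new sample."""
    return min(centroids, key=lambda label: math.dist(centroids[label], features))


def outliers(values, threshold=2.0):
    """Unsupervised: no labels at all. Flag values that lie more than
    `threshold` standard deviations from the mean of what was observed."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Supervised: labelled samples described by two made-up features.
labelled = [
    ([0.9, 0.8], "malware"),
    ([0.8, 0.9], "malware"),
    ([0.1, 0.2], "benign"),
    ([0.2, 0.1], "benign"),
]
centroids = train_centroids(labelled)
verdict = classify(centroids, [0.7, 0.9])

# Unsupervised: unlabelled daily upload volumes (MB) for one host.
uploads = [10, 11, 9, 10, 12, 10, 11, 95]
unusual = outliers(uploads)
```

The supervised half can only ever answer with a label it was trained on, while the unsupervised half says nothing about malware at all; it only reports that one observation does not look like the rest.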

Why does any of this matter?

Understanding the usage of machine learning matters because there is no hope of solving certain security problems with a tool trained on the wrong task! The next step might be to ask: what are the benefits of focusing on security based on known, statically defined threats versus a new approach that uses behavioural profiling to detect malicious anomalies, and how do I tell whether a vendor is going to help me with one or the other?

Ultimately, the challenge in security today is the exploding complexity of our devices and systems combined with an inability to adequately detect and stop attacks. The key to meeting this challenge isn’t quicker identification of malware, but utilising machine learning to better understand our own networks, devices, and users, and from that understanding dramatically speeding up the process of identifying malicious actors (whether external or internal).

By David Thompson, Senior Director, Product Management, LightCyber.