The Art and Science of Network Anomaly Detection – Part 1

Anomaly is an often-used term to indicate that something is deviating from normal. However, understanding what anomalies are and how to detect them is anything but easy. This is especially true when it comes to classifying and detecting anomalies in a network.

In this first part of a two-part blog series, we will attempt to define what network anomalies are, identify their causes, and classify them using various criteria.

What are Network Anomalies?

Network anomalies can be defined as instances where metrics deviate from a normal threshold or circumstances where events and traffic patterns deviate from normal behavior.
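To make the "metric deviates from a normal threshold" part of this definition concrete, here is a minimal sketch in Python. It assumes that "normal" is summarized by the mean and standard deviation of past samples and that a value more than k standard deviations away counts as a deviation. The function name, the sample CPU readings and the choice of k = 3 are illustrative assumptions, not a recommended configuration.

```python
from statistics import mean, stdev


def is_metric_anomalous(baseline, value, k=3.0):
    """Return True if value lies more than k standard deviations from the baseline mean."""
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        # A perfectly flat baseline: any different value is a deviation.
        return value != mu
    return abs(value - mu) > k * sigma


# Hypothetical CPU utilization samples (percent) taken during normal operation.
cpu_baseline = [22.0, 25.0, 24.0, 23.0, 26.0, 24.0, 25.0, 23.0]

print(is_metric_anomalous(cpu_baseline, 27.0))  # within the normal band
print(is_metric_anomalous(cpu_baseline, 91.0))  # far outside the normal band
```

A fixed band like this only covers the metric half of the definition; deviations in events and traffic patterns usually need more context than a single number.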

Why should we care about anomalies?

Anomalies in a network indicate a problem that can affect the robustness of the network’s operation. Whether the root cause of an anomaly is a security attack waiting to happen or a congested link that degrades user experience, anomalies need to be detected and mitigated promptly and properly.

Classification of network anomalies

Simply stating that anomalies are deviations from normal behavior is not very useful. Different types of anomalies exist in a network, and classifying them helps to build a robust anomaly detection model. While there is no formal classification of network anomalies, insights from the literature and from analysis of network traffic data and behavior provide an informal one. Broadly speaking, anomalies can be classified using the following criteria:

By where they occur

Anomalies in a network can occur in a network device or in network traffic.

An anomaly that occurs in a network device can be a hardware anomaly, which indicates hardware that is about to fail, or a software anomaly, such as a security vulnerability or a bug that causes the device to malfunction.
An anomaly that occurs in network traffic, on the other hand, can manifest itself as a change in traffic volume or a change in traffic behavior.

Based on the cause of the anomaly

Anomalies in a network are typically caused by either network failures or security threats/attacks.

Network connectivity failures can produce anomalies in network device metrics (e.g., high CPU usage), logs (e.g., login timeouts) or network traffic behavior (e.g., TCP handshake failures).
Network congestion and denial-of-service attacks can cause deviations from normal traffic volume (e.g., high traffic volume at a certain time of day, or high traffic volume to a single IP address in a short time).
Performance degradations can cause abnormal traffic behavior (e.g., round-trip time or TCP retransmissions above baseline).
Security threats can cause suspicious or malicious behaviors in network traffic (e.g., port scanning can result in small packets being sent repeatedly).
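As an illustration of the last cause, the sketch below flags sources that send many small packets to many different destination ports, which is one common way a port scan shows up in traffic. The packet representation (source address, destination port, size in bytes), the function name and both thresholds are assumptions made for this example; real detectors work on flow records and use more careful criteria.

```python
from collections import defaultdict


def find_port_scanners(packets, max_size=64, min_ports=20):
    """Return sources that sent small packets to at least min_ports distinct ports.

    packets is an iterable of (src_ip, dst_port, size_in_bytes) tuples.
    Both thresholds are illustrative assumptions, not tuned values.
    """
    ports_by_src = defaultdict(set)
    for src, dst_port, size in packets:
        if size <= max_size:
            ports_by_src[src].add(dst_port)
    return sorted(
        src for src, ports in ports_by_src.items() if len(ports) >= min_ports
    )


# Hypothetical capture: one host probes ports 1-100 with 60-byte packets,
# another sends large packets to a single HTTPS port.
packets = [("10.0.0.5", port, 60) for port in range(1, 101)]
packets += [("10.0.0.9", 443, 1400)] * 50

print(find_port_scanners(packets))
```

Note that the second host sends more packets than the first but is not flagged, because volume alone is not what characterizes a scan in this heuristic.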

Based on the characteristics of the anomaly

Not all anomalies are created equal. Based on the characteristics they exhibit, they fall into one of the following three types.

Point anomalies are deviations in a specific data instance. For example, an event where an unauthorized host tries to access a server is a point anomaly. While a few network anomalies are point anomalies, most are not.
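A point anomaly can be judged from a single data instance, as the following sketch shows. It assumes a hypothetical allowlist that maps each server to the client addresses permitted to reach it; the server name and addresses are made up for illustration.

```python
# Hypothetical allowlist: server name -> client IPs permitted to connect.
AUTHORIZED_CLIENTS = {
    "db01": {"10.0.1.10", "10.0.1.11"},
}


def is_point_anomaly(server, client_ip, authorized=AUTHORIZED_CLIENTS):
    """Return True if a single access event comes from a host not on the allowlist."""
    return client_ip not in authorized.get(server, set())


print(is_point_anomaly("db01", "10.0.1.10"))   # known client
print(is_point_anomaly("db01", "192.0.2.77"))  # unknown client
```

No history or neighboring events are needed to reach a verdict, which is what separates this type from the two that follow.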

Collective anomalies are deviations in a group of data instances, where an individual instance within the group is not anomalous on its own. For example, a related pattern of data instances can be an anomaly: more than 10 duplicate TCP ACKs after a successful handshake, or an alert from an IDS correlated with user authentication logs.
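The duplicate ACK example can be sketched as follows. The code treats an ACK as a duplicate when it repeats the acknowledgment number of the packet before it in the same flow, which is a simplification: real TCP analysis also looks at payload length, window size and flags. The function names and sample sequences are illustrative assumptions; the threshold of 10 comes from the example above.

```python
def count_duplicate_acks(ack_numbers):
    """Count ACKs that repeat the acknowledgment number of the previous packet."""
    duplicates = 0
    previous = None
    for ack in ack_numbers:
        if ack == previous:
            duplicates += 1
        previous = ack
    return duplicates


def is_collective_anomaly(ack_numbers, threshold=10):
    """Flag a flow only when the group of duplicate ACKs exceeds the threshold."""
    return count_duplicate_acks(ack_numbers) > threshold


# Hypothetical acknowledgment numbers observed in two flows.
healthy_flow = [1000, 2000, 3000, 3000, 4000]
lossy_flow = [1000] + [2000] * 12 + [3000]

print(is_collective_anomaly(healthy_flow))
print(is_collective_anomaly(lossy_flow))
```

A single duplicate ACK, as in the first flow, is ordinary TCP behavior; only the accumulated group in the second flow crosses the line.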

Context anomalies are data instances that are anomalies only in certain contexts. The context can be temporal (i.e., time-based), spatial (i.e., location-based), or behavioral (i.e., based on certain network traffic behavior patterns). Network traffic anomalies are almost always contextual, which makes them harder to detect by just looking at the data.

In a temporal context, data is an anomaly only in relation to time or time-related concepts such as weekdays, weekends, working hours and maintenance windows. For example, a surge in traffic at 2 AM, when a routine backup runs, is not an anomaly, but the same surge at 11 AM could be one.
In a spatial context, data is an anomaly only at certain locations or on certain devices. For example, a congested WAN link may not be an anomaly when traffic can be rerouted to redundant links, but the same congestion at a remote access site could be one.
In a network behavioral context, data is an anomaly only if it deviates from certain established behaviors. These behaviors can be defined based on network usage profiles for specific hosts or IP addresses, different applications (voice, streaming video), application protocols (HTTP, LDAP, etc.), and changes in the frequency of events or in trends.
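The temporal case lends itself to a short sketch: instead of one global baseline, keep a separate baseline per hour of day and compare each new sample against the baseline for its own hour. The data layout (hour, throughput in Mbps), the function names, the sample values and the k = 3 band are assumptions chosen to mirror the 2 AM backup example, not a production design.

```python
from statistics import mean, stdev


def build_hourly_baseline(samples):
    """Group (hour, mbps) samples by hour and return {hour: (mean, stdev)}."""
    by_hour = {}
    for hour, mbps in samples:
        by_hour.setdefault(hour, []).append(mbps)
    # stdev needs at least two samples, so hours with fewer are skipped.
    return {h: (mean(v), stdev(v)) for h, v in by_hour.items() if len(v) >= 2}


def is_contextual_anomaly(baseline, hour, mbps, k=3.0):
    """Compare a sample with the baseline for its own hour of day."""
    if hour not in baseline:
        return False  # no context learned for this hour, so no verdict
    mu, sigma = baseline[hour]
    return abs(mbps - mu) > k * max(sigma, 1e-9)


# Hypothetical history: heavy backup traffic at 2 AM, light traffic at 11 AM.
history = [
    (2, 900), (2, 950), (2, 920), (2, 930),
    (11, 100), (11, 120), (11, 110), (11, 105),
]
baseline = build_hourly_baseline(history)

print(is_contextual_anomaly(baseline, 2, 940))   # typical for the backup window
print(is_contextual_anomaly(baseline, 11, 940))  # same volume, different context
```

The same 940 Mbps reading yields opposite verdicts depending on the hour, which is exactly why a context-free threshold struggles with this kind of data. Spatial and behavioral contexts could be handled the same way by keying the baseline on site, device, host or protocol instead of hour.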

Why should you classify network anomalies?

Many systems produce a multitude of data and alerts that could indicate an anomaly. Without a system to classify network anomalies, detecting them becomes much harder and results in numerous alerts, most of which will be false positives. This routinely causes network engineers and security analysts to either ignore the alerts or search for a needle in a haystack.

In the next blog post, we will address the challenges of anomaly detection, its applications and implementations in networking and security, and commonly used statistical and machine learning-based detection methods.