Feature Engineering at the intersection of Network Data and Machine Learning

Let’s reiterate a well-known fact: network data is noisy.

Networking and security experts use the information available in network data to troubleshoot issues or detect threats and attacks. From experience, they know which data is valuable. In other words, they know how to filter the noise.

Let’s take an example: in a packet capture file, each packet has an IP version number and a checksum. These have little value, because the IP version field takes only two fixed values (4 or 6), while the checksum is effectively random. On the other hand, TCP flags provide valuable information.

A machine doing network troubleshooting or threat detection doesn’t have this knowledge. So, we must communicate our own understanding of which information is important for performing a specific task.

How do we do that?

By doing Feature Engineering.

What is Feature Engineering?

Feature engineering is the process of extracting relevant information from existing data.

In this post, we will talk about the different features extracted from network data, the tasks and techniques of feature engineering, and why feature engineering is essential to the accuracy of a machine learning model.

Types of features

Before we jump into feature engineering, let’s first define features in the context of network data.

Features are the set of fields used to train a machine learning model for a specific use case (e.g., anomaly detection, traffic classification). These can be individual fields (e.g., number of packets) or relationships between them (e.g., bytes per packet).

Features are categorical if they take discrete values (e.g., protocol) or numerical if they take continuous values (e.g., bytes).

Features extracted from network data can be classified as follows:

Basic Features

These are extracted directly from individual packet headers or event logs without any changes, for example source/destination IP, protocol (TCP/UDP), source/destination port, and TCP or IP flags.

Temporal Features

Each flow, packet, and log entry usually has a timestamp assigned to it. These are irregularly timestamped events: they are stamped based on when the flow or event occurred.

This data can be transformed into regular time series data by slicing or grouping it into fixed time periods. Network traffic is less affected by the exact date and time of day (e.g., 9:20 AM) or day of the week (Tuesday), so it is better to convert this data into a regular time series rather than use the timestamps directly.

For example, we can take NetFlow traffic collected over 24 hours, divide it into 1-minute intervals, and calculate the number of TCP packets transmitted each minute. This results in a time series with 1,440 values (there are 1,440 minutes in a 24-hour period). This is an example of a univariate time series. We can also generate a multivariate time series that tracks multiple values per interval, such as TCP packets, UDP packets, and total bytes.
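
To make the slicing concrete, here is a minimal sketch in pandas; the flow records and column names (`timestamp`, `protocol`, `packets`) are hypothetical stand-ins for whatever schema your collector produces.

```python
import pandas as pd

# Hypothetical flow records: irregularly timestamped, one row per flow.
flows = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2024-01-01 00:00:07", "2024-01-01 00:00:41",
         "2024-01-01 00:01:13", "2024-01-01 00:02:58"]),
    "protocol": ["TCP", "UDP", "TCP", "TCP"],
    "packets": [10, 2, 7, 31],
})

# Univariate time series: TCP packets per 1-minute interval.
tcp_per_min = (flows[flows["protocol"] == "TCP"]
               .set_index("timestamp")
               .resample("1min")["packets"].sum())

# Multivariate time series: per-minute packet counts, one column per protocol.
multi = (flows.set_index("timestamp")
         .groupby([pd.Grouper(freq="1min"), "protocol"])["packets"]
         .sum()
         .unstack(fill_value=0))
print(tcp_per_min, multi, sep="\n\n")
```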

Time series data also exhibits the following patterns, which can be inferred after converting to a regular time series (a decomposition sketch follows the list):

  • Trend (whether traffic is increasing or decreasing over time, and at what rate)
  • Seasonality (whether the traffic shows time of day or weekday/weekend variations)
  • Residual component (random fluctuations or noise).
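
As a sketch of such a decomposition, the snippet below applies `seasonal_decompose` from statsmodels to a synthetic multi-day per-minute packet-count series; the period of 1,440 encodes the assumption that one seasonal cycle spans a day.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

# Synthetic stand-in: three days of per-minute packet counts with a
# slow upward trend, a daily cycle, and random noise.
idx = pd.date_range("2024-01-01", periods=3 * 1440, freq="1min")
t = np.arange(len(idx))
series = pd.Series(
    1000
    + 0.05 * t                                        # trend
    + 200 * np.sin(2 * np.pi * t / 1440)              # daily seasonality
    + np.random.default_rng(0).normal(0, 30, len(idx)),  # residual noise
    index=idx,
)

# period=1440 tells the decomposition that one cycle is a day of minutes.
result = seasonal_decompose(series, model="additive", period=1440)
print(result.trend.dropna().iloc[:3])
print(result.seasonal.iloc[:3])
print(result.resid.dropna().iloc[:3])
```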

Connection or Session features

A connection or session is a two-way communication between a source and a destination (identified by IP address, port, and transport protocol).

Different features can be extracted for each connection (e.g., total bytes between a specific source and destination pair) or across multiple connections (e.g., the number of unique source or destination IPs).
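
A minimal sketch of both kinds of aggregates, again over hypothetical flow records:

```python
import pandas as pd

# Hypothetical flow records keyed by source, destination, and protocol.
flows = pd.DataFrame({
    "src_ip":   ["10.0.0.1", "10.0.0.1", "10.0.0.2", "10.0.0.1"],
    "dst_ip":   ["8.8.8.8",  "8.8.8.8",  "1.1.1.1",  "1.1.1.1"],
    "protocol": ["TCP",      "TCP",      "UDP",      "TCP"],
    "bytes":    [1500,       800,        120,        3000],
})

# Per-connection feature: total bytes for each (src, dst, protocol) tuple.
per_conn = (flows.groupby(["src_ip", "dst_ip", "protocol"])["bytes"]
            .sum().rename("total_bytes"))

# Cross-connection feature: unique destination IPs contacted per source.
fanout = flows.groupby("src_ip")["dst_ip"].nunique().rename("unique_dsts")
print(per_conn, fanout, sep="\n\n")
```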

Behavioral features

Certain behaviors exhibited by network traffic can indicate a specific issue (e.g., botnet activity, high latency, lateral movement).

These behavioral features can be extracted along multiple axes; some of them are listed below, with a short sketch after the list:

  • Traffic size (e.g., dominant packet sizes, bytes-per-packet ratio)
  • Traffic volume (e.g., packets/bytes per flow, number of bytes transferred in a specific direction)
  • Communication behaviors (e.g., the ratio of incoming to outgoing packets, the duration of a flow)
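
Here is a small sketch of computing such ratios from hypothetical per-host counters; the column names are illustrative assumptions.

```python
import pandas as pd

# Hypothetical per-host counters with directional packet counts.
hosts = pd.DataFrame({
    "host":        ["10.0.0.1", "10.0.0.2", "10.0.0.3"],
    "pkts_in":     [120, 3, 450],
    "pkts_out":    [100, 90, 440],
    "bytes_total": [90_000, 4_000, 620_000],
    "duration_s":  [12.5, 300.0, 8.2],
})

# Traffic size axis: bytes-per-packet ratio.
hosts["bytes_per_pkt"] = hosts["bytes_total"] / (hosts["pkts_in"] + hosts["pkts_out"])

# Communication behavior axis: ratio of incoming to outgoing packets.
hosts["in_out_ratio"] = hosts["pkts_in"] / hosts["pkts_out"]

# A host with a very low in/out ratio over a long duration may hint at
# beaconing or scanning behavior worth investigating.
print(hosts[["host", "bytes_per_pkt", "in_out_ratio", "duration_s"]])
```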

Protocol-specific features

Each protocol has a variety of data that identifies its unique characteristics. Some examples of protocol-specific features include ICMP type, DNS record type, and HTTP response code. These features can be used to troubleshoot protocol-specific issues (e.g., failing HTTPS connections) or identify protocol-specific cyber threats (e.g., DNS spoofing).
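
As an illustration, the sketch below counts DNS query types in a capture using scapy; the file name `capture.pcap` is a hypothetical placeholder, and `dnsqtypes` is the code-to-name mapping from scapy's DNS layer.

```python
from collections import Counter
from scapy.all import rdpcap, DNSQR, dnsqtypes

# "capture.pcap" is a hypothetical file name used for illustration.
packets = rdpcap("capture.pcap")

qtype_counts = Counter()
for pkt in packets:
    if pkt.haslayer(DNSQR):
        # qtype is a numeric code; dnsqtypes maps it to names
        # such as "A", "AAAA", or "TXT".
        qtype_counts[dnsqtypes.get(pkt[DNSQR].qtype, "other")] += 1

# An unusually high share of TXT or NULL queries can hint at DNS tunneling.
print(qtype_counts.most_common())
```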

Statistical features

Statistical measures like mean, median, and standard deviation can be calculated for various features (e.g., packet sizes, round-trip time) and provide additional characteristics for distinguishing normal traffic from abnormal traffic.

We are not including payload or content-based features since most traffic is encrypted.

Feature Engineering Tasks

Identifying which features we need for a use case or task is the first step. Feature engineering collectively includes several other steps to make features suitable as input to a machine learning model.

Feature extraction

Feature extraction calculates statistical relationships among the various features. This process reduces the number of features in a dataset by creating new features from the existing ones. This new set of features should summarize most of the information contained in the original set.

For example, suppose a dataset contains the following features: source port, destination port, and packet size.

Through statistical feature extraction, we can summarize this data using statistical values. For example, a single row can contain the min, max, and standard deviation of packet sizes for each source/destination port combination. This removes the redundant individual packet size values, thereby reducing the size of the dataset.
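
A sketch of exactly that summary, over hypothetical per-packet records:

```python
import pandas as pd

# Hypothetical per-packet records.
pkts = pd.DataFrame({
    "src_port": [443, 443, 443, 53, 53],
    "dst_port": [51000, 51000, 51000, 42000, 42000],
    "pkt_size": [1500, 1400, 60, 80, 120],
})

# Collapse per-packet rows into one summary row per port pair.
summary = (pkts.groupby(["src_port", "dst_port"])["pkt_size"]
           .agg(["min", "max", "std"])
           .reset_index())
print(summary)
```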

Feature extraction can be done using simple statistics as shown above or through unsupervised algorithms such as Principal Component Analysis (PCA).
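
And a sketch of the PCA route with scikit-learn, on a synthetic feature matrix standing in for per-flow features:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric feature matrix: rows are flows, columns are
# features such as packets, bytes, duration, bytes-per-packet.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))

# PCA is scale-sensitive, so standardize first.
X_scaled = StandardScaler().fit_transform(X)

# Keep enough components to explain 95% of the variance.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)
print(X_reduced.shape, pca.explained_variance_ratio_)
```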

Feature selection

Unlike feature extraction, feature selection reduces the number of features by keeping a subset of the original features while removing ones that are irrelevant or redundant.

Feature selection algorithms include wrapper, filter, and embedded methods.
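
Below is a sketch of a filter method and an embedded method with scikit-learn, using a synthetic labeled dataset as a stand-in for flow features:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Synthetic stand-in for labeled flow features (normal vs. anomalous).
X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=5, random_state=0)

# Filter method: score each feature independently, keep the top 5.
X_filtered = SelectKBest(mutual_info_classif, k=5).fit_transform(X, y)

# Embedded method: importances learned as a side effect of training.
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
top = np.argsort(rf.feature_importances_)[::-1][:5]
print(X_filtered.shape, top)
```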

Feature transformation

Feature transformation changes feature values based on some criterion. For example, a timestamp can be transformed from UNIX epoch into the hour of the day. Transformation is used when you need a different representation of the feature values.
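
A minimal sketch of that epoch-to-hour transformation in pandas:

```python
import pandas as pd

# Hypothetical UNIX-epoch timestamps (in seconds).
df = pd.DataFrame({"epoch": [1700000000, 1700040000, 1700080000]})

# Transform the raw epoch into an hour-of-day representation.
df["hour"] = pd.to_datetime(df["epoch"], unit="s").dt.hour
print(df)
```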

Feature scaling

Feature scaling converts feature values to a common scale so that each feature contributes a comparable numerical weight. For example, the number of packets may range from 1 to 1,000 while the number of bytes ranges from 100 to 100,000. Scaling can map both into the range 0 to 1.

Standardization and min-max normalization are two methods used in feature scaling.
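
A sketch of both, using the packet/byte ranges from the example above:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Two features on very different scales: packets (1-1,000) and bytes (100-100K).
X = np.array([[1, 100], [500, 50_000], [1000, 100_000]], dtype=float)

# Min-max normalization: rescales each column to the [0, 1] range.
print(MinMaxScaler().fit_transform(X))

# Standardization: rescales each column to zero mean and unit variance.
print(StandardScaler().fit_transform(X))
```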

Removing Outliers

Outliers in data can adversely affect the performance of machine learning models, so removing them is another feature engineering task. For example, a round-trip time (RTT) of 1 second might be an outlier in a sample where most values range from 0.1 to 0.3 seconds. Detecting outliers is not always straightforward: some are easy to detect, while others are more contextual.

Statistical methods such as the Z-score, and unsupervised methods such as Isolation Forest, Local Outlier Factor, One-Class SVM, and DBSCAN, are commonly used for identifying outliers.
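
A sketch of one statistical and one unsupervised approach, using the RTT example above (the 2-standard-deviation threshold is an illustrative choice):

```python
import numpy as np
from scipy import stats
from sklearn.ensemble import IsolationForest

# Hypothetical RTT samples in seconds; most fall between 0.1 and 0.3.
rtt = np.array([0.12, 0.15, 0.2, 0.25, 0.18, 0.3, 1.0])

# Statistical method: flag points more than 2 standard deviations out.
z = np.abs(stats.zscore(rtt))
print(rtt[z > 2])

# Unsupervised method: Isolation Forest labels outliers as -1.
labels = IsolationForest(contamination=0.1, random_state=0).fit_predict(
    rtt.reshape(-1, 1))
print(rtt[labels == -1])
```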

Imbalanced data

There are far more instances of normal network traffic than anomalous traffic, which causes class imbalance between normal and abnormal data. Two common ways of dealing with imbalanced data are downsampling the normal data and upsampling the anomalous data. For example, suppose a NetFlow dataset has 1 anomalous flow and 200 normal flows. Downsampling the normal flows by a factor of 10 improves the anomalous-to-normal ratio from 0.5% (1/200) to 5% (1/20).
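
A sketch of that downsampling arithmetic on a synthetic dataset matching the example:

```python
import pandas as pd

# Hypothetical labeled flows: 200 normal, 1 anomalous.
flows = pd.DataFrame({
    "bytes": range(201),
    "label": ["normal"] * 200 + ["anomalous"],
})

normal = flows[flows["label"] == "normal"]
anomalous = flows[flows["label"] == "anomalous"]

# Downsample the majority class by a factor of 10 (200 -> 20 rows).
normal_down = normal.sample(n=len(normal) // 10, random_state=0)
balanced = pd.concat([normal_down, anomalous])
print(balanced["label"].value_counts())  # 20 normal, 1 anomalous
```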

Feature Engineering Process

Now that we have identified the features and the various feature engineering tasks, how do we go about doing feature engineering?

One method is to try many feature combinations at random and see which set of features gives the best machine learning model performance for a given use case. This requires a lot of computational resources but might uncover hidden relationships that we would otherwise miss.

The other method is to define feature sets for specific use cases using expert domain knowledge. By doing this you translate your domain expertise and workflows into a language that machine learning algorithms understand. However, this may embed the biases of the expert too closely, thereby minimizing the advantages provided by machine learning.

A hybrid strategy that combines domain-expert-based feature engineering with randomly sampled candidate features lowers the computational overhead while preserving the advantages of machine learning.
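
A sketch of that hybrid strategy on synthetic data: a fixed expert-chosen core is always kept, a few extra candidate features are sampled at random, and the combination with the best cross-validation score wins. The feature indices and search budget here are illustrative assumptions.

```python
import random
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for a labeled flow dataset with 12 features.
X, y = make_classification(n_samples=500, n_features=12,
                           n_informative=4, random_state=0)

expert_features = [0, 1, 2]       # always kept (domain knowledge)
candidates = list(range(3, 12))   # randomly explored extras
rng = random.Random(0)

best_score, best_set = 0.0, expert_features
for _ in range(10):               # a small random search budget
    extras = rng.sample(candidates, k=3)
    cols = expert_features + extras
    score = cross_val_score(RandomForestClassifier(n_estimators=50,
                                                   random_state=0),
                            X[:, cols], y, cv=3).mean()
    if score > best_score:
        best_score, best_set = score, cols
print(best_set, round(best_score, 3))
```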

Why do we need feature engineering?

Until now we have talked about features, feature engineering tasks, and the overall process. But why do we need feature engineering in the first place?

There are several reasons to do feature engineering; here are a few of them.

  • Reduce computational and storage resources – Processing large volumes of data and training machine learning models require significant compute and storage resources. The less data we process, the better.
  • Minimize false positives – When features are carefully extracted for specific use cases using expert knowledge, the machine learning model predictions will be more accurate.
  • Improve model performance – Even the most advanced machine learning algorithms will underperform if provided with irrelevant features.

To summarize, feature engineering is essential for training machine learning models on noisy network data. It helps embed domain knowledge and avoids processing large volumes of irrelevant network data.
