It starts with the right data – Categorizing Network Data for Machine Learning

In a theoretical world, you are handed a cleaned and preprocessed dataset that you can use directly to build Machine Learning models.

However, in a practical world, collecting, understanding, and processing all the data coming from the network infrastructure and applications is the most important step in building your Network Analytics solution.

The information we can get from a network differs based on where the data is collected (e.g., different devices, plane of operation), how it is collected (e.g., data collection methods), when it is collected (e.g., network conditions), and what is collected (e.g., flows, logs).

Furthermore, each use case – network attack detection, network fault and performance prediction, service assurance – requires a different subset of network data.

In this post, we will provide a high-level categorization of network data and will do a deep dive into each of these categories in subsequent blog posts.

Network Data Categorization

We propose seven ways of categorizing network data. When building a Network Analytics solution using Machine Learning, it is often necessary to combine multiple categories of data to get better results.
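As a rough illustration of combining categories, the sketch below joins per-minute flow statistics with device metrics into one feature row per device and minute. The field names (`device`, `minute`, `flow_count`, `cpu_pct`) and the sample values are hypothetical and chosen only for this example.

```python
# Hypothetical per-minute flow statistics (flow-level data).
flow_stats = [
    {"device": "r1", "minute": "12:00", "flow_count": 120, "bytes": 9_000_000},
    {"device": "r1", "minute": "12:01", "flow_count": 450, "bytes": 31_000_000},
]

# Hypothetical per-minute device metrics (device-level data).
device_metrics = [
    {"device": "r1", "minute": "12:00", "cpu_pct": 22.0},
    {"device": "r1", "minute": "12:01", "cpu_pct": 87.5},
]


def join_on_device_minute(flow_rows, metric_rows):
    """Merge flow rows with the metric row that shares (device, minute)."""
    metrics_by_key = {(m["device"], m["minute"]): m for m in metric_rows}
    joined = []
    for row in flow_rows:
        metric = metrics_by_key.get((row["device"], row["minute"]))
        if metric is not None:  # keep only rows present in both sources
            joined.append({**row, **metric})
    return joined


features = join_on_device_minute(flow_stats, device_metrics)
```

Each joined row now carries both traffic and device-state columns, which is the kind of combined input a model could be trained on.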

Granularity

Network traffic data can be collected at different levels of granularity, each providing a different level of visibility. Collecting data at the finest granularity is not needed for all use cases.

Packet level, which provides the finest level of granularity and can include the full payload or just the packet headers. Since most traffic today is encrypted, payload inspection is often not possible without decryption.
Flow level, which includes NetFlow, IPFIX and sFlow in addition to vendor-specific flow protocols. A flow identifies a session and is made up of the packets sharing the same 5-tuple (source and destination IP address, source and destination port, transport protocol). Flow data provides visibility into traffic volume by IP address, protocol, etc.
Metrics level, collected through SNMP or telemetry, which provides interface- and link-level traffic statistics.
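To make the relationship between packet-level and flow-level data concrete, here is a minimal sketch that groups packets by their 5-tuple and counts packets and bytes per flow. The `Packet` record and its field names are assumptions made for this example; real exporters such as NetFlow or IPFIX also handle timeouts, direction and many more fields.

```python
from collections import defaultdict
from typing import NamedTuple


class Packet(NamedTuple):
    """Hypothetical, already-parsed packet header record."""
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: str
    size_bytes: int


def aggregate_flows(packets):
    """Group packets by 5-tuple and count packets and bytes per flow."""
    flows = defaultdict(lambda: {"packets": 0, "bytes": 0})
    for pkt in packets:
        # The 5-tuple is the flow key; each direction is a separate flow here.
        key = (pkt.src_ip, pkt.dst_ip, pkt.src_port, pkt.dst_port, pkt.protocol)
        flows[key]["packets"] += 1
        flows[key]["bytes"] += pkt.size_bytes
    return dict(flows)


packets = [
    Packet("10.0.0.1", "10.0.0.2", 51000, 443, "TCP", 1500),
    Packet("10.0.0.1", "10.0.0.2", 51000, 443, "TCP", 400),
    Packet("10.0.0.3", "10.0.0.2", 53000, 53, "UDP", 80),
]
flows = aggregate_flows(packets)
```

The three packets collapse into two flow records, which shows why flow data is much smaller than packet data but loses per-packet detail.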

Domain

There are three distinct domains that make up a network: a device, a service running on the device (we are not talking about end user applications here), and the network that spans multiple devices and services.

Network level data is obtained from network traffic (as indicated in the previous section). The other type of network level data is network topology, which consists of nodes, links and information about how they are connected.
Service level data is obtained through different server logs (webserver, DNS server, etc.). These logs capture events that are defined by the service. For example, a webserver such as Apache can be configured to log all the requests made to the server.
Device level data consists of the following – metrics on the device’s state (e.g., CPU utilization, memory usage), logs provided by firewalls, Endpoint Detection and Response (EDR) systems and host-based Intrusion Detection Systems (IDS) (e.g., information about user logins, URLs), and configuration data for each device (e.g., interfaces, VLANs, QoS, ACLs).

A single event can look very different depending on which domain the data is coming from. Take an event in which a user accesses a website. An Apache webserver access log might contain the IP address of the client, the HTTP method used and the HTTP response code. A packet capture of the connection between the client and the server will contain information about the TCP handshake and HTTP and/or TLS protocol information. Logs from the client and server devices can capture user login information.
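The service-level view of such an event can be sketched by parsing one access log line. This example assumes the server writes the Common Log Format; the regular expression and the function name `parse_access_log_line` are illustrative, and a server configured with a custom log format would need a different pattern.

```python
import re

# Pattern for the Common Log Format (assumed log format for this sketch).
CLF_PATTERN = re.compile(
    r'(?P<client_ip>\S+) \S+ (?P<user>\S+) \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_access_log_line(line):
    """Return the fields of one access log line, or None if it does not match."""
    match = CLF_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    # A "-" size means no response body was sent.
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record


sample = '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326'
record = parse_access_log_line(sample)
```

The resulting record exposes the client IP address, HTTP method and response code, but nothing about the TCP handshake or TLS negotiation that a packet capture of the same event would show.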

Different planes

There are three planes of operation in a network. These represent data that serve different purposes in the network infrastructure.

Management plane is used to communicate with the devices to configure and monitor them. Examples of protocols in this plane include SNMP, SSH, Secure FTP.
Control plane is where network devices communicate with each other to set up the channels or paths for data transmission over the network. Examples include Layer 2 protocols such as Spanning Tree Protocol (STP), ARP, LACP, Layer 3 protocols such as IGMP, ICMP and dynamic IP routing protocols such as OSPF, BGP.
Data plane, which is the forwarding plane, is responsible for the switching of packets through the switches and routers. The data from this plane can provide us with information about the volume of traffic and applications. Quality of service (QoS) and Access Control Lists (ACL) features are typically implemented in the data plane.
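One simple way to use this categorization is to tag each record with its plane before feature extraction. The lookup below is built only from the protocol examples listed above; treating every other protocol as data plane traffic is a simplifying assumption of this sketch, and in practice the plane also depends on whether the traffic is addressed to the device itself or forwarded through it.

```python
# Protocol-to-plane lookup built from the examples named in the text.
# "SFTP" is used here as a short key for Secure FTP (an assumption).
PLANE_BY_PROTOCOL = {
    "SNMP": "management",
    "SSH": "management",
    "SFTP": "management",
    "STP": "control",
    "ARP": "control",
    "LACP": "control",
    "IGMP": "control",
    "ICMP": "control",
    "OSPF": "control",
    "BGP": "control",
}


def plane_for(protocol, default="data"):
    """Return the plane for a protocol name; unknown protocols get `default`."""
    return PLANE_BY_PROTOCOL.get(protocol.upper(), default)


# Hypothetical records, each carrying only a protocol name.
records = [{"protocol": "bgp"}, {"protocol": "SNMP"}, {"protocol": "HTTPS"}]
tagged = [{**r, "plane": plane_for(r["protocol"])} for r in records]
```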

Collection methods

Network data collection can be categorized into two methods.

Passive data collection – Data is collected by passively observing the network to capture network traffic and associated statistics. No alterations are made to the network traffic.
Active data collection – Data is collected by actively generating packet streams or other synthetic data (non-user data). Probing and scanning tools such as ping, traceroute, MTR and Nmap are used to generate such data.

It is important to note that most of the network data collected for analytics uses passive data collection methods. However, in certain cases, data generated through active scanning is helpful.
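A minimal sketch of active data collection is shown below. Because ICMP-based tools such as ping usually need raw socket privileges, this example swaps in a TCP connect probe and measures how long the connection takes to establish. The function name `tcp_connect_time` is hypothetical, and the probe targets a listener opened on the loopback interface so that the example does not touch any real network.

```python
import socket
import time


def tcp_connect_time(host, port, timeout=2.0):
    """Return the TCP connect time in seconds, or None if the connection fails."""
    start = time.perf_counter()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.perf_counter() - start
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return None


# Open a local listener so the probe has something to connect to.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))  # port 0 lets the OS choose a free port
listener.listen(1)
port = listener.getsockname()[1]

connect_seconds = tcp_connect_time("127.0.0.1", port)
listener.close()
```

Unlike passive collection, the probe itself adds traffic to the network, so probe frequency and targets need to be chosen with care.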

Collection nodes

The network consists of various types of devices, both physical and virtual. The general data provided by any device can fall into one of the following categories – Flow based, Packet based, Logs and Metrics.

The data, however, varies based on the following:

Device function – Whether the device is a forwarding device (routers, switches), a security device (firewalls, IDS) or an end device (servers/hosts). In a cloud environment, you could get different data based on the different cloud functions or services (compute, networking, storage, etc.).
Vendor – While there is some standardization (SNMP MIBs and flow templates), the types of data and the level of detail vary between vendors for the same type of device (e.g., Palo Alto firewall vs F5) or cloud computing service.
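Because the same kind of record arrives with different field names from different vendors, a common preprocessing step is to map each vendor's fields onto one shared schema. The sketch below illustrates the idea; `vendor_a`, `vendor_b` and all of their field names are invented for this example and do not correspond to any real product's log format.

```python
# Hypothetical per-vendor mappings: raw field name -> common field name.
FIELD_MAPS = {
    "vendor_a": {"src": "src_ip", "dst": "dst_ip", "bytes_sent": "bytes"},
    "vendor_b": {
        "SourceAddress": "src_ip",
        "DestinationAddress": "dst_ip",
        "Octets": "bytes",
    },
}


def normalize(record, vendor):
    """Rename a vendor-specific record's fields to the common schema."""
    mapping = FIELD_MAPS[vendor]
    # Fields that are not in the mapping are dropped.
    return {common: record[raw] for raw, common in mapping.items() if raw in record}


row_a = normalize({"src": "10.0.0.1", "dst": "10.0.0.2", "bytes_sent": 1200}, "vendor_a")
row_b = normalize(
    {"SourceAddress": "10.0.0.1", "DestinationAddress": "10.0.0.2", "Octets": 1200},
    "vendor_b",
)
```

After normalization, both records have the same keys and can be fed into the same feature pipeline.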

OSI Layer

When analyzing data from networking environments, it is necessary to understand the level of abstraction. The 7-layer OSI model and the 5-layer TCP/IP model provide abstractions for the network functionality at each layer.

Each layer has its own set of protocols, and a packet capture tool such as Wireshark can provide insights into the data available at the different layers of the stack.

Data Link Layer – This is typically Ethernet level data such as source and destination MAC addresses. The data from this layer has limited use when it comes to analytics.
Network Layer – This has the IP layer information such as source and destination IP address, IP flags and IP fragmentation.
Transport Layer – This is either TCP or UDP level information and contains port numbers and, for TCP, handshake information, TCP flags, TCP window sizes and retransmissions.
Application Layer – The application layer contains information specific to application layer protocols such as HTTP and DNS.
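The per-layer fields listed above can be pulled out of a raw frame with Python's `struct` module. The sketch below is a minimal illustration that handles only an Ethernet frame carrying IPv4 and TCP, and it parses a hand-built sample frame with zeroed checksums rather than real captured traffic. The function name `parse_frame` and the output field names are assumptions.

```python
import socket
import struct


def format_mac(raw):
    """Render 6 raw bytes as a colon-separated MAC address."""
    return ":".join(f"{b:02x}" for b in raw)


def parse_frame(frame):
    """Extract a few per-layer fields from an Ethernet/IPv4/TCP frame."""
    # Data link layer: destination MAC, source MAC, EtherType (14 bytes).
    dst_mac, src_mac, ethertype = struct.unpack("!6s6sH", frame[:14])
    fields = {"dst_mac": format_mac(dst_mac), "src_mac": format_mac(src_mac)}
    if ethertype != 0x0800:  # only IPv4 is handled in this sketch
        return fields
    # Network layer: header length, protocol number and IP addresses.
    ip = frame[14:]
    ip_header_len = (ip[0] & 0x0F) * 4
    fields["ip_protocol"] = ip[9]
    fields["src_ip"] = socket.inet_ntoa(ip[12:16])
    fields["dst_ip"] = socket.inet_ntoa(ip[16:20])
    if fields["ip_protocol"] != 6:  # only TCP is handled in this sketch
        return fields
    # Transport layer: ports and two of the TCP flags.
    tcp = ip[ip_header_len:]
    fields["src_port"], fields["dst_port"] = struct.unpack("!HH", tcp[:4])
    fields["syn"] = bool(tcp[13] & 0x02)
    fields["ack"] = bool(tcp[13] & 0x10)
    return fields


# Hand-built sample frame: a TCP SYN from 10.0.0.1:51000 to 10.0.0.2:443.
eth = bytes.fromhex("aabbccddeeff") + bytes.fromhex("112233445566") + struct.pack("!H", 0x0800)
ip_header = struct.pack(
    "!BBHHHBBH4s4s", 0x45, 0, 40, 0, 0, 64, 6, 0,
    socket.inet_aton("10.0.0.1"), socket.inet_aton("10.0.0.2"),
)
tcp_header = struct.pack("!HHIIBBHHH", 51000, 443, 0, 0, 5 << 4, 0x02, 65535, 0, 0)
fields = parse_frame(eth + ip_header + tcp_header)
```

For real captures, a dedicated parsing library or Wireshark's dissectors would cover the many protocol variations that this sketch ignores.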

Other types of data

While the above encompasses most of the network data, there are other types of network data that can be used in building or augmenting machine learning models.

Performance Monitoring data from routers and other systems, which measures quality, performance, and reliability and is used to track the SLA agreed between the user and the network service provider.
Fault Monitoring data from routers and other systems, which provides information on network faults, warnings and other alerts.