How to implement AI/ML in Networking – 5 Key Considerations

5 key considerations for AI/ML implementation

AI/ML holds tremendous potential to make network operations more efficient. However, very few AI/ML implementations actually make it to production environments.

The most common reasons for failed implementations are a lack of clarity in the problem definition, building or buying a solution that is ill-suited to the problem at hand, and not getting the team on board.

If you are part of the network operations team and you have decided to use AI/ML in your network operations, how can you make sure your solution is implemented properly, is aligned with business goals and can be sustained over a long time?

In this blog post, we identify the five key considerations that determine whether AI/ML is successful in your networking environment.

Catalog your data

It is a well-known fact that AI/ML algorithms need quality data. Before you embark on implementing AI/ML analytics, it is important to understand your data.

This involves a 3-step approach:

Know your network data

Networks consist of different types of infrastructure technologies, and they generate different types of data.

  • What types of data does your infrastructure provide (logs, telemetry, traffic, configuration, etc.)?
  • What are the data formats and are they suitable for AI/ML based analytics?

Know how data is collected and stored

In most network operation environments, data is extracted from network devices and sent to point-based monitoring and analysis tools.

  • How much data is collected for each type?
  • At what interval is this data collected?
  • Is the data collected in the raw form or as aggregate statistics?
  • Where is the data stored in the short and long term?

Know your data policy

If your organization doesn’t have a data policy, it is useful to create one.  

  • How long is the data retained?
  • Who has access to this data?
  • How can you retrieve it?
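The three sets of questions above can be captured in a simple catalog. The sketch below is one possible shape for it, not a prescribed schema: the `DataSource` fields, the `catalog_gaps` helper and the sample entries are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DataSource:
    """One catalog entry; every field name here is illustrative."""
    name: str               # e.g. "core-router-syslog"
    data_type: str          # logs, telemetry, traffic, configuration
    data_format: str        # e.g. "syslog", "json", "ipfix"
    interval_seconds: int   # collection interval; 0 for event-driven data
    is_raw: bool            # True for raw records, False for aggregates
    storage: str            # where the data is stored
    retention_days: int     # how long the data is retained
    readers: List[str] = field(default_factory=list)  # who has access


def catalog_gaps(source: DataSource) -> List[str]:
    """Return the catalog questions this entry leaves unanswered."""
    gaps = []
    if not source.data_format:
        gaps.append("data format unknown")
    if not source.storage:
        gaps.append("storage location unknown")
    if source.retention_days <= 0:
        gaps.append("retention period undefined")
    if not source.readers:
        gaps.append("no access list defined")
    return gaps


# Hypothetical entries: one fully described, one with open questions.
catalog = [
    DataSource("edge-flow-records", "traffic", "ipfix", 60, True,
               "flow-collector", 30, ["netops"]),
    DataSource("core-router-syslog", "logs", "syslog", 0, True,
               "", 0, []),
]

for source in catalog:
    print(source.name, catalog_gaps(source))
```

An entry with an empty gap list answers all of the questions above. Anything else is a reminder of what still needs to be found out before the data is fed to AI/ML analytics.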

Know your use case

Once you have understood your data, the next step is to identify the problem you are trying to solve. No general-purpose AI/ML solution covers every use case, and building one is not practical, so this step is critical in determining what to implement.

The following use cases, which are by no means exhaustive, can benefit greatly from AI/ML analytics.

Anomaly Detection

An enterprise or service provider’s network can generate a very large number of network flows and performance measurements every minute. Network operation teams spend countless hours searching for network anomalies and identifying the root cause. AI/ML is ideally suited to analyzing such large amounts of data to identify network anomalies.
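To make the idea concrete, here is a minimal sketch of one of the simplest statistical approaches: flagging a sample that deviates strongly from a trailing window of recent samples. It is an illustration only, not a production detector. The function name, the window size, the threshold and the latency values are all assumptions.

```python
from statistics import mean, stdev


def find_anomalies(values, window=5, threshold=3.0):
    """Return indexes whose value deviates from the trailing window
    by more than `threshold` standard deviations."""
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Flat history: any change at all counts as a deviation.
            if values[i] != mu:
                anomalies.append(i)
            continue
        if abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical per-minute latency samples in milliseconds.
latency_ms = [20, 21, 19, 20, 22, 21, 20, 95, 21, 20]
print(find_anomalies(latency_ms))  # prints [7]
```

Only the 95 ms spike at index 7 is flagged. A real deployment would also have to account for seasonality and for the way a spike inflates the window that follows it.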

Capacity Planning/Optimization

A network performance monitoring system might collect metrics such as latency, packet loss and throughput for every node and link in the network. Traffic patterns also follow time-of-day and day-of-week variations. Because this data is difficult to analyze, capacity planning typically involves an ad hoc assessment of bandwidth and device needs. AI/ML solutions can analyze this time series data to uncover a more nuanced view of traffic patterns and performance trends. This view can be used to optimize current traffic flow and plan for future capacity.
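As a rough sketch of trend-based planning, the example below fits a straight line to weekly peak utilization and estimates when that line reaches a planning limit. Using weekly peaks is an assumption made here to sidestep the time-of-day and day-of-week cycles. The function names, the sample values and the 80% limit are likewise hypothetical.

```python
def fit_trend(values):
    """Least-squares line through (0, values[0]), (1, values[1]), ...
    Returns (slope, intercept)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(values))
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


def weeks_until(values, limit):
    """Weeks after the last sample until the trend line reaches `limit`,
    or None if the trend is flat or falling."""
    slope, intercept = fit_trend(values)
    if slope <= 0:
        return None
    crossing = (limit - intercept) / slope
    return max(0.0, crossing - (len(values) - 1))


# Hypothetical weekly peak utilization of one link, in percent.
weekly_peak_pct = [40, 42, 44, 46, 48, 50]
print(weeks_until(weekly_peak_pct, limit=80))  # prints 15.0
```

A straight line is the crudest possible model. It is shown only because it makes the planning question, "when do we run out?", explicit.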

Network Troubleshooting

When a network service is unexpectedly interrupted or its performance degrades, engineers must manually correlate multiple data sources and alarms to identify the root cause of the problem. An AI/ML system can find common failure patterns (e.g., repeated duplicate ACKs in a TCP connection between a server and a client), so that operations teams can proactively mitigate issues before they cause serious problems.
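The duplicate ACK pattern mentioned above can be turned into a feature that an analytics pipeline could consume. The sketch below counts repeated ACK numbers on segments that carry no data. This is a deliberate simplification: the formal definition in RFC 5681 has additional conditions. The tuple layout, function names and sample capture are hypothetical.

```python
from collections import defaultdict


def count_duplicate_acks(packets):
    """Count repeated ACK numbers on data-less segments, per direction.

    `packets` is an iterable of (src, dst, ack_number, payload_len)
    tuples in capture order. Returns {(src, dst): duplicate_count}.
    """
    last_ack = {}
    duplicates = defaultdict(int)
    for src, dst, ack, payload_len in packets:
        key = (src, dst)
        if payload_len == 0 and last_ack.get(key) == ack:
            duplicates[key] += 1
        last_ack[key] = ack
    return dict(duplicates)


def noisy_connections(packets, min_duplicates=3):
    """Directions with at least `min_duplicates` repeated ACKs."""
    counts = count_duplicate_acks(packets)
    return [key for key, n in counts.items() if n >= min_duplicates]


# Hypothetical capture: the client keeps acknowledging byte 1000.
packets = [
    ("client", "server", 1000, 0),
    ("client", "server", 1000, 0),
    ("client", "server", 1000, 0),
    ("client", "server", 1000, 0),
    ("server", "client", 500, 1460),
    ("client", "server", 2460, 0),
]
print(noisy_connections(packets))  # prints [('client', 'server')]
```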

Intelligent Alerting/Reduction of alert noise

Network operation teams analyze and correlate alerts and generate trouble tickets. With the proliferation of alerts from a multitude of devices and technologies, correlating them is a daunting task for engineers. The resulting alert fatigue makes them miss important incidents. AI/ML solutions are well suited to making automatic correlations and providing intelligent alerting.
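As a starting point before any learning is involved, alerts can be correlated with a simple rule: alerts from the same device that arrive close together are treated as one incident. The sketch below illustrates that baseline. The tuple layout, the five-minute window and the sample alerts are assumptions.

```python
def correlate_alerts(alerts, window_seconds=300):
    """Group alerts from the same device that arrive close together.

    `alerts` is a list of (timestamp_seconds, device, message) tuples.
    An alert joins the device's current group when it arrives within
    `window_seconds` of that group's latest alert; otherwise it starts
    a new group.
    """
    groups = []
    open_group = {}  # device -> index of its most recent group
    for ts, device, message in sorted(alerts):
        idx = open_group.get(device)
        if idx is not None and ts - groups[idx][-1][0] <= window_seconds:
            groups[idx].append((ts, device, message))
        else:
            open_group[device] = len(groups)
            groups.append([(ts, device, message)])
    return groups


# Hypothetical alert stream.
alerts = [
    (0, "edge-router-1", "interface down"),
    (20, "edge-router-1", "BGP session lost"),
    (45, "edge-router-1", "high packet loss"),
    (60, "firewall-2", "CPU above threshold"),
    (4000, "edge-router-1", "interface down"),
]
for group in correlate_alerts(alerts):
    print(len(group), group[0][1], [msg for _, _, msg in group])
```

Here five alerts collapse into three incidents. An AI/ML-based system would go further, for example by learning which alerts on different devices tend to occur together.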

Identify the scope

After identifying your use case, you need to define the scope of where you want to implement your AI/ML solution. Networks are vast and distributed, and implementing an AI/ML solution even for a single use case is a daunting task. Until you gain confidence in the solution, start small by limiting the scope of your use case by one or more of the following criteria.

  • Most commonly occurring application protocols (DNS, HTTP, SMTP)
  • One or more transport protocols (TCP or UDP)
  • Geographic region
  • IP address/subnet ranges
  • Most critical hosts or servers (e.g., database server, webserver)
  • Specific devices (edge router, core router, firewalls)
  • Traffic types (east-west traffic, WAN traffic, wireless traffic)
  • Content type (video, voice)
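A scope can be written down explicitly and applied as a filter before any data reaches the analytics. The sketch below covers three of the criteria above (application protocol, transport protocol and subnet). The `Scope` class, the flow record keys and the addresses are hypothetical.

```python
import ipaddress
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Scope:
    """Pilot scope; an empty set or None means 'no restriction'."""
    app_protocols: frozenset = frozenset()  # e.g. {"DNS", "HTTP"}
    transports: frozenset = frozenset()     # e.g. {"TCP"}
    subnet: Optional[str] = None            # e.g. "10.1.0.0/16"


def in_scope(flow, scope):
    """True when the flow record satisfies every restriction in scope."""
    if scope.app_protocols and flow["app"] not in scope.app_protocols:
        return False
    if scope.transports and flow["transport"] not in scope.transports:
        return False
    if scope.subnet is not None:
        network = ipaddress.ip_network(scope.subnet)
        if ipaddress.ip_address(flow["dst_ip"]) not in network:
            return False
    return True


# Hypothetical flow records.
flows = [
    {"app": "DNS", "transport": "UDP", "dst_ip": "10.1.4.20"},
    {"app": "HTTP", "transport": "TCP", "dst_ip": "10.1.9.8"},
    {"app": "DNS", "transport": "UDP", "dst_ip": "192.168.7.3"},
    {"app": "SMTP", "transport": "TCP", "dst_ip": "10.1.2.2"},
]

pilot = Scope(app_protocols=frozenset({"DNS", "HTTP"}), subnet="10.1.0.0/16")
selected = [flow for flow in flows if in_scope(flow, pilot)]
print(len(selected))  # prints 2
```

Widening the pilot later is then a matter of changing the `Scope` values rather than the pipeline.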

Get teams on board

Once you have identified the use case and defined the scope of the implementation, you need to have a team that is working towards a common goal.

Identify the key team members in your organization who are essential to a successful implementation. They vary by organization and business goals, but at a minimum you should have the following people on your team.

  • People who understand the network data and can figure out how to use the results from the AI/ML analytics.
  • People who understand the data infrastructure and storage (different from understanding the network data itself)
  • Network operations people who will be using the outputs in their day-to-day operations.
  • Management who will support the effort with adequate resources.

Many organizations run AI/ML analytics that are separate from their operational environments and struggle to bring the two together. However, bridging the two is not only important for getting operational insights from AI/ML; it is also the only viable long-term approach. Otherwise, AI/ML projects become glorified science experiments that get put on the shelf.

Determine the solution

The final step in implementing an AI/ML solution is to choose the technology.

What criteria should you use to determine the technology solution?

Buy or Build

Determine whether you want to build in-house, use existing solutions (commercial or open source), or combine the two. Each option has its pros and cons and comes with different technology and staffing needs.

Additional data infrastructure

Identify what additional resources, if any, are needed for storing and retrieving the data.

Additional Staffing

Identify and clearly define roles and responsibilities and allocate staffing resources adequately. One of the common challenges in implementing new solutions is that managers expect team members to do the work in addition to their day-to-day duties. In that case, the day-to-day duties take precedence, and the new project gets put on the back burner.

AI is no longer just a buzzword in the enterprise and service provider networking space. It is time to prepare for it and do it right, so that implementations succeed in the long run.
