
Network Telemetry at AmLight, Jeronimo Bezerra (Amlight)

Abstract


Funded by the U.S. National Science Foundation (NSF), AmLight is a distributed academic exchange point connecting national and regional research and education networks in Latin America to the U.S. and Africa. AmLight is responsible for transporting science data from most of the telescopes in Chile, supporting the Large Hadron Collider Tier 2 data center in Brazil, and serving many other science projects.

AmLight has operated as an SDN network since 2014 and is being migrated to a white-box infrastructure to support P4Runtime and In-band Network Telemetry (INT). In 2018, Florida International University (FIU) was funded by NSF to evaluate telemetry opportunities over AmLight links and enable real-time monitoring of science data flows, including flows from the Vera Rubin Observatory (formerly known as the Large Synoptic Survey Telescope, LSST).

Currently, seven Tofino-based white boxes are deployed at AmLight using the NoviWare network operating system to gather and export telemetry reports. With this presentation, we aim to share our experience, achievements, and struggles/challenges.

Answering the conference request:

• Which data are collected and how

In-band Network Telemetry on the Tofino chip enables switches to export, per packet, the IP and TCP/UDP headers plus INT metadata. The INT metadata currently supported includes the ingress port ID, egress port ID, ingress timestamp, egress timestamp, hop delay, egress queue ID, and egress queue occupancy. Each Tofino-based switch in the path adds its INT metadata to user packets, and the Tofino chip exports the data directly from the data plane, in real time, to an INT Collector.
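For illustration, an INT Collector might decode one per-hop metadata entry along the lines of the sketch below; the field widths and ordering are assumptions made for the example, not the exact report format exported by the Tofino/NoviWare switches.

```python
import struct

# Hypothetical layout of one per-hop INT metadata entry covering the fields
# listed above; widths and ordering are illustrative assumptions only.
HOP_FORMAT = "!HHQQIIB3x"   # ingress port, egress port, ingress timestamp,
                            # egress timestamp, hop delay, queue occupancy,
                            # queue ID, padding
HOP_SIZE = struct.calcsize(HOP_FORMAT)

def parse_hop(entry: bytes) -> dict:
    """Decode a single per-hop metadata entry into named fields."""
    (ingress_port, egress_port, ingress_ts, egress_ts,
     hop_delay, queue_occupancy, queue_id) = struct.unpack(HOP_FORMAT, entry)
    return {
        "ingress_port": ingress_port,
        "egress_port": egress_port,
        "ingress_timestamp": ingress_ts,
        "egress_timestamp": egress_ts,
        "hop_delay": hop_delay,
        "egress_queue_id": queue_id,
        "egress_queue_occupancy": queue_occupancy,
    }

def parse_report(metadata: bytes) -> list[dict]:
    """Split the INT metadata stack into one entry per switch on the path."""
    return [parse_hop(metadata[i:i + HOP_SIZE])
            for i in range(0, len(metadata) - HOP_SIZE + 1, HOP_SIZE)]
```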

• Tools used for the analysis and presentation/visualization/storage of data

We created several tools for data analysis and visualization/correlation of events.

• Benefits

Real-time visibility into interface buffers/queues shows us where the points of attention are. We also get proof-of-transit per packet, equivalent to a layer 1/2 traceroute.
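Building on the hypothetical parser sketched above, both benefits can be surfaced with small helpers; the threshold and field names are illustrative only, not AmLight settings.

```python
QUEUE_ALERT_CELLS = 100_000   # placeholder threshold, not an AmLight setting

def queue_hotspots(hops: list[dict]) -> list[dict]:
    """Return hops whose egress queue occupancy exceeds the threshold,
    i.e. the points of attention in interface buffers/queues."""
    return [h for h in hops if h["egress_queue_occupancy"] > QUEUE_ALERT_CELLS]

def packet_path(hops: list[dict]) -> list[tuple[int, int, int]]:
    """Proof-of-transit view of one packet: (ingress_port, egress_port,
    hop_delay) per switch, analogous to a layer 1/2 traceroute."""
    return [(h["ingress_port"], h["egress_port"], h["hop_delay"]) for h in hops]
```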

• Issues, challenges, and gaps - what you would like to be able to do but cannot

A typical Vera Rubin telescope data transfer consists of 5-second bursts of 9-Kbyte packets at 40+ Gbps from Chile to the U.S. throughout the night. Each burst creates a telemetry flow of 1.4 Gbps at 487 kpps and a total of 900 MB of telemetry data to be processed/stored/shared. The challenge is receiving 487,000 256-byte packets per second on a single flow, single NIC queue, and single CPU core, and processing them in real time. Without kernel bypass, most CPU cores will run at 100% and drop more than 80% of the packets due to the high CPU utilization. And this is just one flow over AmLight.
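To put those figures in perspective, the back-of-the-envelope arithmetic below (using only the numbers quoted above, not measurements) shows the roughly 2-microsecond budget a single core has for each telemetry report, which is why kernel bypass becomes necessary.

```python
# Back-of-the-envelope scale of one Vera Rubin burst, from the figures above.
PPS = 487_000              # telemetry reports per second: one flow, one NIC queue, one core
TELEMETRY_GBPS = 1.4       # telemetry flow rate during a burst
BURST_SECONDS = 5          # length of one data-transfer burst

per_packet_budget_ns = 1e9 / PPS                             # ~2,050 ns to receive, parse,
                                                             # and hand off each report
reports_per_burst = PPS * BURST_SECONDS                      # ~2.4 million reports per burst
bytes_per_burst = TELEMETRY_GBPS * 1e9 / 8 * BURST_SECONDS   # ~875 MB, in line with the
                                                             # ~900 MB figure above

print(f"{per_packet_budget_ns:.0f} ns per report, "
      f"{reports_per_burst:,.0f} reports, ~{bytes_per_burst / 1e6:.0f} MB per burst")
```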

Bio:

DPDK + Kafka: Multi-MPPS Telemetry Data Ingest and Stream Processing at ESnet, Richard Cziva (ESnet)

Abstract

Bio: Richard Cziva is a software engineer at ESnet. He has a range of technical interests including traffic and performance analysis, data-plane programmability, high-speed packet processing, software-defined networking, and network function virtualization.
Prior to joining ESnet in 2018, Richard was a Research Associate at University of Glasgow, where he looked at how advanced services (e.g., personalized firewalls, intrusion detection modules, measurement functions) can be implemented and managed inside wide area networks with programmable edge capabilities.
Richard holds a BSc in Computer Engineering (2013) from the Budapest University of Technology and Economics, Hungary, and a Ph.D. in Computer Science (2018) from the University of Glasgow, United Kingdom.


NetSage measurement and monitoring platform, Doug Jontz (Indiana University)

Abstract: NetSage is a unified, open, privacy-aware measurement, analysis, and visualization service designed to address the needs of today’s research and education (R&E) data sharing collaborations. NetSage is currently deployed on both international and U.S. domestic regional networks to help users detect patterns of behavior and identify possible problems, which can lead to better data transfers. It combines SNMP, flow, and Tstat data from archives with data from active perfSONAR measurements into unified dashboard views.

The innovative aspect of NetSage is not in the individual pieces but rather in the integration of data sources to support objective performance observations as a whole. NetSage deployments can collect data from routers, switches, active testing sites, and science data archives, which are common for collaborative research. NetSage uses a combination of passive and active measurement data to provide longitudinal performance visualizations via performance Dashboards. The Dashboards can be used to identify changes in the behavior of monitored resources, new patterns in data transfers, or unexpected data movement, helping researchers achieve better performance for inter-institutional data sharing.

Unlike many other monitoring tools, NetSage was designed to enable deeper insight into network behaviors by combining multiple data sources to create a result larger than the sum of its parts, and to make that data available to a broad set of end users. NetSage is used for data analysis to understand longer-term trends and behaviors, unlike many other tools that aim only to support network operations.

The NetSage software consists of a set of open source tools that are deployed on local systems and a managed, centralized, secure data archive. NetSage TestPoints are a collection of software and hardware components that gather active and passive data into records, which are then sent to the Data Ingest Pipeline. The five-step Pipeline filters those records and adds additional tags before de-identifying the data. The records are then stored in the NetSage Archive, a centralized storage framework consisting of two databases: a Time Series Data System (TSDS) archive and an Elasticsearch archive. Performance Dashboards, built using the open source Grafana analysis and visualization engine, read the records from the NetSage Archive and present visualizations that answer the questions identified by the stakeholders.
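As a schematic illustration of that filter, tag, and de-identify flow, the sketch below shows how individual records could pass through such a pipeline; the field names, threshold, and hashing scheme are placeholders for the example, not the actual NetSage pipeline code.

```python
import hashlib

# Placeholder mapping; a real deployment would use curated lookup tables.
ORG_BY_PREFIX = {"198.51.100.": "Example University"}

def filter_record(rec: dict) -> bool:
    """Keep only records worth archiving, e.g. flows above a minimum size
    (the 10 MB threshold is a placeholder, not a NetSage setting)."""
    return rec.get("bytes", 0) >= 10_000_000

def tag_record(rec: dict) -> dict:
    """Add metadata tags (only an organization tag here; a real pipeline
    adds more, such as science discipline)."""
    org = next((name for prefix, name in ORG_BY_PREFIX.items()
                if rec["src_ip"].startswith(prefix)), "unknown")
    rec["tags"] = {"organization": org}
    return rec

def deidentify_record(rec: dict) -> dict:
    """Replace identifying fields with stable, non-reversible hashes before
    the record is sent to the central archive."""
    for field in ("src_ip", "dst_ip"):
        rec[field] = hashlib.sha256(rec[field].encode()).hexdigest()[:16]
    return rec

def ingest(records):
    """Filter, tag, and de-identify records, yielding them ready for storage
    (the TSDS / Elasticsearch archives described above)."""
    for rec in records:
        if filter_record(rec):
            yield deidentify_record(tag_record(rec))
```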

Bio:

A Proposal towards sFlow Monitoring Dashboards for AI-controlled NRENs, Mariam Kiran (ESnet)

Abstract: Network monitoring collects heterogeneous performance data such as TCP transfer statistics, packet-level checks, bandwidth, download speeds, and more, usually through passive and active probing of the network. Multiple monitoring tools can help collect these disparate metrics, but mostly by probing the network, which introduces extra noise and probe packets that are also recorded. Additionally, building a single visualization tool that encompasses all of this data is challenging. In this paper, we start by discussing NetGraf, a tool we were developing to visualize data from multiple network monitoring tools using Grafana, and discuss the key findings and challenges we faced. As a result, we propose further work towards an sFlow monitoring dashboard to address these network monitoring challenges. This paper contributes to the theme of automating the setup of open-source network monitoring software and its usability for researchers looking to deploy an end-to-end monitoring stack on their own testbeds.

Bio: Mariam Kiran is a research scientist with shared positions at the Energy Sciences Network (ESnet) and the Scientific Data Management (SDM) group in the Computational Research Division. Her work concentrates on using advanced software and machine learning techniques to advance system architectures, particularly high-speed networks such as the DOE networks.

Her current work explores reinforcement learning, unsupervised clustering, and classification techniques to optimally control distributed network resources, improve high-speed big-data transfers for exascale science applications, and optimize how current network infrastructure is utilized. Kiran is the recipient of the DOE ASCR Early Career Award in 2017. Before joining LBNL in 2016, Kiran held positions as a lecturer and research fellow at the Universities of Sheffield and Leeds in the UK. She earned her undergraduate and PhD degrees in software engineering and computer science from the University of Sheffield, UK, in 2011.