


Session 1

Data Plane Programming / In Band Telemetry, Mauro Campanella, (GARR)

Abstract: The Data Plane Programming task in GN4-3 focuses on two use cases: simple DDoS identification and In-Band Telemetry (INT) using the P4 programming language. The talk reports on the ongoing INT experience, providing new insight into network behaviour and the challenges of data collection and presentation.

Bio: Mauro Campanella is innovation coordinator and responsible for international research projects at the Italian Research and Education Network (GARR). He started working on computers and networks in 1984 and has participated in various European projects, contributing to the GÉANT project since its first generation. His current efforts are on the next generation of the GARR network, GÉANT planning and the BELLA project.


Streaming Telemetry, Pavle Vuletić (UoB)

Abstract: Gathering network service performance parameters in multi-vendor environments was for a long time a challenge. It required either the use of dedicated hardware probes, which increased the hardware footprint and the complexity of the PoP configuration, or the use of proprietary, vendor-specific, and non-interoperable features on network elements. In recent years, standard protocols for network performance evaluation like TWAMP (Two-Way Active Measurement Protocol) have started to appear as a regular feature in network elements of different vendors. We present an approach that leverages capabilities offered by network devices of multiple vendors for monitoring purposes. Specifically, these can be used as clients/servers of probing mechanisms such as TWAMP, avoiding the use of external monitoring systems. Measurements from the monitoring process are collected in a streaming telemetry fashion, providing real-time information on the network performance. We showcase such mechanisms using virtual routers (Cisco, Juniper) deployed in the GÉANT Testbeds Service (GTS) and virtual services on top of the routers. Conducted experiments illustrate that using network devices for service monitoring provides accurate per-segment performance evaluation essentially comparable to the performance of external monitoring mechanisms.

Bio: Pavle Vuletić obtained his BSc, MSc and PhD in Computer Systems and Network Architecture from the University of Belgrade, School of Electrical Engineering. He previously worked at AMRES, the national research and education network, in positions ranging from network engineer to deputy director, being responsible mainly for network development. He is currently an associate professor at the University of Belgrade, School of Electrical Engineering, in the Department of Computer Engineering and Information Theory, teaching the Advanced Computer Networks, Data Security and Software Defined Networks courses. His research interests span from network management principles and software defined networks to network and service performance and network security. Pavle Vuletić leads WP6 Task 3 - Monitoring and Management in the current GN4-3 project.

GÉANT Operations use of Telemetry, Richard Havern (GÉANT)

Abstract: GÉANT’s telemetry evolution from traditional techniques to modern streaming telemetry

Bio: Rick Havern's telecommunications career started in the USAF, where he worked on the Internet for 11 years prior to the demilitarisation of the network; he then worked for 12 years at PSINet, the first commercial ISP. He joined GÉANT in 2012 to work on the last network refresh and has since become Head of Network Engineering.


Use of Sketches in DPP for DDoS and Monitoring, Damian Parniewicz (PSNC)

Abstract: Programmable data plane platforms like Tofino-based switches and FPGA linecards enable the implementation of new solutions for network traffic handling. The presentation will report the experience of the WP6 Data Plane Programming task in the use of "sketch" algorithms implemented directly in the data plane for DDoS traffic detection and traffic monitoring. Sketch structures provide memory-efficient collection of summarised traffic statistics and have interesting benefits in comparison to other monitoring techniques. They allow all incoming packets to be processed at wire speed, because a sketch requires only a very limited set of actions to be performed for every packet. This implies that all processed packets can contribute to traffic statistics without any performance penalty. Sketches are a great tool for scalable, fine-grained, millisecond-or-lower-latency inline on-switch network analytics. Benefits, use cases and limitations of P4-based sketch deployment will be summarised based on our implementation and testing on the Tofino switch chip.
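
As a rough illustration of the idea, one widely used sketch structure, the count-min sketch, can be outlined in Python. The hash construction and dimensions below are illustrative only, not the task's actual P4 data-plane implementation:

```python
import hashlib

class CountMinSketch:
    """Count-min sketch: fixed memory, a constant number of actions per packet."""
    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _indexes(self, key):
        # One independent hash per row; blake2b's salt parameter varies the hash.
        for row in range(self.depth):
            h = hashlib.blake2b(key.encode(), salt=bytes([row] * 8)).digest()
            yield int.from_bytes(h[:8], "big") % self.width

    def update(self, key, count=1):
        for row, idx in enumerate(self._indexes(key)):
            self.table[row][idx] += count

    def estimate(self, key):
        # The true count is never underestimated; collisions can only inflate it.
        return min(self.table[row][idx]
                   for row, idx in enumerate(self._indexes(key)))
```

In a P4 target the rows become register arrays and the hashes are computed by the chip's hash engines, which is what keeps the per-packet cost constant.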

Bio: Damian Parniewicz is a researcher at the Poznań Supercomputing and Networking Center. He has participated in many European Union research and development projects. His major interest areas are network control planes, network monitoring, programmable chipsets, SDN/NFV, Big Data technologies, edge computing and ML/DL applied to network security.


Session 2

Scalable and Cost-Efficient Generation of Unsampled NetFlow, Alexander Gall (SWITCH)

Abstract: NetFlow has been around for a long time and remains one of the most important sources of information about the type and characteristics of network traffic. Due to increasing traffic volumes, most implementations of the exporter part of NetFlow available today resort to heavy sampling of packets to keep the load on the devices at a manageable level. However, unsampled NetFlow is a valuable resource for all kinds of traffic analysis involving a small number of flows, or flows containing very few packets compared to a typical sampling rate.
Examples of such applications are the reliable detection of malicious traffic like C&C communication involving botnets, or the debugging of network connectivity issues, which sometimes requires visibility down to single packets. In order to preserve the capability of producing unsampled NetFlow, SWITCH first moved from router-based implementations to a dedicated commercial NetFlow appliance, and then to an in-house developed solution to increase cost-efficiency as well as flexibility in terms of features.
The current solution uses a P4-programmable switch as an aggregator of packets obtained from optical taps on all AS-external links, and a pure software-based implementation of an IPFIX-compliant exporter running on off-the-shelf x86-based servers, with excellent scaling with respect to the number of CPU cores. The aggregated volume of traffic subject to NetFlow generation in the SWITCH network currently peaks at around 100 Gbps with about 15 Mpps and up to 300k flows per second. The presentation discusses the why and the how of this solution.
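
The core of any software flow exporter is a cache that aggregates packets into per-flow records keyed by the 5-tuple and exports them on timeout. The following is a minimal Python sketch of that idea, not SWITCH's implementation; the field names and the single active timeout are simplifying assumptions:

```python
from collections import namedtuple

FlowKey = namedtuple("FlowKey", "src dst sport dport proto")

class FlowCache:
    """Aggregate packets into flow records, as an IPFIX exporter core would."""
    def __init__(self, active_timeout=60.0):
        self.active_timeout = active_timeout
        self.flows = {}  # FlowKey -> per-flow counters

    def add_packet(self, key, length, ts):
        rec = self.flows.get(key)
        if rec is None:
            rec = self.flows[key] = {"packets": 0, "bytes": 0,
                                     "first": ts, "last": ts}
        rec["packets"] += 1
        rec["bytes"] += length
        rec["last"] = ts

    def expire(self, now):
        """Yield and remove flows past the active timeout, ready for export."""
        done = [k for k, r in self.flows.items()
                if now - r["first"] >= self.active_timeout]
        for k in done:
            yield k, self.flows.pop(k)
```

Because each packet touches only its own cache entry, the work parallelises naturally across CPU cores once packets are hashed to cores by flow key, which is the scaling property the abstract mentions.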

Bio: Alexander Gall has been working as a network engineer at SWITCH for 20 years. His fields of expertise include IP routing, Multicast, network performance troubleshooting and NetFlow data export. He has been involved in a number of European network-centric projects like 6net, eduPERT, FEDERICA and most recently GN4 WP6 Task 1.

DDoS Detection on P4 SmartNICs, Marinos Dimolianis (NTUA)

Abstract: Legacy monitoring mechanisms rely on packet samples/flow records exported from agents within network devices (routers, switches). These large amounts of data are typically relayed to external collectors for further analysis. Such approaches struggle to provide real-time information about the network state due to the latency introduced by external data collection and processing. Data plane programmability has created new pathways for network monitoring, enabling in-network packet processing and analysis. In this context, we will present our P4 DDoS detection schema that enables rapid control loops for the identification and mitigation of cyber-attacks. Our approach calculates important metrics of malicious traffic in the data plane, considering short-interval time windows. Comparing these metrics with appropriately defined thresholds, our schema identifies ongoing DDoS attacks, pinpoints the victim network and notifies external systems for mitigation purposes. We deployed our approach on programmable P4-enabled SmartNICs and evaluated it using real network traffic. We will share (i) our experience in designing and implementing data plane programs in P4-enabled SmartNICs and (ii) results related to the detection accuracy and the packet processing performance of our mechanism at high packet rates.
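
The windowed threshold check described above can be sketched as follows. The window length and threshold are hypothetical values, and the real schema runs in the SmartNIC data plane rather than in Python:

```python
from collections import Counter

def detect_victims(dst_ips, window_s, pps_threshold):
    """Count packets per destination within one short time window and flag
    destinations whose packet rate exceeds the configured threshold."""
    counts = Counter(dst_ips)
    return {dst: n / window_s for dst, n in counts.items()
            if n / window_s > pps_threshold}
```

In the data plane the per-destination counters live in registers and are reset at each window boundary, so the comparison against the threshold happens without leaving the device.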

Bio: Marinos Dimolianis received his Diploma in Electrical and Computer Engineering from National Technical University of Athens in 2017. He is currently a Ph.D. candidate in the same faculty. He has worked in GN4-2 and GN4-3 projects in Network Monitoring and Management tasks and has also proven experience in the research industry. His research interests lie in Computer Networks, Network Security, and Software-Defined Networking.

Managing the Telemetry Firehose from 1:1 packet sampling, Yatish Kumar (ESnet)

Abstract: ESnet is implementing precision network telemetry that produces telemetry on a 1:1 basis for every flow that is monitored. Simply recording this telemetry for long-term analysis is completely infeasible. Instead, we have examined a few techniques for data reduction such that the 1:1 measurements can be reduced in real time for feature extraction and storage. In this talk we will describe the more promising methods for real-time data reduction.

Bio: Yatish consults in the area of programmable networking and semiconductor design. He serves as an affiliate with Lawrence Berkeley National Labs, and is responsible for developing ESnet’s High Touch FPGA based SDN platform. Prior to ESnet, he founded Corsa Technology, a successful venture financed SDN startup. He served as the ONF Area Director for standards, including the OpenFlow specification and served on the ONF Chipmaker’s Advisory Board. Yatish has more than 30 years of networking industry experience. Prior to Corsa he was involved in a number of successful startups, including Catena Networks, which was acquired by Ciena in 2004. Yatish started his career at Nortel where he contributed to and managed the development of a number of mixed signal semiconductor projects including designs for ADSL, POTS, CDMA, Cable Modems and handsets. He holds patents in DSP architectures, and data compression and has authored papers on high level synthesis, and embedded processor design as well as contributing to the development of ITU 992.1, ANSI T1.413 and Telcordia GR909 standards.

The path to modern logging, monitoring, and alerting in GARR, Fabio Farina (GARR)

Abstract: GARR has completely reviewed its approach to monitoring during the last two years. Telemetry, collection, analysis and alerting have become a standardized common toolset in the GARR services lifecycle for both application and infrastructure monitoring. In this short talk we will discuss how GARR adopted telemetry stacks for measuring and sensing the instantaneous status of both backbone routers and ICT elements, such as underlay x86 nodes and overlay items participating in (micro-)service architectures. This process allowed GARR to reach a genuinely holistic view of most end-user services. We will briefly review what tools have been adopted for different monitoring use cases, and how data are extracted, filtered, and shown to the operators. We will show how asynchronous alerting has changed the way the overall monitoring duties are perceived in our everyday activity. We will give a brief overview of how we chose among the many available tools, like Elastic Stack, Influx, Prometheus and Grafana, harmonizing the solutions across the different use cases, and how we are automating the management of the data lakes and the sensing probes using Ansible and Kubernetes. Finally, we will present the lessons we learned in this journey and our suggestions for approaching this new, exciting and yet somewhat unexplored landscape.

Bio: Fabio Farina has a PhD in Computer Science and has worked with GARR since 2010. Fabio works on European projects, on the creation of new services, and on NFV, edge computing and orchestration under the GARR network evolution framework. In particular, during the last year Fabio contributed to the refactoring and automation of the monitoring and logging software stacks adopted by the GARR Infrastructure and System Support departments.


Session 3

Community Shared Telemetry, Karl Newell (Internet2)

Abstract: The Internet2 community, in collaboration with international partners, is working on an effort to collect, or provide access to, telemetry across multiple organizations, with a primary goal of supporting analytics and insight into end-to-end network performance. The effort is being led by Joe Breen, University of Utah; Dan Doyle, Indiana University GlobalNOC; and Karl Newell, Internet2. Internet2 and many of the GlobalNOC customers (US state regional networks) already make some of their network telemetry publicly available. We are working with the community to collect more telemetry into a central time series database. This talk will summarize the collection and analytics efforts and highlight next steps.

Bio: I have been involved in the research and education community for almost 20 years with experience in Linux system administration, network security, network engineering and, now, network automation. Current projects include network service model development to support automation and orchestration for the new Internet2 network and a community effort around sharing network telemetry.

Anomaly detection in Data Center infrastructure, Krzysztof Martyn (PSNC)

Abstract: Infrastructure health monitoring is a difficult and time-consuming process that requires a lot of experience. It demands the operator's constant attention and focus when analysing the many metrics describing the state of the system. This is an important aspect of the operation of individual applications, as well as of the entire system, because it allows a failure to be detected or even prevented, and its cause identified. For this reason, it is important to have the right tools to support this process. In the PSNC data center, we test tools and solutions supporting the process of monitoring the condition of rack servers. One technique is to detect unusual behaviour or changes in computing server behaviour. Each server sends basic telemetry data such as CPU, RAM, disk and network information to a database for further processing by artificial intelligence algorithms. There, the metrics are compared with the historical data and the behaviour model of each server, and an anomaly is reported if any unusual behaviour is detected. Besides server condition monitoring, we have developed an automation platform for network monitoring workflows. Using this solution, we can easily access a copy of raw network traffic from many network taps, perform packet header preprocessing and store the data on distributed HDFS data nodes. The last step of the workflow is running artificial intelligence algorithms in order to perform any kind of network analytics. We used this platform to develop unsupervised neural networks for the detection of DDoS attacks in the PSNC network.
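
The per-server comparison against historical behaviour can be illustrated with a simple z-score test. This is a stand-in for illustration only, not PSNC's actual behaviour model:

```python
from statistics import mean, stdev

def is_anomalous(history, current, z_threshold=3.0):
    """Flag a metric sample that deviates more than z_threshold standard
    deviations from this server's own history (a simple per-host baseline)."""
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # A perfectly flat history: any deviation at all is unusual.
        return current != mu
    return abs(current - mu) / sigma > z_threshold
```

A baseline learned per server, rather than a single global threshold, is what lets the same check work across machines with very different normal load profiles.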

Bio: Krzysztof Martyn currently works in the network department at PSNC, where he is responsible for collecting, monitoring and analyzing telemetry data in the data center. He has participated in several research projects, where his main task was to create a system that recognizes DDoS attacks and anomalies in network traffic using ML and deep learning methods. Moreover, in the Treatnet project he was responsible for testing the efficiency of Big Data storage and processing tools. He is a PhD student in information technology at the Institute of Computing Science, Poznań University of Technology. His research is focused on decision aiding and machine learning.

Making Sense Of Your Big Data & Empowering Users, Dan Doyle (Indiana Global NOC)

Abstract: Experiences monitoring and measuring the many different networks we support, at scales ranging from campus to regional to national and international. In particular, the topics I would like to focus on:
- High-level operational views derived from multiple data sets: going from many separate data collection systems (logs, monitoring alerts, telemetry, etc.) that require a high amount of user expertise to more simplified views.
- Empowering end users to explore and arrange their own data: going from every report, customization, etc. requiring software features and developer time to end users creating and tuning their own visualizations.
- The tools that we use to accomplish all of this, including the extensions we had to make and the limitations or overall philosophy changes that arose as a result.

Bio: Dan Doyle works at the GlobalNOC at Indiana University as the head of its Data Collection and Analysis team.

Making OSS Network Data Available to Network Researchers, Alex Moura (RNP)

Abstract: Research and Education Networks gather a wide range of data in their Operations Support Systems (OSS). Making part of that data available to third-party researchers within the academic community who need it for research purposes largely remains an open problem that places an operational burden on network operators. Besides that, some data cannot be openly published because of security concerns, and need to be anonymized or otherwise treated before being made available as network datasets. This presentation will introduce a new RNP initiative to streamline the process of preparing and publishing network datasets for research purposes.

Bio: Alex Moura is a Network Engineer and Science Engagement Specialist at RNP, the Brazilian National Research and Education Network, and holds a master's degree in information systems and computer networks from Unirio.


Session 4

Network Telemetry at AmLight, Jeronimo Bezerra (AmLight)

Abstract:  Funded by the U.S. National Science Foundation (NSF), AmLight is a distributed academic exchange point connecting national and regional research and education networks in Latin America to the U.S. and Africa. AmLight is responsible for transporting science data related to most telescopes in Chile and supporting the Large Hadron Collider Tier 2 data center in Brazil, and many other science projects.

AmLight has operated as an SDN network since 2014 and is being migrated to a white-box infrastructure to support P4Runtime and In-band Network Telemetry (INT). In 2018, Florida International University (FIU) was funded by NSF to evaluate telemetry opportunities over AmLight links to enable real-time monitoring of data science flows, including flows from the Vera Rubin Observatory, formerly known as the Large Synoptic Survey Telescope (LSST).

Currently, seven Tofino-based white boxes are deployed at AmLight using the NoviWare network operating system to gather and export telemetry reports. With this presentation, we aim to share our experience, achievements, and struggles/challenges.

Answering the conference request:

• Which data are collected and how

In-band Network Telemetry on the Tofino chip enables switches to export, per packet, the IP+TCP/UDP header and INT metadata. The INT metadata currently supported includes ingress port ID, egress port ID, ingress timestamp, egress timestamp, hop delay, egress queue ID, and egress queue occupancy. Each Tofino-based switch in the path adds its INT metadata to user packets. The Tofino chip exports the data directly from the data plane, in real time, to an INT collector.
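
A collector-side parse of such a per-hop metadata stack might look like the following sketch. The field widths and ordering here are assumptions chosen for illustration, not the exact Tofino/NoviWare report format:

```python
import struct

# Illustrative per-hop INT metadata layout (widths are assumptions):
# ingress port, egress port, ingress/egress timestamps, hop delay,
# egress queue ID, egress queue occupancy.
HOP_FMT = "!HHQQIBI"
HOP_LEN = struct.calcsize(HOP_FMT)

def parse_int_hops(metadata: bytes):
    """Split an INT metadata stack into one record per on-path switch."""
    hops = []
    for off in range(0, len(metadata), HOP_LEN):
        (in_port, out_port, ts_in, ts_out,
         hop_delay, queue_id, queue_occ) = struct.unpack_from(HOP_FMT,
                                                              metadata, off)
        hops.append({"ingress_port": in_port, "egress_port": out_port,
                     "ingress_ts": ts_in, "egress_ts": ts_out,
                     "hop_delay": hop_delay, "queue_id": queue_id,
                     "queue_occupancy": queue_occ})
    return hops
```

Because each switch appends its own block, the ordered list of records doubles as the proof-of-transit mentioned later in the abstract.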

• Tools used for the analysis and presentation/visualization/storage of data

We created several tools for data analysis and visualization/correlation of events.

• Benefits

Real-time visibility of interface buffers/queues gives us an understanding of where the points of attention are. Also, we have proof-of-transit per packet, equivalent to a layer 1/2 traceroute.

• Issues, challenges, and gaps - what you would like to be able to do but cannot

A typical Vera Rubin telescope data transfer will be 5-second bursts of 9-Kbyte packets at 40+ Gbps from Chile to the U.S. throughout the night. Each burst creates a telemetry flow of 1.4 Gbps at 487 kpps, and a total of 900 MB of telemetry data to be processed/stored/shared. The challenge is receiving 487,000 256-byte packets per second (single flow, single NIC queue, single CPU core) and processing them in real time. Without kernel bypass, most CPU cores will operate at 100% and drop more than 80% of the packets due to the high CPU utilization. And this is just one flow over AmLight.

Bio: Jeronimo Bezerra is the AmLight Network Architect.

DPDK + Kafka: Multi-MPPS Telemetry Data Ingest and Stream Processing at ESnet, Richard Cziva (ESnet)

Abstract: We will introduce ESnet's per-packet telemetry collection system, which uses Xilinx FPGAs.

The main focus will be on our DPDK application, fastcapa-ng, which takes telemetry packets from the wire and pushes them to Kafka. It can do filtering, down-sampling (user-specified 1:X ratio) and histogram generation (user-configurable), all implemented in this DPDK app. We will also show Prometheus/Grafana integration to monitor our pipeline.
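
fastcapa-ng itself is a DPDK application written in C; the two data-reduction steps mentioned, 1:X down-sampling and histogram generation, can be illustrated in Python as:

```python
def downsample(packets, ratio):
    """Keep every ratio-th packet: the user-specified 1:X down-sampling."""
    return packets[::ratio]

def histogram(values, bucket_width):
    """Bucket values (e.g. packet sizes or delays) into fixed-width bins,
    so only bin counts, not raw samples, need to be shipped to Kafka."""
    hist = {}
    for v in values:
        bucket = (v // bucket_width) * bucket_width
        hist[bucket] = hist.get(bucket, 0) + 1
    return hist
```

Both reductions are cheap enough to run per packet, which is what lets the pipeline keep up with multi-Mpps ingest before anything reaches Kafka.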

We will show how to run stream processing applications using the Kafka Streams API. Simple code for a SYN flood detection example will be shown.

Ingest rate challenges will be highlighted.

Bio: Richard Cziva is a software engineer at ESnet. He has a range of technical interests including traffic and performance analysis, data-plane programmability, high-speed packet processing, software-defined networking, and network function virtualization.
Prior to joining ESnet in 2018, Richard was a Research Associate at University of Glasgow, where he looked at how advanced services (e.g., personalized firewalls, intrusion detection modules, measurement functions) can be implemented and managed inside wide area networks with programmable edge capabilities.
Richard holds a BSc in Computer Engineering (2013) from Budapest University of Technology and Economics, Hungary and a Ph.D. in Computer Science (2018) from University of Glasgow, United Kingdom.


NetSage measurement and monitoring platform, Doug Jontz (Indiana University)

Abstract: NetSage is a unified, open, privacy-aware measurement, analysis, and visualization service designed to address the needs of today's research and education (R&E) data sharing collaborations. NetSage is currently deployed on both international and US domestic regional networks to help users detect patterns of behavior and identify possible problems, which can lead to better data transfers. It combines SNMP, Flow and Tstat data from archives, and data from active perfSONAR measurements, into unified dashboard views.

The innovative aspect of NetSage is not in the individual pieces but rather in the integration of data sources to support objective performance observations as a whole. NetSage deployments can collect data from routers, switches, active testing sites, and science data archives, which are common for collaborative research. NetSage uses a combination of passive and active measurement data to provide longitudinal performance visualizations via performance Dashboards. The Dashboards can be used to identify changes of behaviors over monitored resources, new patterns for data transfers, or unexpected data movement to help researchers achieve better performance for inter-institutional data sharing.

Unlike many other monitoring tools, NetSage was designed to enable further insight into network behaviors by combining multiple data sources to create a result larger than the sum of its parts, and to make that data available to a broad set of end users. NetSage is used for data analysis to understand longer-term trends and behaviors, unlike many other tools aimed at supporting network operations only.

The NetSage software consists of a set of open source tools that are deployed on local systems, and a managed, centralized, secure data archive. NetSage TestPoints are a collection of software and hardware components that gather active and passive data into records that are then sent to the Data Ingest Pipeline. The five-step Pipeline filters those records and adds additional tags before de-identifying the data. The records are then stored in the NetSage Archive, a centralized storage framework consisting of two different databases, a Time Series Data System (TSDS) archive and an Elasticsearch archive. Performance Dashboards, built using the open source Grafana analysis and visualization engine, access the records from the NetSage Archive to present visualizations to answer the questions identified by the stakeholders.

Bio: Doug Jontz is a Network Systems Analyst at Indiana University.

A Proposal towards sFlow Monitoring Dashboards for AI-controlled NRENs, Mariam Kiran (ESnet)

Abstract: Network monitoring collects heterogeneous performance data such as TCP transfers, packet-related checks, bandwidth, download speeds, and more, usually through passive and active probing of the network. Multiple monitoring tools can help collect these disparate, heterogeneous metrics, but mostly through probing the network, which introduces extra noise or packets that are also recorded. Additionally, building a visualization tool that encompasses all this data in one place is challenging. In this talk, we start by discussing NetGraf, a tool we were developing to visualize data from multiple network monitoring tools using Grafana, and discuss the key findings and challenges we faced. As a result, we propose further work towards an sFlow monitoring dashboard to address these network monitoring challenges. This talk contributes to the theme of automating open-source network monitoring software setups and their usability for researchers looking to deploy an end-to-end monitoring stack on their own testbeds.

Bio: Mariam Kiran is a research scientist with shared positions at the Energy Sciences Network and the Scientific Data Management (SDM) group in the Computational Research Division. Her work concentrates on using advanced software and machine learning techniques to advance system architectures, particularly high-speed networks such as DOE networks.

Her current work explores reinforcement learning, unsupervised clustering and classification techniques to optimally control distributed network resources, improving high-speed big data transfers for exascale science applications and optimizing how current network infrastructure is utilized. Kiran is the recipient of the DOE ASCR Early Career Award in 2017. Before joining LBNL in 2016, Kiran held positions as a lecturer and research fellow at the Universities of Sheffield and Leeds in the UK. She earned her undergraduate and PhD degrees in software engineering and computer science from the University of Sheffield, UK, in 2011.



