Detection and Localization of Network Black Holes

Authors: Ramana Rao Kompella, Jennifer Yates, Albert Greenberg, Alex C. Snoeren

Complete Citation

  • Ramana Rao Kompella, Jennifer Yates, Albert Greenberg, Alex C. Snoeren: Detection and Localization of Network Black Holes. INFOCOM 2007: 2180-2188

Abstract

Internet backbone networks are under constant flux, struggling to keep up with increasing demand. The pace of technology change often outstrips the deployment of associated fault monitoring capabilities that are built into today’s IP protocols and routers. Moreover, some of these new technologies cross networking layers, raising the potential for unanticipated interactions and service disruptions that the built-in monitoring systems cannot detect. In such instances, failures may cause data packets to be silently dropped inside the network without triggering any alarms or responses (e.g., the failure is not routed around). So-called “silent failures” or “black holes” represent a critical threat to today’s rapidly evolving networks. In this paper, we present a simple and effective method to detect and diagnose such silent failures. Our method uses active measurement between edge routers to raise alarms whenever end-to-end connectivity is disrupted, regardless of the cause. These alarms feed localization agents that employ spatial correlation techniques to isolate the root-cause of failure. Using data from two real systems deployed on sections of a tier-I ISP network, we successfully detect and localize three known black holes. Further, we present simulation results demonstrating that our system accurately and precisely (both greater than 80% according to our metrics) localizes a variety of failure classes.

Annotations

This paper focuses on black holes, or silent failures, in the context of MPLS-over-IP backbone networks. A black hole (silent failure) is a failure that the network's existing monitoring fails to detect, so packets are dropped without any alarm being raised.

A methodology was developed to detect and localize silent failures:

  • Fault detection: edge-to-edge probing
  • Fault localization: a greedy approach called MAX-COVERAGE. MAX-COVERAGE iteratively picks the link that explains the largest number of observations in the failure signature, prunes those observations from the failure signature, and repeats until no observations remain.

  • System architecture: Each edge router issues n probes to the other edge routers and reports the lost probes to the monitoring server. The monitoring server invokes the localization algorithm with the failure signature obtained from the detection system and obtains one hypothesis for each topology snapshot that the OSPF monitor recorded during the failure interval. It then applies the hypothesis selection algorithm, followed by the candidate selection algorithm, to output the final hypothesis that the operator uses for further diagnosis.

Figure: System architecture (score-architecture.png)
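
The greedy localization step described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `max_coverage`, the data layout (a failure signature as a set of failed edge-router pairs, and a `paths` map from each pair to the links on its route), and the alphabetical tie-breaking are all assumptions made for illustration. The sketch uses a single topology snapshot and leaves out the hypothesis selection and candidate selection stages.

```python
from typing import Dict, List, Set, Tuple

Pair = Tuple[str, str]  # (ingress edge router, egress edge router); hypothetical layout


def max_coverage(signature: Set[Pair],
                 paths: Dict[Pair, Set[str]]) -> Tuple[List[str], Set[Pair]]:
    """Greedy sketch: repeatedly pick the link explaining the most failed pairs.

    Returns the chosen links (the hypothesis) and any failed pairs that no
    known link could explain.
    """
    # Invert the path map: link -> failed observations that traverse it.
    explains: Dict[str, Set[Pair]] = {}
    for pair in signature:
        for link in paths.get(pair, ()):
            explains.setdefault(link, set()).add(pair)

    remaining = set(signature)
    hypothesis: List[str] = []
    while remaining:
        # sorted() makes tie-breaking deterministic (an assumption of this sketch).
        best = max(sorted(explains),
                   key=lambda link: len(explains[link] & remaining),
                   default=None)
        if best is None or not (explains[best] & remaining):
            break  # leftover observations have no known path to blame
        hypothesis.append(best)
        remaining -= explains[best]  # prune the explained observations
    return hypothesis, remaining


# Hypothetical topology: three failed pairs all cross the core link "c1-c2".
paths = {
    ("e1", "e2"): {"e1-c1", "c1-c2", "c2-e2"},
    ("e1", "e3"): {"e1-c1", "c1-c2", "c2-e3"},
    ("e2", "e3"): {"e2-c2", "c2-e3"},
    ("e4", "e3"): {"e4-c1", "c1-c2", "c2-e3"},
}
signature = {("e1", "e2"), ("e1", "e3"), ("e4", "e3")}
hypothesis, unexplained = max_coverage(signature, paths)
print(hypothesis)  # ['c1-c2']
```

Here the single link shared by all three failed pairs is chosen first, which empties the failure signature after one iteration. When failures are independent, the loop adds one link per iteration until every observation is explained.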

-- YingxinJiang - 01 Aug 2007

Topic revision: r1 - 01 Aug 2007 - 19:23:20 - YingxinJiang