Detection and Localization of Network Black Holes
Authors: Ramana Rao Kompella, Jennifer Yates, Albert Greenberg, Alex C. Snoeren
Complete Citation
- Ramana Rao Kompella, Jennifer Yates, Albert Greenberg, Alex C. Snoeren: Detection and Localization of Network Black Holes. INFOCOM 2007: 2180-2188
Abstract
Internet backbone networks are under constant
flux, struggling to keep up with increasing demand. The pace of
technology change often outstrips the deployment of associated
fault monitoring capabilities that are built into today’s IP
protocols and routers. Moreover, some of these new technologies
cross networking layers, raising the potential for unanticipated
interactions and service disruptions that the built-in monitoring
systems cannot detect. In such instances, failures may cause
data packets to be silently dropped inside the network without
triggering any alarms or responses (e.g., the failure is not
routed around). So-called “silent failures” or “black holes”
represent a critical threat to today’s rapidly evolving networks.
In this paper, we present a simple and effective method to
detect and diagnose such silent failures. Our method uses active
measurement between edge routers to raise alarms whenever end-to-end
connectivity is disrupted, regardless of the cause. These
alarms feed localization agents that employ spatial correlation
techniques to isolate the root-cause of failure. Using data from
two real systems deployed on sections of a tier-I ISP network, we
successfully detect and localize three known black holes. Further,
we present simulation results demonstrating that our system
accurately and precisely (both greater than 80% according to
our metrics) localizes a variety of failure classes.
Annotations
This paper focuses on black holes, also called silent failures, in the context of MPLS-over-IP backbone networks. A black hole or silent failure is a failure that the network's existing monitoring systems fail to detect.
A methodology was developed to detect and localize silent failures:
- Fault detection: edge-to-edge probing
- Fault localization: a greedy approach called MAX-COVERAGE. MAX-COVERAGE iteratively picks the link that explains the largest number of observations in the failure signature, prunes those observations from the failure signature, and repeats until no observations remain.
- System architecture: Each edge router issues n probes to the other edge routers and reports the lost probes to the monitoring server. The monitoring server invokes the localization algorithm with the failure signature obtained from the detection system and obtains one hypothesis for each topology snapshot that the OSPF monitor recorded during the failure interval. It then runs the hypothesis selection algorithm followed by the candidate selection algorithm to produce the final hypothesis, which the operator uses for further diagnosis.
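
The greedy localization step described above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function name `max_coverage`, the dictionary-based path representation, the link labels, and the tie-breaking rule are all assumptions made for illustration, and the hypothesis selection and candidate selection stages are left out. The failure signature is modeled as the set of edge-router pairs whose probes were lost, and each pair maps to the links on its path in a single topology snapshot.

```python
from typing import Dict, Hashable, Iterable, List, Set, Tuple

Link = Hashable
Pair = Tuple[str, str]  # (ingress edge router, egress edge router)


def max_coverage(paths: Dict[Pair, Iterable[Link]],
                 failed_pairs: Iterable[Pair]) -> List[Link]:
    """Greedily pick links until every failed observation is explained."""
    # Map each link to the failed observations whose path traverses it.
    explains: Dict[Link, Set[Pair]] = {}
    remaining: Set[Pair] = set()
    for pair in failed_pairs:
        links = list(paths.get(pair, ()))
        if not links:
            continue  # no known path, so no link can explain this pair
        remaining.add(pair)
        for link in links:
            explains.setdefault(link, set()).add(pair)

    hypothesis: List[Link] = []
    while remaining:
        # Pick the link that explains the most still-unexplained observations.
        # Sorting by repr is an assumed tie-break that keeps runs deterministic.
        best = max(sorted(explains, key=repr),
                   key=lambda link: len(explains[link] & remaining))
        hypothesis.append(best)
        # Prune the observations that the chosen link explains.
        remaining -= explains[best]
    return hypothesis


# Hypothetical topology: edge routers A, E, C, D connected through core router B.
paths = {
    ("A", "C"): ["A-B", "B-C"],
    ("A", "D"): ["A-B", "B-D"],
    ("E", "C"): ["E-B", "B-C"],
    ("E", "D"): ["E-B", "B-D"],
}
failed = [("A", "C"), ("E", "C")]  # failure signature: probes toward C were lost

hypothesis = max_coverage(paths, failed)
print(hypothesis)  # ['B-C']
```

In this example both failed pairs share only the link `B-C`, so a single greedy pick explains the whole failure signature. With several independent failures the loop continues to add links until every observation is covered.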
--
YingxinJiang - 01 Aug 2007