University of Notre Dame NetScale Laboratory

Lockdown File Structure, Performance, Scaling

Overview

This article describes the design and implementation of the Lockdown File System (LFS) and evaluates its performance. For more information, please refer to the full paper, "ENAVis: Enterprise Network Activities Visualization," to appear in the proceedings of the USENIX 22nd Large Installation System Administration Conference (LISA '08), San Diego, CA, November 9-14, 2008. This article covers details that are not available in that paper: how the files and directories are structured, which files are expected and what they mean, how to run the program, and what preparation work is required.

LFS design and implementation

After one year of deploying the Lockdown local-context data collection system on around 200 hosts on the University of Notre Dame campus, the backend module could hardly keep up with the large amount of data gathered. There are probably two reasons for this. First, the relations among the database tables were probably not well designed at the outset, and putting everything in one table was not a good idea. Second, and most likely, once a table has billions of entries, a slight increase in the cost of a single update transaction can have a profound impact on overall processing time. We therefore concluded that a MySQL database is not suited to this purpose.

To that end, we designed a hierarchical file system that stores the data in a way that scales to thousands of hosts in an enterprise network running for many years. Since we only need to support one main query type, "give me all active records from time a to time b", we do not have to live with the overhead of a database. The Lockdown File System (LFS) divides data by timestamp into subdirectories whose basic unit is one day. The startup data, which contains the host configuration information, is divided by hostname and stored unmodified in each host directory. Raw data are processed separately by type, i.e. netstat, ps, and lsof. In each day's directory, raw data are parsed into a hashtable-like data structure in memory, on which DB-like insertion and update operations are performed. To conserve memory, the parsing engine can read a fraction of a day's data and parse a single host in each round. At the end of each round, the "finished" records (both start and stop times have been set) are appended to a file on disk belonging to that day and purged from memory, since they are no longer needed once written to disk. Three files are used, one each for netstat, ps, and lsof. The figure below shows the files and directories of LFS.
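The per-day partitioning and the end-of-round flush described above can be sketched roughly as follows. This is a minimal illustration, not the actual LFS code: the names day_dir and flush_finished, the YYYY-MM-DD directory naming, the use of UTC, the record keys, and the (start, stop) tuple layout are all assumptions, and an unset stop time is represented here as infinity.

```python
import math
from datetime import datetime, timezone

def day_dir(ts):
    """Map a Unix timestamp to a per-day directory name.

    The YYYY-MM-DD naming and the use of UTC are assumptions.
    """
    return datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%d")

def flush_finished(table):
    """Remove finished records from the in-memory table and return them.

    `table` maps a record key to a (start, stop) tuple; a stop of
    math.inf marks a record that is still active.
    """
    finished = {k: v for k, v in table.items() if not math.isinf(v[1])}
    for k in finished:
        del table[k]
    return finished

# Hypothetical in-memory state for one host after a parsing round.
table = {
    "sshd:22": (100, 250),        # finished: both times set
    "httpd:80": (120, math.inf),  # active: no stop time yet
}
done = flush_finished(table)  # these would be appended to that day's file
```

After the call, only the active record remains in memory, which is what keeps the working set small from round to round.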

LFS_directories.png

In LFS, there is no database to store past history, and searching all previously stored files from day one becomes infeasible as time goes on. The solution is for each day's directory to also save all "active" records (only the start time is set, not the stop time). For scalability, those records are subdivided by host; in other words, each host has three files for unfinished netstat, ps, and lsof records. This "active" state information is either rolled over to the next day or transitioned to the "finished" state. At the beginning of the next round of processing, the state is reconstructed by simply reading the previous file containing all "active" records for that specific host. Note that if a host goes down and never comes back, its previously stored "active" records never get a chance to be read again, since no new data arrives from that host. Those pending records are therefore not rolled over to the next day. This prevents stagnant data from being propagated indefinitely.
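The rollover rule can be illustrated with a small sketch. The function name rebuild_state and the dictionary layout are hypothetical stand-ins for the per-host "active" files; the point is only that state is reloaded for hosts that report new data, so records of a host that never returns are silently dropped.

```python
def rebuild_state(prev_active, hosts_with_new_data):
    """Reload active records only for hosts that sent new data.

    `prev_active` maps hostname -> {record key: start time}, standing in
    for the per-host "active" files of the previous day.
    """
    return {
        host: dict(records)
        for host, records in prev_active.items()
        if host in hosts_with_new_data
    }

prev_active = {
    "hostA": {"sshd:22": 100},
    "hostB": {"httpd:80": 120},  # hostB went down and never returned
}
state = rebuild_state(prev_active, {"hostA"})
```

Here hostB's pending record is never read again, so it is not carried into the next day's "active" files.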

Data Processing

The data processing module supports multi-threading. In each round of parsing, it first bucket-sorts all raw data in the "log" directory into each day's directory. This prevents ls from taking too long when one directory accumulates millions of files. The parser reads a fraction (specified on the command line) of a day's data from each day's directory and further sorts it by hostname and timestamp. Each parser thread reads raw data files and saves them into the equivalent in-memory data structure. If the diff_sign is <, the stop time is set equal to the timestamp; if the diff_sign is >, the start time is set to the timestamp and the stop time is set to infinity.
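The diff_sign rule can be written as a short sketch. The function name apply_diff, the record keys, and the (start, stop) tuple are assumptions made for illustration; the source also does not say what happens when a < line arrives for a record that was never opened, so the sketch simply leaves the start time unset in that case.

```python
import math

def apply_diff(table, key, diff_sign, ts):
    """Apply one parsed diff line to the in-memory record table."""
    if diff_sign == ">":
        # Record appeared: start now, stop not yet known.
        table[key] = (ts, math.inf)
    elif diff_sign == "<":
        # Record disappeared: keep the start time, set the stop time.
        # Assumption: an unseen key gets start=None.
        start, _ = table.get(key, (None, None))
        table[key] = (start, ts)
    else:
        raise ValueError(f"unexpected diff_sign: {diff_sign!r}")

table = {}
apply_diff(table, "sshd:22", ">", 10)
opened = table["sshd:22"]
apply_diff(table, "sshd:22", "<", 25)
closed = table["sshd:22"]
```

A record with an infinite stop time corresponds to an "active" record, and one with both times set corresponds to a "finished" record.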

Several special situations need to be considered. In netstat, for example, addresses can be in either IPv4 or IPv6 format, and a * can appear instead of a specific port number. The state can be missing for UDP traffic. In rare cases, the PID and program name can also be missing. In lsof, three fields can be missing (DEVICE, NODE, and SIZE). Experimenting with the raw data shows that if the file descriptor (FD) type is NOFD or unknown, all three fields are missing. While most file descriptors are of type REG or DIR, certain FD types have no SIZE associated with them (e.g. FIFO, CHR, IPv4, IPv6, RAW, SOCK, UNIX, DEL, etc.). These situations must be handled when parsing each record. Parsing netstat records requires some extra work, since extra information must be looked up in the ps and lsof records, e.g. the gid, ppid, args_list, and absolute pathname of the program for each netstat record. This is achieved by PID-based lookups in both the ps and lsof records. For example, the lsof record for a PID with FD=txt and type=REG gives the full path of the running program.
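The lsof cases above can be sketched as two small helpers. The names expected_fields and program_path and the dictionary record layout are hypothetical; the type sets contain only the types named in the text, whose list ends with "etc.", so they are likely incomplete.

```python
# FD types named in the text; the original list is open-ended.
NO_FIELDS = {"NOFD", "unknown"}
NO_SIZE = {"FIFO", "CHR", "IPv4", "IPv6", "RAW", "SOCK", "UNIX", "DEL"}

def expected_fields(fd_type):
    """Return which of DEVICE, NODE, SIZE should be present for a type."""
    if fd_type in NO_FIELDS:
        return []
    if fd_type in NO_SIZE:
        return ["DEVICE", "NODE"]
    return ["DEVICE", "NODE", "SIZE"]

def program_path(lsof_records, pid):
    """Find the full path of a running program via its FD=txt, REG entry."""
    for rec in lsof_records:
        if rec["pid"] == pid and rec["fd"] == "txt" and rec["type"] == "REG":
            return rec["name"]
    return None  # no matching lsof record for this PID

# Hypothetical parsed lsof records for one host.
lsof_records = [
    {"pid": 812, "fd": "cwd", "type": "DIR", "name": "/"},
    {"pid": 812, "fd": "txt", "type": "REG", "name": "/usr/sbin/sshd"},
]
path = program_path(lsof_records, 812)
```

A netstat record carrying PID 812 would then be annotated with the path found here; a PID with no matching entry yields None, which covers the rare case of missing PID information.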

The DB-like operations (insert/update/query) are supported by LFS as explained in the section "LFS design and implementation". After parsing and inserting/updating, the raw data are moved to the "Processed" directory, which is periodically and automatically archived to "Processed_gzip" for backup purposes.

For the visualization client tools to download data on demand, a file server must be set up that reads the "repository_text" directory, which contains all "ESTABLISHED" connection records extracted by an "aggregator" thread spawned every few hours by the TCP file server application. A simple protocol was designed for the client to request connectivity data and host configuration information from the server.

The GUI client then computes various statistics on the number of connections made by enterprise and local users, and on the top hosts, users, and applications making the most connections over a specified time window. Bipartite matchings are also computed, and XML files representing the graph data are generated automatically to be animated by the Prefuse library. We extend some NodeColor and NodeAction classes in the Prefuse package to support customized data exploration when nodes are clicked. The figure below shows the structure of the Lockdown file server and the GUI visualization client.
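The "top hosts, users, and applications" statistic can be illustrated with a short sketch. It is written in Python for brevity and is not the GUI client's code; the function name top_n, the record fields, and the rule that a connection counts if its start time falls inside the window are all assumptions.

```python
from collections import Counter

def top_n(records, field, n, window_start, window_end):
    """Rank values of `field` by number of connections in a time window.

    A connection is counted if its start time falls in
    [window_start, window_end); that rule is an assumption.
    """
    counts = Counter(
        rec[field] for rec in records
        if window_start <= rec["start"] < window_end
    )
    return counts.most_common(n)

# Hypothetical connection records.
records = [
    {"host": "hostA", "user": "alice", "app": "ssh", "start": 10},
    {"host": "hostA", "user": "bob", "app": "ssh", "start": 20},
    {"host": "hostB", "user": "alice", "app": "firefox", "start": 30},
    {"host": "hostA", "user": "alice", "app": "ssh", "start": 99},
]
top_hosts = top_n(records, "host", 1, 0, 50)
top_users = top_n(records, "user", 2, 0, 50)
```

The same function serves all three rankings by changing the field argument.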

server_client_directories.png

Performance Evaluation

The evaluation metrics are processing time, data cost, and memory usage of the backend data processing. First, we are concerned with processing time, which was also the main motivation for the transition away from a database. The result is very satisfying: in our case, a 6-month backlog was cleared in less than one week. For evaluation purposes, we copied three consecutive days of real raw data to a separate machine and ran the tests there. The hardware configuration is a Sun Fire X2100 with AMD Opteron dual processors, 2 GB SDRAM, and 1 TB SATA disks. The real raw data are replicated by simply changing the host names to produce 250, 500, 750, 1000, 1250, and 1500 hosts. The average processing time for one day's data is shown in the following figure.

LFS_process_time.png

The average transaction speed is 6573 records/sec. The average numbers of netstat, ps, and lsof records per host per day are roughly 10k, 5k, and 100k respectively. Memory usage is about 400-500 MB when reading half a day of data in advance.
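As a back-of-the-envelope check, the numbers above can be combined into a rough estimate of the time needed to process one day's data. This sketch assumes the average rate stays constant regardless of the number of hosts, which is a simplification; the measured figure remains the authoritative result. The function name estimated_hours is hypothetical.

```python
RATE = 6573  # average records/sec reported above
PER_HOST_PER_DAY = 10_000 + 5_000 + 100_000  # netstat + ps + lsof

def estimated_hours(hosts, rate=RATE):
    """Hours to process one day's records, assuming a constant rate."""
    return hosts * PER_HOST_PER_DAY / rate / 3600

hours_1000 = estimated_hours(1000)
print(f"{hours_1000:.1f} hours for 1000 hosts")
```

Under this assumption, one day's data for 1000 hosts takes a little under five hours.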

Data cost in terms of disk usage is not a problem. The automated processing uses gtar to pack each day's raw data into a single archive file and then removes the original data. Since the gtar archives can easily be migrated to secondary backup tape or disk, we are only concerned with the "finished" and "active" record files in the LFS directory. Although different hosts carry different loads, the cooked data in the LFS directories from our actual data over the past half year suggest that the average total size for one host per day (including both "finished" and "active" netstat, ps, lsof, and file info logs) is less than 3 MB. The average gzipped file size is only 0.3 MB per host per day. What does this mean in practice? Suppose you have one thousand hosts, a decent scale for a medium-sized enterprise network, each running an agent, and you want to keep everything from the past year: you need a 1 TB disk for LFS and an extra 100 GB if you also want to keep the gzipped archives of the original raw data. If you manage only 300 hosts and want a window of only the past month, 30 GB of disk space is sufficient for everything.
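The sizing arithmetic above can be reproduced with a few lines. The function name disk_gb is hypothetical, decimal units (1 GB = 1000 MB) are assumed, and the 3 MB figure is used as an upper bound per host per day.

```python
LFS_MB = 3.0   # upper bound per host per day, cooked LFS data
GZIP_MB = 0.3  # gzipped raw archive per host per day

def disk_gb(hosts, days, mb_per_host_day):
    """Disk space in GB (decimal units, 1 GB = 1000 MB)."""
    return hosts * days * mb_per_host_day / 1000

lfs_year = disk_gb(1000, 365, LFS_MB)            # 1000 hosts, one year
gzip_year = disk_gb(1000, 365, GZIP_MB)          # gzipped raw archives
small_site = disk_gb(300, 30, LFS_MB + GZIP_MB)  # 300 hosts, one month
```

Under these assumptions the results come to roughly 1.1 TB, 110 GB, and 30 GB, in line with the figures quoted above.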

r1 - 13 May 2009 - 14:49:10 - AaronStriegel