John McHugh's home page


I can be reached by email by using my last name at cs dot dal dot ca

John McHugh is a professor in the Faculty of Computer Science at Dalhousie University in Halifax, Nova Scotia, Canada, where he holds a Canada Research Chair in Privacy and Security and directs the Privacy and Security Laboratory.

John McHugh

John McHugh left a position as senior member of the technical staff at CERT, part of the SEI at CMU, to become the director of the new privacy and security lab at Dalhousie University. He was a professor and former chairman of the Computer Science Department at Portland State University in Portland, Oregon, where he held a Tektronix Professorship. His research interests include computer security, software engineering, and programming languages. He has previously taught at The University of North Carolina and at Duke University. He has been an active researcher in the application of formal methods to the construction of dependable and secure systems for many years. He was the architect of the Gypsy code optimizer and the Gypsy Covert Channel Analysis tool.

Dr. McHugh received his PhD degree in computer science from the University of Texas at Austin. He has an MS degree in computer science from the University of Maryland, and a BS degree in physics from Duke University. He grew up in Durham, North Carolina, leaving when he graduated from Duke. Twenty years later, he returned, demonstrating that Thomas Wolfe was wrong. After another ten years in Durham, he moved to Portland, demonstrating, perhaps, that Wolfe knew what he was talking about after all.

A recent copy of Dr. McHugh's CV is on line.


Courses

During the winter term of 2006, I taught CSCI 6905, a special topics graduate course entitled "Collection and analysis of large scale network data with a focus on network security." Notes on available data sets and potential projects can be found here.

A similar course, CSCI 6908, "Advanced Network Data Analysis and Intrusion Detection," was offered in the summer of 2006 and will be offered again during the summer of 2007. Students will perform projects involving analysis of existing data sets and / or the enhancement of analysis and collection tools or will perform experimental work in intrusion detection. Each project will produce a paper suitable for submission to an appropriate conference. See this description for additional information.


Projects and working with me

As of April 2007, I have modest research funding that can be used to support either PhD or Master's students, and I have a number of proposals outstanding that should provide additional support by fall or winter of the coming academic year.

I am interested in working with both MS and PhD students in the general area of computer security. For PhD students, I would prefer to serve as a co-advisor with another faculty member unless you have a specific project in mind that can be executed and defended within a timeframe of two to three years. For MS students, I have a number of potential projects and am also willing to consider possible projects that you may suggest.

I get a fairly large number of non-specific form letters that say that the sender is interested in my research area and would like to come to Dalhousie to work with me, but do not supply any details of the sender's interests or any indication of specific projects that they might like to work on. I am unlikely to reply to these.

Students inquiring from abroad should take the time to become familiar with the appropriate Canadian student visa / study permit requirements before inquiring. See, for example, this site for some details and be sure you understand the requirements for admission to the CS programs at Dalhousie University.

Projects that interest me

The following list is not intended to be exhaustive, but is illustrative of the kinds of things I am interested in doing. These are primarily tool building projects, but there are a variety of analysis projects available as well.

Note: A substantial portion of my recent research activities has been based on the analysis of network data obtained from routers. Much of this work has been done using the SiLK tools developed by CERT at Carnegie Mellon University. The tools are publicly available, but much of the current development effort has been focused away from extensions that enhance the tools' value for research and directed towards the operational needs of the supporting customer. It is my opinion that these tools provide an interesting paradigm for looking at large scale network data whether obtained from netflow, packet traces, log data, or other sources. The filter tool can partition data based on a number of criteria, including time, addressing information, protocols, ports, volumes, etc. Other tools can build sets or multisets (counted sets with counts based on record, packet, or byte volumes) indexed by IP address, ports, protocols, etc., and these can be manipulated to provide interesting views of network behavior over time.

The next several topics are directed towards possible extensions or enhancements of these tools.

  • Alternate data sources for SiLK analysis inputs - packets, logs, etc.

    The primary source of data for SiLK analysis consists of NetFlow data, which can be obtained from routers or devices such as AMP. The data is stored in files that have fields for source and destination IP addresses, protocol, byte and packet counts, and time. For UDP and TCP, ports are also stored, as are flags for TCP. In addition, there is a field that identifies the sensor supplying the data and several router-specific fields: input and output interface identifiers and the next hop IP address. Since the analysis is largely based on addresses, protocols, volumes, etc., any data with appropriate fields can be transformed into a form suitable for this analysis. I have developed prototype tools to convert packet data from tcpdump or similar sources into degenerate flows (1 flow record per packet), and to convert text records in the format produced by the flow listing routine back into flows. Both of these prototypes need better integration into the tool set. In particular, the packet tool needs to be integrated into the flow filtering tool as a "front end" that can merge data obtained from multiple sensors in an enterprise network, eliminating duplicate packets and tagging flows with the sensor that captured them. The integrated tool would also have the ability to filter based on packet contents and would output both packet and flow records for packets that satisfy the filtering criteria. These capabilities will allow, for example, data associated with a particular malicious code to be extracted for more detailed analysis using both flow based and packet criteria. The log data prototype accepts only a fixed format, so auxiliary scripts are necessary to manipulate the log data prior to conversion. A proper program would allow the user to specify the format of the log data and the source for each field in the flow record, including default values for fields not present in the log.
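    As a rough illustration of the user-specified log conversion described above, the following is a minimal Python sketch, not the existing prototype. The FlowRecord field names, the make_converter helper, and the firewall log format are all assumptions invented for the example; they do not reflect the actual SiLK record layout.

```python
import re
from dataclasses import dataclass


@dataclass
class FlowRecord:
    # Illustrative field names only; not the SiLK on-disk layout.
    sip: str = "0.0.0.0"
    dip: str = "0.0.0.0"
    protocol: int = 0
    sport: int = 0
    dport: int = 0
    packets: int = 1
    nbytes: int = 0
    stime: str = ""
    sensor: int = 0


INT_FIELDS = ("protocol", "sport", "dport", "packets", "nbytes", "sensor")


def make_converter(pattern, defaults=None):
    """Build a log-line converter from a regex with named groups.

    Each named group supplies one flow field; `defaults` fills in
    fields that the log does not carry.
    """
    regex = re.compile(pattern)
    defaults = dict(defaults or {})

    def convert(line):
        match = regex.search(line)
        if match is None:
            return None  # line does not fit the declared format
        values = dict(defaults)
        values.update({k: v for k, v in match.groupdict().items()
                       if v is not None})
        for name in INT_FIELDS:
            if name in values:
                values[name] = int(values[name])
        return FlowRecord(**values)

    return convert


# Hypothetical firewall log format, invented for this example.
PATTERN = (r"(?P<stime>\S+) \w+ TCP (?P<sip>[\d.]+):(?P<sport>\d+)"
           r" -> (?P<dip>[\d.]+):(?P<dport>\d+) len=(?P<nbytes>\d+)")
convert = make_converter(PATTERN, defaults={"protocol": 6, "sensor": 1})
record = convert(
    "2006-03-13T12:47:17 ACCEPT TCP 10.0.0.5:1234 -> 192.168.1.9:80 len=512")
```

    Driving the conversion from a pattern plus a table of defaults means a new log format needs only a new pattern rather than a new auxiliary script.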

  • SiLK Tools Extensions and Enhancements

    Experience to date with the SiLK tools has shown them to be a valuable addition to the security analyst's toolkit; however, many of their features could use improvement, and there are a number of additional tools that could be added to the suite. This project will generalize a number of features that are already present in the tool set and add several new features. The list of possible enhancements given below is illustrative; many others are possible.

    As an example, the bag or multiset tool is able to create counted sets (with counts based on flow records, packets, or bytes) from a number of parameters (source IP, destination IP, protocol, source port, and destination port), but lacks the ability to do this for other scalar fields such as TCP flags, sensor ID, input or output interface ID, or next hop IP. Since these fields can be assigned arbitrary values when the data comes from non-flow sources such as packet capture files or log records, the ability to cluster and count values for these fields is useful for characterizing the data or added metadata. In addition, ports are overloaded, appearing in both TCP and UDP records, and the ability to create separate multisets for each ported protocol would be useful. One of the port fields is further overloaded, being used to hold ICMP code and message information for that protocol, and the ability to create a multiset of this information would also be useful. Similar enhancements might be useful for the set tool, which can only create IP sets (source, destination, next hop) at present.
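    The generalized multiset can be sketched in a few lines of Python. This is only an illustration of the idea, assuming flow records are available as dictionaries; build_bag and the field names are hypothetical and are not part of the existing tools.

```python
from collections import Counter


def build_bag(records, key_fields, weight=None):
    """Count records keyed by any combination of scalar fields.

    weight=None counts flow records; otherwise it names the field
    (for example "packets" or "nbytes") whose value is summed.
    """
    bag = Counter()
    for rec in records:
        key = tuple(rec[f] for f in key_fields)
        bag[key] += 1 if weight is None else rec[weight]
    return bag


# Made-up records for illustration.
flows = [
    {"protocol": 6, "dport": 80, "flags": "S", "sensor": 1,
     "packets": 3, "nbytes": 180},
    {"protocol": 6, "dport": 80, "flags": "SA", "sensor": 1,
     "packets": 2, "nbytes": 120},
    {"protocol": 17, "dport": 53, "flags": "", "sensor": 2,
     "packets": 1, "nbytes": 70},
    {"protocol": 17, "dport": 80, "flags": "", "sensor": 2,
     "packets": 1, "nbytes": 40},
]

# Byte volume per sensor, a field the current bag tool cannot key on.
by_sensor = build_bag(flows, ["sensor"], weight="nbytes")

# Keying on (protocol, port) separates TCP port 80 from UDP port 80.
by_proto_port = build_bag(flows, ["protocol", "dport"])
```

    Keying on a tuple of fields is one way to handle the overloaded port fields: including the protocol in the key keeps TCP, UDP, and ICMP values in separate counts.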

    At one point, I experimented with a partitioning tool based on Bloom filters that allowed me to identify multiple flow records having the same connection parameters, i.e., source and destination addresses and ports. It had the ability to interchange source and destination information to bring together both sides of an exchange. A generalization of this tool, with the ability to include arbitrary fields or partial fields (e.g., masked addresses that allow all hosts on a subnet to be considered as one), would be a powerful addition to the tools.
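    A minimal sketch of this kind of partitioning follows. It is not the original experimental tool; the BloomFilter class, the two-pass scheme, and the record field names are assumptions made for the example. Endpoints are sorted so that both directions of an exchange produce the same key, and addresses can be masked to a prefix length.

```python
import hashlib
import ipaddress


class BloomFilter:
    def __init__(self, size_bits=1 << 20, num_hashes=4):
        # num_hashes must be at most 8: positions are cut from one
        # 32-byte SHA-256 digest, four bytes each.
        self.size = size_bits
        self.k = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, key):
        digest = hashlib.sha256(key).digest()
        for i in range(self.k):
            yield int.from_bytes(digest[4 * i:4 * i + 4], "big") % self.size

    def add(self, key):
        """Insert key; return True if it was (probably) already present."""
        present = True
        for pos in self._positions(key):
            byte, bit = divmod(pos, 8)
            if not self.bits[byte] & (1 << bit):
                present = False
                self.bits[byte] |= 1 << bit
        return present

    def __contains__(self, key):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(key))


def connection_key(rec, prefix_len=32):
    """Direction-independent key; prefix_len < 32 merges hosts on a subnet."""
    def endpoint(addr, port):
        net = ipaddress.ip_network(f"{addr}/{prefix_len}", strict=False)
        return (int(net.network_address), port)

    lo, hi = sorted((endpoint(rec["sip"], rec["sport"]),
                     endpoint(rec["dip"], rec["dport"])))
    return repr((rec["protocol"], lo, hi)).encode()


def partition(records, prefix_len=32):
    """Split records into those whose key occurs more than once and the rest."""
    seen, repeated = BloomFilter(), BloomFilter()
    for rec in records:  # pass 1: find keys that occur at least twice
        key = connection_key(rec, prefix_len)
        if seen.add(key):
            repeated.add(key)
    matched, unmatched = [], []
    for rec in records:  # pass 2: route each record
        key = connection_key(rec, prefix_len)
        (matched if key in repeated else unmatched).append(rec)
    return matched, unmatched
```

    Because Bloom filters admit false positives, a small fraction of unmatched records can end up in the matched group; an exact check on that much smaller group would remove them.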

    A flow consolidator would also be useful. Long or intermittent sessions are spread over multiple flow records due to timeout policies in netflow. Identifying flow records that belong to the same session and merging them would be useful, as would bringing the two sides of a bidirectional flow together into a single record. In practice, the amount of noise in the data makes any straightforward consolidation difficult. Bloom filters can be used to identify data that can be merged and eliminate data that has no matches. Since the latter may be as much as 90% of the total, this makes the matching process much more efficient.
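    The merging step of such a consolidator might look like the sketch below, which joins records with identical connection parameters when the gap between them is small. The consolidate function, the max_gap threshold, and the field names (with times as plain numbers of seconds) are assumptions for illustration; the sketch ignores the noise problem and does not join the two directions of a session.

```python
def consolidate(records, max_gap=60.0):
    """Merge flow records of one session that a timeout split apart.

    Records with the same 5-tuple are joined when the next one starts
    within max_gap seconds of the end of the previous one.
    """
    open_sessions = {}  # 5-tuple -> most recent merged record
    merged = []
    for rec in sorted(records, key=lambda r: r["stime"]):
        key = (rec["sip"], rec["dip"], rec["protocol"],
               rec["sport"], rec["dport"])
        cur = open_sessions.get(key)
        if cur is not None and rec["stime"] - cur["etime"] <= max_gap:
            cur["etime"] = max(cur["etime"], rec["etime"])
            cur["packets"] += rec["packets"]
            cur["nbytes"] += rec["nbytes"]
        else:
            cur = dict(rec)  # copy so the input records stay untouched
            open_sessions[key] = cur
            merged.append(cur)
    return merged
```

    In practice a Bloom filter pass of the kind described above would run first, so that only records with at least one potential partner reach this step.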

    There are a number of performance related enhancements that would be worth including. For many analyses, files containing gigabytes of data are processed. The processing time is dominated by disk read time (jobs are heavily I/O bound) when the work is done on a fast processor with lots of memory. Two changes would greatly speed processing. One is to give all the tools the ability to read and write compressed data files. In many cases, the jobs remain I/O bound even when the increase in processing and the decrease in I/O are taken into account. The other is to extend the pipelining capability built into some, but not all, of the programs. This would allow all programs to pass their input, unmodified, to a pipe where it would be buffered in memory for the next program in the chain. This has proven to be highly effective for the filter program and a few others and could easily be added to any program that operates on the flow data formats. There are a number of cases where better choices of data structures would make for substantial speedups, especially in the multiset tools.

  • Real time extensions to SiLK tools

    As originally designed, the SiLK tools were intended for retrospective analysis, but the concepts appear to be useful for real time or near real time monitoring of networks. Using real time packet capture data or a modified flow collector, it would be possible to pipe data through the packet conversion and filtering tool in a continuous stream. Tools such as the set and multiset tools could be configured to pass on their results on a periodic basis, based either on a relatively short time interval or the occurrence of a predetermined flow volume. The time series of sets and multisets thus produced could then be displayed and input into other tools that would compare the current results with historical data and alert the user when network properties differ from historical or expected values. The advantage of this approach is that it allows parallel capture and examination of a large number of network parameters so that the context for changes can easily be seen. The short term sets and multisets can also be aggregated to provide a basis for long term analysis and trending.
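    The periodic emission of multisets could be sketched as a Python generator over a time-ordered record stream. The windowed_bags name, its parameters, and the record fields are hypothetical; a window closes when either the time interval or the flow count limit is reached.

```python
from collections import Counter


def windowed_bags(records, key, interval=60.0, max_flows=None):
    """Yield (window_start, bag) pairs from a time-ordered flow stream.

    A window is closed after `interval` seconds or, if max_flows is
    given, once it already holds that many flow records.
    """
    bag = Counter()
    window_start = None
    count = 0
    for rec in records:
        t = rec["stime"]
        if window_start is None:
            window_start = t
        full = max_flows is not None and count >= max_flows
        if t - window_start >= interval or full:
            yield window_start, bag
            bag, window_start, count = Counter(), t, 0
        bag[rec[key]] += 1
        count += 1
    if count:
        yield window_start, bag  # flush the final partial window
```

    Since each yielded bag is a Counter, short term results can be summed with the + operator to build the longer term aggregates mentioned above.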

  • Traffic characterizations for SiLK tools

    The fact that it does not capture payload information is both a strength and a weakness of the SiLK tools approach. Omitting payload greatly reduces the storage requirement, but makes it difficult to determine whether two flows were identical or similar. For some malicious code, the combination of protocol, port, and length may provide a strong indication that a given set of flows were equivalent, but stronger evidence is desired. For packet or multiflow sessions, it would be useful to have a payload characterization that would compose as records are combined. We have considered a number of possibilities ranging from computation of 8-bit checksums over pure payloads to byte profiling in the manner of Sal Stolfo's Payl system, which uses byte frequencies. The checksum has the advantage of composing, but provides no insight into the nature of the contents. Payl can require up to 256 counters; reasonable clusterings can be recorded more compactly, but it is unclear whether the more compact representations compose. The thrust of this project would be to investigate methods for providing traffic characterization at the session level in ways that can be recorded in no more than a few bytes. This might take the form of a small feature vector or one of a small set of predefined characterization values.
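    To make "compose" concrete: a characterization composes if the value for a combined record can be computed from the values of its parts without access to the payloads. The sketch below shows this for a simple additive 8-bit checksum and for a raw 256-bin byte histogram. Both are assumptions chosen for illustration; the source does not fix a particular checksum, and the histogram is only the raw byte counts, not Payl's actual model.

```python
def checksum8(payload: bytes) -> int:
    """Additive 8-bit checksum (one possible choice, assumed here)."""
    return sum(payload) % 256


def combine_checksums(a: int, b: int) -> int:
    """Checksum of two concatenated payloads, from their checksums alone."""
    return (a + b) % 256


def byte_histogram(payload: bytes) -> list:
    """Raw byte-frequency profile: one counter per byte value."""
    hist = [0] * 256
    for value in payload:
        hist[value] += 1
    return hist


def combine_histograms(h1: list, h2: list) -> list:
    """Histogram of two concatenated payloads, from their histograms alone."""
    return [x + y for x, y in zip(h1, h2)]


p1, p2 = b"GET /index.html", b" HTTP/1.0\r\n\r\n"
assert checksum8(p1 + p2) == combine_checksums(checksum8(p1), checksum8(p2))
assert byte_histogram(p1 + p2) == combine_histograms(byte_histogram(p1),
                                                     byte_histogram(p2))
```

    The open question in the text is whether a compact summary derived from the histogram, small enough to fit in a few bytes, still has this property.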

    The next two projects assume that a source of suitable data is available. We have some data sets available, but it would be preferable to do these projects with a real, continuously captured live stream.

  • Long term traffic profiling, trending, and archiving

    Netflow data is a much more compact representation than full packet capture or even packet header data; nonetheless, its volume tends to add up over time. This project will investigate a number of ways in which long term data can be aggregated and retained. The basic philosophy is that a loss of precision for long ago events is tolerable. For example, during an attack, it is useful to examine each packet or flow involving the attackers and victims in detail. When a compromised machine is found, it may be useful to try to identify the session in which the compromise occurred, even if it happened weeks or months ago. On the other hand, the packet by packet behavior of an attacker who performed a wholesale scan of a sparsely populated network and got no responses from any machine can easily be abstracted into a single record noting the approximate time of the scan, the netblock scanned, and the fact that no one answered. In general, traffic sent to empty addresses can be consolidated fairly quickly.

    We can use counted sets to track hosts that are active and consolidate their activities over some period of time. Similar aggregations can track IP level protocol usage and service usage for TCP and UDP. These sets and multisets can be computed at regular intervals and used for trending, identification of anomalous changes in behavior, and the like. We note that a substantial portion of the traffic that we have seen in the past is very low frequency, with a large number of hosts generating a single flow per hour and not returning for days, weeks, or months, if at all. Many of these observations may represent spoofed addresses. One objective of the study will be to try to understand this class of traffic and find efficient ways to keep it for long term analysis since it may contain stealthy malicious activity.
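    The scan abstraction described earlier in this topic can be sketched as follows. This is a hedged illustration only: summarize_scans, the fixed /24 netblock size, the set of responding hosts, and the record fields are all assumptions, and a real tool would need a better way to decide that no one answered.

```python
import ipaddress


def summarize_scans(flows, responders, prefix_len=24):
    """Collapse unanswered flows into one record per (source, netblock).

    `responders` is the set of destination addresses known to have
    answered; flows to those hosts are left out of the summary.
    """
    summaries = {}
    for f in flows:
        if f["dip"] in responders:
            continue
        block = ipaddress.ip_network(f"{f['dip']}/{prefix_len}", strict=False)
        key = (f["sip"], str(block))
        s = summaries.setdefault(key, {
            "sip": f["sip"], "netblock": str(block),
            "first": f["stime"], "last": f["stime"], "targets": set(),
        })
        s["first"] = min(s["first"], f["stime"])
        s["last"] = max(s["last"], f["stime"])
        s["targets"].add(f["dip"])
    # Keep only the count of distinct targets, not the addresses.
    return [dict(s, targets=len(s["targets"]), answered=False)
            for s in summaries.values()]
```

    Each summary keeps the approximate time span, the netblock, and the number of distinct targets, which is the precision the text argues is sufficient for long ago events.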

  • Visualizations

    The data with which we are working is too voluminous to examine in tabular form until a great deal of selection filtering has been performed. We have developed some preliminary visualization tools that have proven useful in providing insight into network behavior at scales ranging from individual machines to the entire monitored border. One of these led to the discovery of a contact frequency anomaly that was ultimately traced to a particular feature (the presence of a "sleep" system call) in the code of the WelchiaB worm. Another is useful in identifying the roles of individual machines on the monitored network. We propose to continue these efforts and simplify their use under the mantra of "Insight, not just pretty pictures." We will be happy to provide example illustrations.

    The final two topics are related, but it seems reasonable to factor the problem into two components.

  • Circular buffer traffic capture device

    Continuous full packet traffic capture is resource intensive, especially on a loaded, high speed network. At the same time, traffic abstractions such as netflow fail to provide the detail necessary to understand certain attacks. The proposed capture device is conceptually very simple. A full packet capture interface monitors the network and deposits all the traffic it sees into a circular memory buffer, overwriting old traffic with new as the buffer fills. Depending on the traffic rate, the buffer size, and possibly a filtering policy that drops "uninteresting" traffic before it reaches the buffer, the buffer will contain traffic for the recent past, a period that may last from seconds (a 10 Gbit/s rate with, say, 8 Gbytes of buffer) up to minutes or even hours. The device is capable of dumping the captured traffic to an external storage device upon receipt of a trigger event and will continue to do so until told to stop. This provides context for the cause of the trigger and allows, for example, the initial infection packets for a worm that is noticed because it opens an unexpected port when the infection is successful to be captured.
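    A small Python model of the buffer and of the retention arithmetic is given below. It is a sketch of the concept, not a design for the device: the PacketRing class and its methods are hypothetical, and 8 Gbytes is taken to mean 8 x 10^9 bytes.

```python
from collections import deque


def retention_seconds(buffer_bytes, link_bits_per_second):
    """How long a buffer lasts on a link running at full line rate."""
    return buffer_bytes * 8 / link_bits_per_second


class PacketRing:
    """Byte-budgeted circular buffer that evicts the oldest packets first."""

    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.used = 0
        self.packets = deque()

    def capture(self, timestamp, data):
        self.packets.append((timestamp, data))
        self.used += len(data)
        while self.used > self.capacity and self.packets:
            _, old = self.packets.popleft()  # overwrite the oldest traffic
            self.used -= len(old)

    def dump(self):
        """On a trigger, hand back everything currently buffered."""
        return list(self.packets)


# The example from the text: 8 Gbytes of buffer at 10 Gbit/s.
seconds = retention_seconds(8 * 10**9, 10 * 10**9)  # about 6.4 seconds
```

    At full line rate the example buffer holds only about 6.4 seconds of traffic, which is why the timeliness of the trigger and any pre-buffer filtering policy matter so much. The sketch returns a snapshot on a trigger; the proposed device would also keep streaming new traffic to storage until told to stop.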

  • Anomalous traffic triggers

    This project will develop criteria for the generation of alerts that can serve as trigger events for the traffic capture device described in the previous section. One possible alert source would be an IDS such as Bro, Snort, or a proprietary system. Other sources include traffic anomalies, such as those developed from traffic profiling and monitoring of the sort described in topic 6, and the results of honeypot interactions. An objective of the project will be to balance the precision of the alerts against the need to develop them in a timely fashion so that relevant traffic is still available.

    Last modified: Mon Mar 13 12:47:17 AST 2006