John McHugh is a professor in the Faculty of Computer Science at Dalhousie University in Halifax, Nova Scotia, Canada where he holds a Canada Research Chair in Privacy and Security and directs the Privacy and Security Laboratory.
John McHugh left a position as senior member of the technical staff at CERT, part of the SEI at CMU, to become the director of the new privacy and security lab at Dalhousie University. He was a professor and former chairman of the Computer Science Department at Portland State University in Portland, Oregon where he held a Tektronix Professorship. His research interests include computer security, software engineering, and programming languages. He has previously taught at The University of North Carolina and at Duke University. He has been an active researcher in the application of formal methods to the construction of dependable and secure systems for many years. He was the architect of the Gypsy code optimizer and the Gypsy Covert Channel Analysis tool.
Dr. McHugh received his PhD degree in computer science from the University of Texas at Austin. He has a MS degree in computer science from the University of Maryland, and a BS degree in physics from Duke University. He grew up in Durham, North Carolina, leaving when he graduated from Duke. Twenty years later, he returned, demonstrating that Thomas Wolfe was wrong. After another ten years in Durham, he moved to Portland, demonstrating, perhaps, that Wolfe knew what he was talking about after all.
A recent copy of Dr. McHugh's CV is on line.
During the winter term of 2006, I taught CSCI 6905, a special topics graduate course entitled "Collection and analysis of large scale network data with a focus on network security." Notes on available data sets and potential projects can be found here.
A similar course, CSCI 6908, "Advanced Network Data Analysis and Intrusion Detection," was offered in the summer of 2006 and will be offered again during the summer of 2007. Students will perform projects involving analysis of existing data sets and / or the enhancement of analysis and collection tools or will perform experimental work in intrusion detection. Each project will produce a paper suitable for submission to an appropriate conference. See this description for additional information.
As of April 2007, I have modest research funding that can be used to support either PhD students or Masters students, and I have a number of proposals outstanding that should provide additional support by fall or winter of the coming academic year.
I am interested in working with both MS and PhD students in the general area of computer security. For PhD students, I would prefer to serve as a co-advisor with another faculty member unless you have a specific project in mind that can be executed and defended within a timeframe of two to three years. For MS students, I have a number of potential projects and am also willing to consider possible projects that you may suggest.
I get a fairly large number of non-specific form letters that say that the sender is interested in my research area and would like to come to Dalhousie to work with me, but do not supply any details of the sender's interest or any indication of specific projects that they might like to work on. I am unlikely to reply to these.
Students inquiring from abroad should take the time to become familiar with the appropriate Canadian student visa / study permit requirements before inquiring. See, for example, this site for some details and be sure you understand the requirements for admission to the CS programs at Dalhousie University.
The following list is not intended to be exhaustive, but is illustrative of the kinds of things I am interested in doing. These are primarily tool building projects, but there are a variety of analysis projects available, as well.
Note: A substantial portion of my recent research activity has been based on the analysis of network data obtained from routers. Much of this work has been done using the SiLK tools developed by CERT at Carnegie Mellon University. The tools are publicly available, but much of the current development effort has been focused away from extensions that enhance the tools' value for research and directed towards the operational needs of the supporting customer. It is my opinion that these tools provide an interesting paradigm for looking at large scale network data, whether obtained from NetFlow, packet traces, log data, or other sources. The filter tool can partition data based on a number of criteria, including time, addressing information, protocols, ports, volumes, etc. Other tools can build sets or multisets (counted sets with counts based on record, packet, or byte volumes) indexed by IP address, ports, protocols, etc., and these can be manipulated to provide interesting views of network behavior over time.
The next several topics are directed towards possible extensions or enhancements of these tools.
The primary source of data for SiLK analysis consists of NetFlow data
which can be obtained from routers or devices such as AMP. The data
is stored in files that have fields for source and destination IP
addresses, protocol, byte and packet counts, and time. For UDP and
TCP, ports are also stored, as are flags for TCP. In addition, there is a
field that identifies the sensor supplying the data and several router
specific fields, input and output interface identifiers and the next
hop IP address. Since the analysis is largely based on addresses,
protocols, volumes, etc., any data with appropriate fields can be
transformed into a form suitable for this analysis. I have developed
prototype tools to convert packet data from TCPdump or similar sources
into degenerate flows (1 flow record per packet), and to convert text
records in the format produced by the flow listing routine back into
flows. Both of these prototypes need better integration into the tool
set. In particular, the packet tool needs to be integrated into the
flow filtering tool as a "front end" that can merge data obtained from
multiple sensors in an enterprise network, eliminating duplicate
packets and tagging flows with the sensor that captured them. The
integrated tool would also have the ability to filter based on packet
contents and would output both packet and flow records for packets
that satisfy the filtering criteria. These capabilities will allow,
for example, data associated with a particular malicious code to be
extracted for more detailed analysis using both flow based and packet
criteria. The log data prototype accepts only a fixed format, so
auxiliary scripts are necessary to manipulate the log data prior to
conversion. A proper program would allow the user to specify the
format of the log data and the source for each field in the flow
record, including default values for fields not present in the log.
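A user-specified log format of the kind described above might be sketched as follows. This is a minimal Python illustration, not the prototype itself: the FlowRecord layout, its field names, and the log_to_flows function are hypothetical, and the log is assumed to be delimited text with one record per line.

```python
import csv
from dataclasses import dataclass, replace


@dataclass
class FlowRecord:
    # Hypothetical in-memory record; the names are illustrative and do not
    # reproduce the SiLK file format.
    sip: str = "0.0.0.0"
    dip: str = "0.0.0.0"
    protocol: int = 0
    sport: int = 0
    dport: int = 0
    packets: int = 1
    nbytes: int = 0
    start_time: float = 0.0
    sensor: int = 0


def log_to_flows(lines, columns, defaults=None, delimiter=","):
    """Yield one FlowRecord per delimited log line.

    columns maps a flow field name to the zero-based log column that
    supplies it; defaults gives values for fields the log does not carry.
    """
    template = FlowRecord(**(defaults or {}))
    for row in csv.reader(lines, delimiter=delimiter):
        if not row:
            continue  # skip blank lines
        values = {}
        for name, column in columns.items():
            # Convert the text to the type of the field's default value.
            convert = type(getattr(template, name))
            values[name] = convert(row[column].strip())
        yield replace(template, **values)


log = ["10.0.0.1,192.168.1.5,6,443,1500", "10.0.0.2,192.168.1.5,17,53,80"]
mapping = {"sip": 0, "dip": 1, "protocol": 2, "dport": 3, "nbytes": 4}
flows = list(log_to_flows(log, mapping, defaults={"sensor": 7}))
```

Here the column mapping plays the role of the format specification, and the defaults argument tags every converted record with a sensor value that the log itself does not contain.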
Experience to date with the SiLK tools has shown them to be a valuable
addition to the security analyst's toolkit; however, many of their
features could use improvement and there are a number of additional
tools that could be added to the suite. This project will generalize
a number of features that are already present in the tool set and add
several new features. The list of possible enhancements given below
is illustrative; many others are possible.
As an example, the bag or multiset tool is able to create counted sets
(with counts based on flow records, packets, or bytes) from a number
of parameters, source IP, destination IP, protocol, source port, and
destination port, but lacks the ability to do this for other scalar
fields such as TCP flags, sensor ID, input or output interface ID or
next hop IP. Since these fields can be assigned arbitrary values when
the data comes from non flow sources such as packet capture files or
log records, the ability to cluster and count values for these fields
is useful for characterizing the data or added meta data. In
addition, ports are overloaded, appearing in both TCP and UDP records,
and the ability to create separate multisets for each ported protocol
would be useful. One of the port fields is further overloaded, being
used to hold ICMP code and message information for that protocol, and
the ability to create a multiset of this information would also be
useful. Similar enhancements might be useful for the set tool which
can only create IP sets (source, destination, next hop) at present.
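The generalized multiset described above can be illustrated with a short Python sketch. It assumes flows are available as dictionaries with hypothetical field names (protocol, dport, packets, bytes, tcp_flags); build_bag is an illustrative helper, not part of the existing tool set.

```python
from collections import Counter


def build_bag(flows, key_field, measure="records", protocol=None):
    """Build a counted set keyed on any scalar flow field.

    measure is "records", "packets", or "bytes". If protocol is given,
    only flows of that IP protocol are counted, which separates the
    overloaded port fields (e.g. TCP ports from UDP ports).
    """
    bag = Counter()
    for flow in flows:
        if protocol is not None and flow["protocol"] != protocol:
            continue
        weight = 1 if measure == "records" else flow[measure]
        bag[flow[key_field]] += weight
    return bag


flows = [
    {"protocol": 6, "dport": 80, "packets": 3, "bytes": 300, "tcp_flags": "S"},
    {"protocol": 17, "dport": 80, "packets": 1, "bytes": 50, "tcp_flags": ""},
    {"protocol": 6, "dport": 80, "packets": 2, "bytes": 120, "tcp_flags": "SA"},
]
tcp_ports = build_bag(flows, "dport", measure="packets", protocol=6)
flag_counts = build_bag(flows, "tcp_flags")
```

Because the key is just a field name, the same routine covers TCP flags, sensor ID, interface IDs, or any metadata stored in those fields by a non-flow source.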
At one point, I experimented with a partitioning tool based on
Bloom filters that allowed me to identify multiple flow records having
the same connection parameters, i.e. source and destination addresses
and ports. It had the ability to interchange source and destination
information to bring together both sides of an exchange. A
generalization of this tool, with the ability to include arbitrary
fields or partial fields (e.g. masked addresses that allow all hosts
on a subnet to be considered as one) would be a powerful addition to
the tools.
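One way such a Bloom filter partitioning might work is sketched below in Python. This is an assumption-laden illustration rather than the original experiment: the BloomFilter class, the connection_key and partition_by_connection helpers, and the dictionary field names are all hypothetical, and the address masking handles IPv4 only.

```python
import hashlib
import ipaddress


class BloomFilter:
    """Small Bloom filter; false positives are possible, false negatives are not."""

    def __init__(self, size_bits=1 << 16, num_hashes=4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(bytes([i]) + key).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))


def connection_key(flow, prefix_len=32):
    """Direction-independent key; prefix_len < 32 treats a subnet as one host."""
    def endpoint(ip, port):
        net = ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False)
        return (int(net.network_address), port)

    a = endpoint(flow["sip"], flow["sport"])
    b = endpoint(flow["dip"], flow["dport"])
    low, high = sorted((a, b))  # ordering the endpoints merges both directions
    return repr((flow["protocol"], low, high)).encode()


def partition_by_connection(flows, prefix_len=32):
    """Split a list of flows into those whose key occurs more than once and the rest."""
    seen_once, seen_again = BloomFilter(), BloomFilter()
    for flow in flows:
        key = connection_key(flow, prefix_len)
        if key in seen_once:
            seen_again.add(key)
        else:
            seen_once.add(key)
    matched, unmatched = [], []
    for flow in flows:
        key = connection_key(flow, prefix_len)
        (matched if key in seen_again else unmatched).append(flow)
    return matched, unmatched
```

The two passes use a fixed amount of memory regardless of the number of distinct keys, at the cost of occasionally placing an unmatched record in the matched partition.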
A flow consolidator would also be useful. Long or intermittent
sessions are spread over multiple flow records due to timeout policies
in NetFlow. Identifying flow records that belong to the same session
and merging them would be useful, as would bringing the two sides of a
bidirectional flow together into a single record. In practice, the
amount of noise in the data makes any straightforward consolidation
difficult. Bloom filters can be used to identify data that can be
merged and eliminate data that has no matches. Since the latter may
be as much as 90% of the total, this makes the matching process much
more efficient.
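The merging step of a consolidator might look like the following Python sketch. It handles only the simple case of unidirectional records that share the same addresses, ports, and protocol; the dictionary field names, the consolidate function, and the max_gap threshold are assumptions made for illustration, and noisy data would need more care than this.

```python
def session_key(flow):
    return (flow["sip"], flow["dip"], flow["sport"], flow["dport"], flow["protocol"])


def consolidate(flows, max_gap=60.0):
    """Merge records of the same session whose gap is at most max_gap seconds.

    Each flow is a dict with sip, dip, sport, dport, protocol, start, end,
    packets, and bytes. The input records are not modified.
    """
    merged = []
    for flow in sorted(flows, key=lambda f: (session_key(f), f["start"])):
        last = merged[-1] if merged else None
        if (last is not None
                and session_key(last) == session_key(flow)
                and flow["start"] - last["end"] <= max_gap):
            # Continuation of the previous record: extend it and add volumes.
            last["end"] = max(last["end"], flow["end"])
            last["packets"] += flow["packets"]
            last["bytes"] += flow["bytes"]
        else:
            merged.append(dict(flow))
    return merged
```

Sorting by session key and start time makes continuation records adjacent, so a single pass suffices once unmatched records have been filtered out.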
There are a number of performance related enhancements that would be
worth including. For many analyses, files containing gigabytes of
data are processed. The processing time is dominated by disk read
time (jobs are heavily I/O bound) when the work is done on a fast
processor with lots of memory. Two changes would greatly speed
processing. One is to give all the tools the ability to read and write
compressed data files. In many cases, the jobs remain I/O bound even
when the increase in processing and the decrease in I/O are taken into
account. The other is to extend the pipelining capability built into
some, but not all, of the programs. This would allow all programs to
pass their input, unmodified, to a pipe where it would be buffered in
memory for the next program in the chain. This has proven to be
highly effective for the filter program and a few others and could
easily be added to any program that operates on the flow data
formats. There are a number of cases where better choices of data
structures would make for substantial speedups, especially in the
multiset tools.
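Both ideas can be illustrated in a few lines of Python. The sketch assumes gzip as the compression format and treats records as opaque byte strings; open_flow_file and pass_through are hypothetical names, not features of the existing tools.

```python
import gzip


def open_flow_file(path):
    """Open a data file for reading, decompressing it if it is gzip-compressed."""
    with open(path, "rb") as probe:
        magic = probe.read(2)
    if magic == b"\x1f\x8b":  # gzip magic number
        return gzip.open(path, "rb")
    return open(path, "rb")


def pass_through(records, sink):
    """Copy each record, unmodified, to sink while yielding it for local use."""
    for record in records:
        sink.write(record)
        yield record
```

A tool built this way can compute its own result from the yielded records while the next program in the chain reads the same records from the sink, so the data is read from disk only once.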
As originally designed, the SiLK tools were intended for retrospective
analysis, but the concepts appear to be useful for real time or near
real time monitoring of networks. Using real time packet capture data
or a modified flow collector, it would be possible to pipe data
through the packet conversion and filtering tool in a continuous
stream. Tools such as the set and multiset tools could be configured
to pass on their results on a periodic basis, based either on a
relatively short time interval or the occurrence of a predetermined
flow volume. The time series of sets and multisets thus produced could
then be displayed and input into other tools that would compare the
current results with historical data and alert the user when network
properties differ from historical or expected values. The advantage
of this approach is that it allows parallel capture and examination of
a large number of network parameters so that the context for changes
can easily be seen. The short term sets and multisets can also be
aggregated to provide a basis for long term analysis and trending.
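The periodic multisets and the comparison against history could be prototyped along the following lines. This Python sketch is only illustrative: the field names, the fixed time interval, and the simple ratio test in deviations are assumptions, and a real monitor would use a better statistical baseline.

```python
from collections import Counter, defaultdict


def interval_bags(flows, interval=300.0, key_field="dport", measure="bytes"):
    """Group flows into fixed time slots and build one multiset per slot."""
    bags = defaultdict(Counter)
    for flow in flows:
        slot = int(flow["start"] // interval)
        bags[slot][flow[key_field]] += flow[measure]
    return dict(bags)


def deviations(current, baseline, ratio=3.0):
    """Return keys that are new or exceed ratio times their baseline count.

    The result maps each flagged key to (current count, baseline count).
    """
    flagged = {}
    for key, count in current.items():
        expected = baseline.get(key, 0)
        if expected == 0 or count > ratio * expected:
            flagged[key] = (count, expected)
    return flagged


def aggregate(bags):
    """Combine short term multisets into one long term multiset."""
    return sum(bags.values(), Counter())
```

Since counted sets add, the short term results can be rolled up into hourly or daily summaries without revisiting the underlying flow data.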
The fact that it does not capture payload information is both a
strength and a weakness of the SiLK tools approach. Omitting payload
makes the storage requirement much smaller, but makes it difficult to
determine if two flows were identical or similar. For some malicious
code, the combination of protocol, port, and length may provide a
strong indication that a given set of flows were equivalent, but
stronger evidence is desired. For packet or multiflow sessions, it
would be useful to have a payload characterization that would compose
as records are combined. We have considered a number of possibilities
ranging from computation of 8 bit checksums over pure payloads to byte
profiling in the manner of Sal Stolfo's Payl system which uses byte
frequencies. The checksum has the advantage of composing, but
provides no insight into the nature of the contents. Payl can require
up to 256 counters. Reasonable clustering can be recorded more
compactly, but it is unclear whether the more compact representations
compose. The thrust of this project would be to investigate methods
for providing traffic characterization at the session level in ways
that can be recorded in no more than a few bytes. This might take the
form of a small feature vector or one of a small set of predefined
characterization values.
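What it means for a characterization to compose can be shown with a small Python sketch. It assumes the 8 bit checksum is a simple sum of payload bytes modulo 256, and it uses raw byte counts in the spirit of byte frequency profiling rather than the Payl model itself; all function names are illustrative.

```python
def checksum8(payload):
    """8 bit additive checksum over a payload."""
    return sum(payload) % 256


def combine_checksums(a, b):
    """Checksum of two concatenated payloads, computed from their checksums."""
    return (a + b) % 256


def byte_histogram(payload):
    """Count of each of the 256 possible byte values in a payload."""
    counts = [0] * 256
    for byte in payload:
        counts[byte] += 1
    return counts


def combine_histograms(a, b):
    """Histogram of two concatenated payloads, computed from their histograms."""
    return [x + y for x, y in zip(a, b)]


first, second = b"GET /index.html", b" HTTP/1.0\r\n\r\n"
whole = first + second
composes = (
    combine_checksums(checksum8(first), checksum8(second)) == checksum8(whole)
    and combine_histograms(byte_histogram(first), byte_histogram(second))
    == byte_histogram(whole)
)
```

Both summaries can be merged when records are combined without access to the original payloads. The additive checksum takes one byte but is insensitive to byte order, while the full histogram takes 256 counters, which is the gap a compact feature vector would need to fill.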
The next two projects assume that a source of suitable data is
available. We have some data sets available, but it would be
preferable to do these projects with a real, continuously captured
live stream.
We can use counted sets to track hosts who are active and
consolidate their activities over some period of time. Similar
aggregations can track IP level protocol usage and service usage for
TCP and UDP. These sets and multisets can be computed at regular
intervals and used for trending, identification of anomalous changes
in behavior, and the like. We note that a substantial portion of the
traffic that we have seen in the past is very low frequency with a
large number of hosts generating a single flow per hour and not
returning for days, weeks or months, if at all. Many of these
observations may represent spoofed addresses. One objective of the
study will be to try to understand this class of traffic and find
efficient ways to keep it for long term analysis since it may contain
stealthy malicious activity.
The final two topics are related, but it seems reasonable to factor
the problem into two components.
This project will develop criteria for the generation of alerts that
can serve as trigger events for the traffic capture device described
in the previous section. One possible alert source would be an
IDS such as Bro, Snort, or a proprietary system. Other sources
include traffic anomalies such as those developed from traffic
profiling and monitoring of the sort described in topic 6, and the
results of honeypot interactions. An objective of the project will be
to balance the precision of the alerts against the need to develop
them in a timely fashion so that relevant traffic is still available.
Last modified: Mon Mar 13 12:47:17 AST 2006