Automated Network Monitoring

Text from a presentation made at the M.I.T. Workshop on Internet Survey Methodology and Web Demographics, 29-30 January 1996, Cambridge, MA, USA, 30 January 1996

Marc Abrams and Stephen Williams

Computer Science Department
Virginia Tech, Blacksburg, VA 24061-0106

abrams@vt.edu, williams@csgrad.cs.vt.edu

http://www.cs.vt.edu/~chitra/www.html


Taxonomy

Logging: Shows what people do
Surveying: Shows who's doing it

The Four Logging Methods

By "Footprints" (needs software on each client)
At Proxy (only works with client preferences)
At Server (easy, since Web servers do it)
By Network Monitoring: (listen to all HTTP packets on a broadcast network)

Automated Network Logging

[Figure omitted: computer network diagram showing which components we log]

What Network Monitoring Can Do

Transparency: No changes to any client or server are required
Security: No one has access to data except owner of monitor machine (such as auditor)
Performance: Monitor never writes packets

Dark Side of Network Monitoring

Employer could monitor what employees are doing.

(But did you realize this could already be happening without network monitoring?)

Up Side of Network Monitoring

No sampling or self-selection problems: You get all document requests.
Can compute a rating for a set of monitored clients
You get more information per URL than from server or proxy log
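
The slides do not define how a rating is computed. One plausible reading, by analogy with broadcast ratings, is the fraction of monitored clients that requested at least one document from a given server. The sketch below assumes that definition; the function name `site_ratings` and the input layout are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def site_ratings(requests, monitored_clients):
    """Return {server host: fraction of monitored clients that visited it}.

    requests: iterable of (client, absolute_url) pairs taken from the log.
    monitored_clients: set of all client names on the monitored network.
    Both the definition of "rating" and this layout are assumptions.
    """
    visitors = defaultdict(set)
    for client, url in requests:
        host = urlsplit(url).hostname
        if host and client in monitored_clients:
            visitors[host].add(client)
    total = len(monitored_clients)
    if total == 0:
        return {}
    return {host: len(clients) / total for host, clients in visitors.items()}
```

Because the monitor sees every request on the network, the denominator is the full client population rather than a sample.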

Log File Record

Our log records contain the following fields; the first six are in the common log format generated by popular Web servers.

Client machine (optional)
Timestamp of GET packet
Command (containing URL)
HTTP version
Return code
Document size

User identity (optional)
Browser information
Gateway or proxy server version

MIME type of document returned
URL linked from (i.e., the parent's URL)
Additional resolution of GET timestamp, giving milliseconds
Connect time, from GET packet to first response packet (in seconds)
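
As a rough illustration of the record layout above, the sketch below writes one record as a common log format line followed by the augmented fields. The field names, the tab separator for the extra fields, and the use of "-" for missing values are assumptions made for illustration; the actual tools may lay the record out differently.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class LogRecord:
    # These fields map onto the common log format (CLF) line.
    client: Optional[str]
    timestamp: datetime
    command: str              # method and URL, e.g. "GET /index.html"
    http_version: str
    return_code: int
    size: int
    user: Optional[str] = None
    # Augmented fields (names are hypothetical).
    browser: Optional[str] = None
    proxy: Optional[str] = None
    mime_type: Optional[str] = None
    parent_url: Optional[str] = None
    connect_time: Optional[float] = None   # seconds


def format_record(r: LogRecord) -> str:
    """Render a CLF line, then the augmented fields separated by tabs."""
    def dash(value):
        return "-" if value is None else str(value)

    when = r.timestamp.strftime("%d/%b/%Y:%H:%M:%S %z")
    clf = (f'{dash(r.client)} - {dash(r.user)} [{when}] '
           f'"{r.command} {r.http_version}" {r.return_code} {r.size}')
    millis = r.timestamp.microsecond // 1000   # extra timestamp resolution
    extra = [dash(r.browser), dash(r.proxy), dash(r.mime_type),
             dash(r.parent_url), f"{millis:03d}", dash(r.connect_time)]
    return clf + "\t" + "\t".join(extra)
```

Keeping the CLF portion first means that tools written for ordinary server logs can still read the leading part of each line.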

How Network Monitoring Works

Attach one or more new machines, each called a monitor, to network with Web server of interest
Monitor captures connection request TCP packets
Decode HTTP packets in TCP packets
Generate "common log format" file with augmented fields
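
The decode step above can be sketched as follows, assuming the monitor has already extracted the TCP payload bytes of a client-to-server packet. The function name `decode_request` and the returned dictionary keys are hypothetical; this is not the decoder from the tool set.

```python
KNOWN_METHODS = {"GET", "HEAD", "POST"}


def decode_request(payload: bytes):
    """Parse an HTTP request line and headers from captured payload bytes.

    Returns None when the payload does not look like an HTTP request.
    The payload may be truncated by the capture, so the last header
    value can be incomplete.
    """
    head = payload.split(b"\r\n\r\n", 1)[0].decode("latin-1")
    lines = head.split("\r\n")
    parts = lines[0].split()
    if len(parts) < 2 or parts[0] not in KNOWN_METHODS:
        return None
    # A request line without a version is treated as HTTP/0.9.
    version = parts[2] if len(parts) > 2 else "HTTP/0.9"
    headers = {}
    for line in lines[1:]:
        name, sep, value = line.partition(":")
        if sep:
            headers[name.strip().lower()] = value.strip()
    return {
        "method": parts[0],
        "url": parts[1],
        "version": version,
        "browser": headers.get("user-agent"),
        "parent_url": headers.get("referer"),
    }
```

The "User-Agent" and "Referer" request headers are where the browser information and parent URL fields of the augmented log would come from.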

What Could You Do With Client Logs?

Everything you do with server logs
Create a taxonomy of the types of servers that users visit
With a session id added to HTTP:
Could show paths users take through hyperlinks to get to and through a Web site
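
If such a session id existed, path reconstruction could look roughly like the sketch below. It assumes each log record has been reduced to a tuple of (session id, timestamp, URL, parent URL); the session id is the proposed extension, not part of HTTP, and the function name is hypothetical.

```python
from collections import defaultdict


def session_paths(records):
    """Group records by session and list each hyperlink step in time order.

    records: iterable of (session_id, timestamp, url, parent_url) tuples.
    Returns {session_id: [(parent_url, url), ...]}.
    """
    paths = defaultdict(list)
    ordered = sorted(records, key=lambda r: (r[0], r[1]))
    for session_id, _timestamp, url, parent_url in ordered:
        paths[session_id].append((parent_url, url))
    return dict(paths)
```

Each (parent, URL) pair is one traversed link, so the list for a session reads as the route that user took into and through a site.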

Application of Automated Monitoring

Proxy cache performance study with 5 workloads:
- Multimedia classroom
- Undergraduate Computer Science lab
- Graduate Computer Science lab
- All client traffic on department backbone
- All Web server traffic on department backbone
Planned studies:
- Community network: Blacksburg Electronic Village
- Collaborative Web tool in K-12 public schools
- Non-technical departments on campus

Privacy Issues in Our Monitoring Activities

Raw logs (with client host names, URLs) are considered confidential
Dedicated monitor machine facilitates privacy
We can encode client and URL fields, to distribute logs to researchers
We only publish demographic information with permission of subjects
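
The slides do not say how the client and URL fields are encoded. One common approach, shown here purely as an assumption, is to replace each value with a keyed hash so that equal values stay equal (keeping the log useful for analysis) while the originals cannot be read back without the key.

```python
import hashlib
import hmac


def encode_field(value: str, key: bytes, length: int = 12) -> str:
    """Return a short keyed-hash token standing in for a log field."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:length]


def encode_record(client: str, url: str, key: bytes):
    """Encode the two confidential fields of one log record."""
    return encode_field(client, key), encode_field(url, key)
```

The key would stay on the dedicated monitor machine; researchers receive only the encoded log.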

HTTP Protocol Suggestions

We need:
1. Standard syntax for "User-Agent" and "Server" fields
2. User session id in every HTTP header
  (use browser process id)
Others may need:
1. Add non-loggable bit
2. New packet type that lets a server ask a proxy to send the request hits for that server only

Tool Set Available

HTTP protocol decode
Common log file generator
Chitra, a tool for trace visualization and statistical analysis for log files

See http://www.cs.vt.edu/~chitra/www.html.

Appendix -- More Details

Details of Method

Use tcpdump to capture initial 512 bytes of data of each TCP packet establishing:
1. client to HTTP server connection, or
2. HTTP server to client connection
Stream tcpdump output through HTTP protocol decode and generate intermediate log file
Postprocessing matches HTTP GETs with replies to construct the "common log format" file
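
The matching step can be sketched as below, assuming the intermediate log is a time-ordered list of decoded events, each tagged with its TCP connection as a (client address, client port, server address, server port) tuple. The event layout and the function name are hypothetical, not the format used by the tools.

```python
def match_requests(events):
    """Pair each request with the first reply seen on the same connection.

    events: time-ordered dicts with keys "kind" ("request" or "reply"),
    "conn", and "time"; requests also carry "url", replies carry
    "status" and "size".
    """
    pending = {}    # connection tuple -> request awaiting its reply
    records = []
    for ev in events:
        if ev["kind"] == "request":
            pending[ev["conn"]] = ev
        elif ev["kind"] == "reply" and ev["conn"] in pending:
            req = pending.pop(ev["conn"])
            records.append({
                "client": ev["conn"][0],
                "time": req["time"],
                "url": req["url"],
                "return_code": ev["status"],
                "size": ev["size"],
                # Time from the GET packet to the first response packet.
                "connect_time": ev["time"] - req["time"],
            })
    return records
```

Replies with no matching request, such as those whose GET was seen before monitoring started, are dropped in this sketch.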

Volume of Collected Data

Department backbone traffic workload:

80Mb/day, raw
80MB/mo, to disk
50Mb/mo, CLF

Graph of Daily WWW Traffic on CS Backbone vs. time omitted

Error Sources in Resultant Log

Dynamically created Web pages are assigned size zero in HTTP header...
Only important if you want to analyze document sizes
An occasional HTTP server doesn't know what time it is...
Only important if you want to analyze log by time of day, day of week, etc.