Automated Network Monitoring
Text from presentation made at the M.I.T. Workshop on Internet Survey Methodology and Web Demographics, 29-30 January 1996, Cambridge, MA, USA
30 January 1996
Marc Abrams and Stephen Williams
Computer Science Department
Virginia Tech, Blacksburg, VA 24061-0106
abrams@vt.edu, williams@csgrad.cs.vt.edu
http://www.cs.vt.edu/~chitra/www.html
[Link to PostScript version of PowerPoint slides.]
Taxonomy
- Logging: Shows what people do
- Surveying: Shows who's doing it
The Four Logging Methods
- By "Footprints" (needs software on each client)
- At Proxy (only works with client preferences)
- At Server (easy, since Web servers do it)
- By Network Monitoring:
  (listen to all HTTP packets on a broadcast network)
Automated Network Logging
What Network Monitoring Can Do
- Transparency: No changes to any client or server are required
- Security: No one has access to data except owner of monitor machine (such as auditor)
- Performance: Monitor never writes packets
Dark Side of Network Monitoring
Employer could monitor what employees are doing.
(But did you realize this could already be happening without network monitoring?)
Up Side of Network Monitoring
- No sampling or self-selection problems: You get all document requests.
- Can compute a rating for set of monitored clients
- You get more information per URL than from server or proxy log
Log File Record
Our log records contain the following fields; the first six are in the common log format generated by popular Web servers.
Client machine (optional)
Timestamp of GET packet
Command (containing URL)
HTTP version
Return code
Document size
User identity (optional)
Browser information
Gateway or proxy server version
MIME type of document returned
URL linked from (i.e., the parent's URL)
Additional resolution of GET timestamp, giving milliseconds
Connect time, from GET packet to first response packet (in seconds)
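The record layout above could be modeled roughly as follows. This is only a sketch: the class name, field names, field order in the output line, and the quoting of the extra fields are assumptions for illustration, not the format the authors' tools actually emit.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class LogRecord:
    """One monitored request; all names here are illustrative assumptions."""
    client: Optional[str]              # client machine (optional)
    timestamp: datetime                # time of the GET packet
    command: str                       # e.g. "GET /index.html"
    http_version: str                  # e.g. "HTTP/1.0"
    return_code: int
    size: int                          # document size in bytes
    user: Optional[str] = None         # user identity (optional)
    browser: Optional[str] = None      # browser information
    proxy: Optional[str] = None        # gateway or proxy server version
    mime_type: Optional[str] = None    # MIME type of document returned
    parent_url: Optional[str] = None   # URL linked from
    get_millis: int = 0                # extra timestamp resolution (0-999)
    connect_seconds: float = 0.0       # GET packet to first response packet

    def to_line(self) -> str:
        """Common-log-format style prefix followed by the extra fields."""
        def opt(value):
            return "-" if value is None else str(value)

        stamp = self.timestamp.strftime("%d/%b/%Y:%H:%M:%S %z")
        base = (f'{opt(self.client)} - {opt(self.user)} [{stamp}] '
                f'"{self.command} {self.http_version}" '
                f'{self.return_code} {self.size}')
        extra = (f'"{opt(self.browser)}" "{opt(self.proxy)}" '
                 f'"{opt(self.mime_type)}" "{opt(self.parent_url)}" '
                 f'{self.get_millis} {self.connect_seconds:.3f}')
        return f"{base} {extra}"
```

Missing optional fields are written as "-", following the usual common log format convention, so a record without user identity or browser information still yields a line with a fixed number of fields.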
How Network Monitoring Works
- Attach one or more new machines, each called a monitor, to network with Web server of interest
- Monitor captures connection request TCP packets
- Decode HTTP packets in TCP packets
- Generate "common log format" file with augmented fields
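The decode step above might look something like the sketch below, which pulls the request line and headers out of the data carried by one captured TCP packet. The function name and return shape are hypothetical; the authors' decoder is not described at this level of detail.

```python
from typing import Dict, Optional, Tuple


def decode_request(payload: bytes) -> Optional[Tuple[str, str, str, Dict[str, str]]]:
    """Return (method, url, version, headers), or None if not an HTTP request."""
    # Only the header section matters; drop anything after the blank line.
    head = payload.split(b"\r\n\r\n", 1)[0]
    lines = head.decode("latin-1").split("\r\n")

    # Request line: METHOD URL HTTP/x.y
    parts = lines[0].split()
    if len(parts) != 3 or not parts[2].startswith("HTTP/"):
        return None
    method, url, version = parts

    headers: Dict[str, str] = {}
    for line in lines[1:]:
        name, sep, value = line.partition(":")
        if sep:  # skip lines without a colon (e.g. a header cut off by the capture)
            headers[name.strip().lower()] = value.strip()
    return method, url, version, headers
```

Fields such as browser information and the parent URL would then come from headers like User-Agent and Referer. Because a capture may keep only the start of each packet, the last header seen can be incomplete, which the sketch does not try to detect.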
What Could You Do With Client Logs?
- Everything you do with server logs
- Create a taxonomy of the types of servers that users visit
- With a session id added to HTTP:
  Could show paths users take through hyperlinks to get to and through a Web site
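As a rough illustration of the path idea, assume each log record carried a session id (which HTTP does not provide today) along with the parent URL and the requested URL. The sketch below, with hypothetical names throughout, groups time-ordered hits into click paths per session.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Tuple

# (session id, parent URL or None, requested URL); the session id is hypothetical.
Hit = Tuple[str, Optional[str], str]


def session_paths(hits: Iterable[Hit]) -> Dict[str, List[List[str]]]:
    """Group time-ordered hits into per-session click paths."""
    paths: Dict[str, List[List[str]]] = defaultdict(list)
    for session, parent, url in hits:
        trails = paths[session]
        if trails and parent is not None and trails[-1][-1] == parent:
            trails[-1].append(url)   # link followed from the page just viewed
        else:
            trails.append([url])     # typed URL, bookmark, or a jump back
    return dict(paths)
```

A hit whose parent is not the page most recently viewed in that session starts a new path, so backtracking shows up as several short paths rather than one tree. A fuller analysis might keep the whole tree instead.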
Application of Automated Monitoring
- Proxy cache performance study with 5 workloads:
  - Multimedia classroom
  - Undergraduate Computer Science lab
  - Graduate Computer Science lab
  - All client traffic on department backbone
  - All Web server traffic on department backbone
- Planned studies:
  - Community network: Blacksburg Electronic Village
  - Collaborative Web tool in K-12 public schools
  - Non-technical departments on campus
Privacy Issues in Our Monitoring Activities
- Raw logs (with client host names, URLs) are considered confidential
- Dedicated monitor machine facilitates privacy
- We can encode client and URL fields, to distribute logs to researchers
- We only publish demographic information with permission of subjects
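The talk does not say how the client and URL fields are encoded. One plausible approach, shown here purely as an assumption, is a keyed hash: equal values map to equal tokens, so request counts and repeat visits remain analyzable, while the original names cannot be read from the distributed log without the key.

```python
import hashlib
import hmac


def encode_field(value: str, key: bytes, length: int = 16) -> str:
    """Replace a client name or URL with a stable, keyed pseudonym.

    HMAC-SHA256 is an illustrative choice, not the authors' stated method.
    The key stays on the monitor machine and is never distributed.
    """
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:length]
```

Truncating the digest keeps log lines short at the cost of a small collision risk; the length parameter is a hypothetical knob for that trade-off.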
HTTP Protocol Suggestions
- We need:
  - Standard syntax for "User Agent" and Server fields
  - User session id in every HTTP header
    (use browser process id)
- Others may need:
  - Add non-loggable bit
  - New packet type that lets a server ask a proxy to send the request hits for that server only
Tool Set Available
- HTTP protocol decode
- Common log file generator
- Chitra, a tool for trace visualization and statistical analysis for log files
See http://www.cs.vt.edu/~chitra/www.html.
Appendix -- More Details
Details of Method
- Use tcpdump to capture initial 512 bytes of data of each TCP packet establishing:
  - client to HTTP server connection, or
  - HTTP server to client connection
- Stream tcpdump output through HTTP protocol decode and generate intermediate log file
- Post processing matches HTTP GETs with replies to construct "common log format" file
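The post-processing step could be sketched as below. The layout of the intermediate log is not given in the talk, so the Event type and its fields are assumptions. Matching is keyed on the client and server address/port pair, which works when each connection carries a single request.

```python
from typing import Dict, Iterable, List, NamedTuple, Tuple


class Event(NamedTuple):
    """One decoded packet from the intermediate log (hypothetical layout)."""
    time: float                # capture time in seconds
    kind: str                  # "GET" or "REPLY"
    client: Tuple[str, int]    # (client address, client port)
    server: Tuple[str, int]    # (server address, server port)
    url: str = ""
    status: int = 0
    size: int = 0


def match_events(events: Iterable[Event]) -> List[dict]:
    """Pair each GET with the first reply seen on the same connection."""
    pending: Dict[Tuple, Event] = {}
    rows: List[dict] = []
    for ev in sorted(events, key=lambda e: e.time):
        key = (ev.client, ev.server)
        if ev.kind == "GET":
            pending[key] = ev
        elif ev.kind == "REPLY" and key in pending:
            get = pending.pop(key)   # later replies on this connection are ignored
            rows.append({
                "client": get.client[0],
                "time": get.time,
                "url": get.url,
                "status": ev.status,
                "size": ev.size,
                "connect_seconds": ev.time - get.time,
            })
    return rows
```

Replies with no pending GET are dropped, as are GETs that never receive a reply. A real tool would probably report both cases.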
Volume of Collected Data
Department backbone traffic workload:
- 80Mb/day, raw
- 80MB/mo, to disk
- 50Mb/mo, CLF
Error Sources in Resultant Log
- Dynamically created Web pages are assigned size zero in HTTP header...
  Only important if you want to analyze document sizes
- An occasional HTTP server doesn't know what time it is...
  Only important if you want to analyze log by time of day, day of week, etc.