Knuth, 1973 
Minimizing the amount of overhead imposed by profiling is critically important for the clever compiler, since in the quasistatic computing model, the end-user (and not just the programmer) will always be subjected to that overhead.
How to instrument a program to extract performance information is a major topic in its own right; I briefly describe the commonly available profiling mechanisms available to C programmers on UNIX platforms --- prof, gprof, and pixie --- in Appendix . The different techniques have their pros and cons, but generally speaking, there's a tradeoff between the amount and accuracy of information gathered, and the overhead incurred at program run-time. The percentages shown in Table are in accord with the ``common wisdom'' --- 5--10% for prof, 10--15% for gprof, and 100--200% for pixie.
Note that the tremendously outsized amount of overhead for pixie is due to the fact that instrumenting every single basic block gathers far more information than is necessary most of the time. One way to keep overhead reasonable is to gather exactly what is needed to perform the desired performance analysis and no more; by building the static call graph beforehand, one can reduce the amount of instrumentation added to the program, and still gather a basis set of data that can be used to generate a full weighted call graph. [,] However, even the overhead required to gather such a basis set can still be improved upon: although these general-purpose tools cannot a priori predict what information the human programmer will be interested in, most of the time the human programmer doesn't really need to see the entire weighted call graph.... However, an user interface that allows the programmer to specify what profiling information is actually desired, thus potentially reducing the run-time overhead, has not been considered or implemented, to my knowledge; not surprising, since under the current paradigm for how these tools are used, even a several-fold performance hit may be acceptable if profiling is only turned on briefly to run the program against some small but hopefully representative sample of program inputs.
Table: Profiling overhead for eqntott run on the reference SPECint92 input file.
Existing profiling mechanisms have several drawbacks when viewed in the light of their use by a smart compiler in a quasistatic computing environment. To minimize run-time overhead, the profiling mechanism should permit very selective instrumentation, at exactly the points of interest. None of the mechanisms mentioned up to this point make selective instrumentation easy.
In addition, the mentioned mechanisms all attempt to measure or estimate CPU time spent executing the program; however, what the typical user today really cares about is clock time, not CPU time (the days of being charged by how many seconds of CPU time are used being long past for most users). For example, in the load-balancing example on page , the goal of the clever compiler should not be to choose the balancing which minimizes the amount of local computation done, but rather the balancing which minimizes the time that the user has to wait for the results of the computation; otherwise, the quasistatic decision might end up always pushing all the work to the remote servers, even though this might make the user wait longer on average for the results of their computation.
The UNIX system call gettimeofday() could in theory provide the current clock time for profiling purposes; however, system calls are relatively expensive, and hence requiring that added instrumentation code perform a system call implies potentially a substantial overhead. Furthermore, frequently the granularity of the time value received back is inherently coarse due to the representation type of the time value, and is not guaranteed to be accurate even to the limit of the representation's resolution.
However, a number of newer processor architectures make the real time available to programs very inexpensively. For example, the new Intel x86 RDTSC instruction, very briefly mentioned on page 10-8 and A-7 in the Pentium architecture manual , and more throughly discussed in , is intended to provide a guaranteed monotonically increasing timestamp, but happens to be implemented on the Pentium as a count of the number of clock cycles elapsed since power up. Similarly, SPARC Technology Business indicates in a white paper that the UltraSPARC processor will have a similar feature:
However, using such timestamps to implement real-time stopwatches to profile program performance also has at least several major flaws. First, it introduces noise which would not be visible for profiling mechanisms attempting to measure CPU time spent executing the program. On a workstation running a multi-tasking operating system and possibly providing services to other machines on the network, the clock time a local-CPU-bound program takes to run to completion with identical inputs can easily fluctuate wildly, depending on other demands are being placed on the workstation. Fluctuations of an order of magnitude is not unheard of under generally heavy load conditions. Possible heuristics to deal with this and other sources of noise will be discussed later. Second, such timestamps are not yet available on most processors in use today, hence real-time profiling is not very portable.
Table: Gathering precise profiling data using RDTSC instruction.
Table demonstrate the kind of information which can be garnered using the RDTSC instruction on a Pentium (see also the source code in Figure on page ). Note that the real-time aspect means that the cache effects of calling a function for the first time as compared to later calls is easily visible; also, note that the overhead of the profiling code itself is more than an order of magnitude less than making system calls to get the current time, thus making this technique more practical for instrumenting code than using system calls to obtain the real time. Figure also uses three very useful features of the GNU gcc compiler being used as the back-end for my modified compiler:
Because of these gcc features, I could write functions to start, stop, and update a stopwatch counter mostly in C. To have a particular region of code profiled, I merely had to (write SUIF manipulation code to) bracket that region in the SUIF file with ``calls'' to qp_start() (start a stopwatch counter) and qp_stop() (stop and update a stopwatch counter). The output of s2c would then contain those function calls. However, when I then compiled the output of s2c with gcc after including the qprofiling.h header file with the inline-marked definitions for qp_start() and qp_stop(), gcc would take care of inlining the bodies of the functions wherever they were called. (If inlining were not done, the overhead of profiling a region of code would increase by 26 cycles, 13 cycles for each function call.) This was much simpler than having to bracket the region to be profiled with the bodies of qp_start() and qp_stop() directly in the SUIF file, and made debugging the contents of qp_start() and qp_stop() relatively painless.
Writing the add_profiling compiler pass was not difficult, given qp_start() and qp_stop(). The pass walks the AST for each function, searching for annotations which indicate a qif or qinfluence statement, and then brackets the appropriate clauses with the profiling ``calls'' --- the integer UID created by the integrate pass for each set of interacting constructs is used in the calls to designate which stopwatch counter should be started and stopped. The add_profiling pass also modifies the main() function in the SUIF file to call the initialization routine qp_initialize() and the termination routine qp_dump(); the latter saved profiling information to a file in raw form (to be processed into meaningful format later), and is also responsible for re-invoking the clever compiler when enough profiling information has been accumulated.
One complication only discovered during implementation was non-linear flow control; for example, a qinfluence statement's block may contain longjmp() calls, exit() calls, goto, continue, break, or return statements, any of which could cause the qp_stop() call matching the entrance qp_start() call to not be executed. Calls to exit() were particularly problematic, since they would cause the qp_dump() routine not to be executed, and no profiling information from the run would be saved. The current add_profiling pass has special handling for only one of these non-linear flow control cases, the return statement, which I expected to be the most commonly encountered. In principle, the others can be dealt with similarly. A return statement which appears inside a qif clause or qinfluence block of the form
is transformed into
The temporary is introduced so that if expression would take a long time to evaluate, the time spent would continue to be stopwatched. Any return statements in the main() function are similarly transformed to ensure that qp_dump() is called before program termination. See Figures and .
Figure: Example: hello.c.
Figure: hello.c after being processed by integrate, add_profiling, and select_alternatives passes.