Footnotes

...group

Which evolved from the Transit Project: for more information, see the World Wide Web URL http://www.ai.mit.edu/projects/transit/rc_home_page.html (UPDATED).

...compilation

The term ``compilation'' is used loosely here; even if the software package is written in an interpreted language, there is still a point in time when conceptually the program is no longer viewed as source code but as an application to be used.

...process,

In which the programmer examines reams of profiler output to evaluate the resulting program's performance, formulates alternative code for ``hot spots'', and iterates, and iterates, and iterates....

...effort.

Another benefit of annotations that let the programmer explicitly encode different alternatives in the source code is that they document, in a structured fashion, which alternatives and trade-offs the programmer has already considered.

...2

Or the less general but more common #ifdef directive.

...compiler,

Although the programmer certainly should choose a name which he or she finds semantically meaningful, for code readability.

...tag

For convenience, it may be desirable to introduce additional syntax, e.g. unique:LARGE_SORT, to allow the programmer to specify that different call sites should be analyzed separately, but still use the same identifier that has semantic meaning to a programmer reading the source code.

...example.

See [], which presents Typesetter, a system specifically to select between different implementations of abstract data structures based on profiling feedback. Also, the documentation for libg++, a freely distributable C++ library provided by the Free Software Foundation for use with the GNU C++ compiler, includes some pragmatic discussion and statistics about the expected efficiencies of the various operations on different implementations of the abstract data types. It also discusses the improvement in performance when the programmer knows and indicates in the source code exactly which representation is being used for a particular variable of an abstract type at a given point in the code, so that the compiler can skip generating run-time checks to determine the actual representation and instead generate code to immediately dispatch to the appropriate implementation.

...characteristics

Static on any given machine, sans field upgrades, but widely varying on different implementations of a binary-compatible instruction set architecture.

...package.

Derived from the MIT LCS c-parser, available as ftp://theory.lcs.mit.edu/pub/c2c/c2c-0-7-3.tar.Z.

...annotated

SUIF supports attaching annotations containing arbitrary data to the structures which represent various parts of the parsed C program; annotations can be attached not just to nodes in the abstract syntax trees (representing the statements and expressions), but also to declarations and definitions in symbol tables.

...platforms

Profilers for MS-DOS and Windows platforms are likely to be using similar techniques.

...expensive,

For example, actual timings of 1,000,000 iterations (normalized for the loop iteration overhead) indicate 282 elapsed CPU clock cycles per call to time(NULL); 876 cycles per call to gettimeofday(&tv,NULL); and 766 cycles per call to getrusage(RUSAGE_SELF, &ru). Much of this is probably due to context switches; by contrast, it costs 9 clock cycles per call to a dummy() function that simply returns its input parameter --- and viewing the assembly code output verifies that gcc has not merely silently inlined the function call.
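
A minimal sketch of this kind of measurement, assuming portable clock() timing in CPU seconds rather than the cycle counts reported above. The names probe_fn, call_time, loop_seconds and seconds_per_call are hypothetical, and the subtraction removes only the cost of the empty loop, so the result still includes the cost of the indirect call through the function pointer.

```c
#include <stddef.h>
#include <time.h>

/* Hypothetical probe type: a routine whose per-call cost we want to estimate. */
typedef void (*probe_fn)(void);

static volatile time_t sink;   /* keeps the compiler from discarding the result */

/* Example probe: one call to time(NULL). */
void call_time(void) { sink = time(NULL); }

/* CPU seconds spent running n loop iterations; fn may be NULL for an empty loop. */
static double loop_seconds(probe_fn fn, long n)
{
    clock_t start = clock();
    for (volatile long i = 0; i < n; i++) {
        if (fn)
            fn();
    }
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

/* Average CPU seconds per call of fn, normalized for the empty-loop overhead.
   Clamped at zero because clock() is coarse and the difference can be noisy. */
double seconds_per_call(probe_fn fn, long n)
{
    double empty = loop_seconds(NULL, n);
    double full = loop_seconds(fn, n);
    double net = full - empty;
    return net > 0.0 ? net / (double)n : 0.0;
}
```

Calling seconds_per_call(call_time, 1000000) mirrors the iteration count used above; multiplying by the clock rate of the machine would give an approximate cycle count.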

...instruction

Actual timings (normalized for the loop iteration overhead) of 1,000,000 iterations of the RDTSC instruction indicate that it executes in 11 CPU clock cycles on the Pentium implementation.

...paper

http://www.sun.com/sparc/Performance.html (UPDATED).

...painless.

However, see Table [] on page [] and the related discussion in the text for further details on the performance of this implementation.

...saved.

Although I have not implemented this, it has come to my attention that a number of C compilers provide a library routine on_exit() which allows the registration of routines to be run when exit() is called. That would have dethorned this particular problem, but the more general problem still remains.
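
A minimal sketch of the registration idea, assuming the ISO C routine atexit() in place of on_exit() (which is not available everywhere). The counter call_count and the functions save_profile, count_call and register_profile_saver are hypothetical names for illustration, not part of the system described here.

```c
#include <stdio.h>
#include <stdlib.h>

static unsigned long call_count;   /* hypothetical profiling counter */
static int hook_registered;        /* avoids registering the handler twice */

/* Runs when exit() is called or main() returns: dump what was gathered. */
static void save_profile(void)
{
    fprintf(stderr, "profile: call_count=%lu\n", call_count);
}

/* Stand-in for instrumentation code updating the profile. */
void count_call(void) { call_count++; }

/* Returns 0 on success, -1 if atexit() refused the registration. */
int register_profile_saver(void)
{
    if (hook_registered)
        return 0;
    if (atexit(save_profile) != 0)
        return -1;
    hook_registered = 1;
    return 0;
}
```

Handlers registered this way do not run if the process ends through _exit() or abort(), so a sketch like this only covers the orderly exit() path.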

...incomplete:

I decided that having some examples written up would be more interesting than having a more sophisticated search algorithm implemented.

...system

More specifically, by system I mean the qsort() implementation located in the default libc.a library used by the default linker when invoked by the C compiler driver.

...incorrect.

This is worth noting, because it is important that the programmer supplying alternative code ensure that the alternatives are semantically equivalent. However, sometimes programmers and users are prone to rely on non-guaranteed semantics of a particular implementation --- such as the sort stability of qsort(), for example.
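
One way to avoid relying on non-guaranteed stability can be sketched as follows: record each element's original position and use it as a tie-breaker, so that any conforming qsort() yields the same order. The struct item layout and the function names are assumptions made for this illustration.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical record: sorted by key; tag is payload; seq is the original position. */
struct item {
    int key;
    char tag;
    size_t seq;
};

/* Compare by key first, then by original position so that ties are never "equal". */
static int cmp_key_then_seq(const void *a, const void *b)
{
    const struct item *x = a;
    const struct item *y = b;
    if (x->key != y->key)
        return (x->key > y->key) - (x->key < y->key);
    return (x->seq > y->seq) - (x->seq < y->seq);
}

/* Sorts v by key, keeping equal keys in their input order on any qsort(). */
void stable_sort_items(struct item *v, size_t n)
{
    for (size_t i = 0; i < n; i++)
        v[i].seq = i;
    qsort(v, n, sizeof v[0], cmp_key_then_seq);
}
```

The cost is one extra field per element and one extra comparison on ties, in exchange for alternatives that stay semantically equivalent across qsort() implementations.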

...seconds.

Correctness was partially verified by checking the output of eqntott using the Linux system qsort() against the output of eqntott using an insertion sort; the outputs were identical.
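
The same style of cross-check can be sketched on a small scale: sort two copies of the input, one with qsort() and one with a simple insertion sort, and compare the results. This sketch uses an array of int rather than eqntott's data, and the names cmp_int, insertion_sort_ints and sorts_agree are hypothetical.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a;
    int y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Reference implementation: plain insertion sort, easy to check by inspection. */
static void insertion_sort_ints(int *v, size_t n)
{
    for (size_t i = 1; i < n; i++) {
        int x = v[i];
        size_t j = i;
        while (j > 0 && v[j - 1] > x) {
            v[j] = v[j - 1];
            j--;
        }
        v[j] = x;
    }
}

/* Returns 1 if qsort() and the insertion sort produce identical output, else 0.
   Also returns 0 if the scratch copies cannot be allocated. */
int sorts_agree(const int *input, size_t n)
{
    if (n == 0)
        return 1;
    int *a = malloc(n * sizeof *a);
    int *b = malloc(n * sizeof *b);
    int same = 0;
    if (a && b) {
        memcpy(a, input, n * sizeof *a);
        memcpy(b, input, n * sizeof *b);
        qsort(a, n, sizeof *a, cmp_int);
        insertion_sort_ints(b, n);
        same = memcmp(a, b, n * sizeof *a) == 0;
    }
    free(a);
    free(b);
    return same;
}
```

As with the check described above, agreement on particular inputs is only partial evidence of correctness.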

...usage.

By complication, I mean that the library writer could implement two functions qsort1() and qsort2(), and document that the latter is preferred over the former for expensive comparison functions for faster performance, but otherwise the two behave similarly. This complicates the semantics since the application programmer now has to think about which function he or she wants to use from each different call site in the program.

...it.

It would be even nicer if the compiler could be instructed to generate and try all the different permutations for the programmer, but this would require additional special quasistatic syntax. Note that GNU gcc already supports an extension to C which allows the programmer to designate any programmer-defined function as being const, i.e., side-effect free and dependent on the input arguments only. This extension currently allows gcc to perform common subexpression elimination and loop hoisting on such functions; however, this can also be used to get us part of the way to a compiler which understood when it could rearrange boolean expressions to arrive at a semantically equivalent formulation with faster average performance.
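
A minimal sketch of the gcc extension mentioned here, using the __attribute__((const)) spelling. The function popcount8 and the two hand-written clause orderings are hypothetical; in this sketch the programmer, not the compiler, writes out both orderings, which are interchangeable only because neither clause has side effects.

```c
/* GCC extension: the result depends only on the argument and there are
   no side effects, so repeated calls with the same input may be merged. */
__attribute__((const)) int popcount8(unsigned char x)
{
    int n = 0;
    while (x) {
        n += x & 1;
        x >>= 1;
    }
    return n;
}

/* Cheap clause first: short-circuits without calling popcount8 when a == b. */
int cheap_first(unsigned char a, unsigned char b)
{
    return (a == b) || (popcount8(a) == popcount8(b));
}

/* Same boolean expression with the clauses in the other order. */
int expensive_first(unsigned char a, unsigned char b)
{
    return (popcount8(a) == popcount8(b)) || (a == b);
}
```

Which ordering is faster on average depends on how often each clause is true for the actual inputs, which is the kind of question profiling feedback could answer.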

...accuracy

Mostly, detecting when round-off errors in performing computations are likely to make the numeric results meaningless. Numerical accuracy issues will not be considered for the remainder of this section.

...Figure [matrixtest].

Observation: once again the qif statement syntax is not powerful enough to permit an elegant expression. What we'd like to be able to do is to tell the compiler that there are N predicates and N actions to take should the corresponding predicate be true, plus a default should none of the other predicates be true. Then the compiler could construct all possible orders in which to evaluate all possible subsets of the predicates. This is similar to the boolean clause ordering problem with eqntott.
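
The desired construct can be approximated in plain C as a table of predicate/action pairs plus a default, evaluated in whatever order (and over whatever subset) is selected. This is only a sketch: struct guarded, dispatch and the example functions are hypothetical, and it assumes, as in the example, that every special-case action computes the same result as the default would, so that any order or subset is semantically equivalent.

```c
#include <stddef.h>

typedef int (*predicate_fn)(int);
typedef int (*action_fn)(int);

/* One special case: if pred(x) holds, act(x) is a cheaper way to get the answer. */
struct guarded {
    predicate_fn pred;
    action_fn act;
};

/* Test the guards named by order[0..n-1] in that order; fall back to the default. */
int dispatch(const struct guarded *g, const size_t *order, size_t n,
             action_fn fallback, int x)
{
    for (size_t i = 0; i < n; i++) {
        const struct guarded *c = &g[order[i]];
        if (c->pred(x))
            return c->act(x);
    }
    return fallback(x);
}

/* Example: cubing, with shortcuts for 0 and 1. */
int is_zero(int x) { return x == 0; }
int is_one(int x) { return x == 1; }
int cube_of_zero(int x) { (void)x; return 0; }
int cube_of_one(int x) { (void)x; return 1; }
int cube(int x) { return x * x * x; }
```

A search over orderings would then amount to timing dispatch() with different order arrays, including shorter ones that omit some predicates entirely.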

...stages

Also, a number of operating systems' linkers will rearrange functions and static data structures to reduce the number of cache collisions based on profiling feedback; some third-party products can perform this same function on already-linked executables.

...MacOS).

Out of these example systems, some versions of DOS do at least provide a BASIC interpreter, and OS/2 provides a REXX interpreter.

...portable.

There are minor problems with transporting intermediate format files from one platform to another because the intermediate format files contain declarations from system header files, the exact types of which may vary slightly from platform to platform.

...function

Possibly under some implementations the additional prologue code and the storage slot created even for prof-style profiling are a no-op and go unused, respectively.

...architecture.

For example, [] spends several chapters discussing the dramatically different performance characteristics of instruction sequences on an Intel i386 vs. i486 vs. Pentium. An i486 is RISC-like in that many more instructions will execute in just one cycle than on an i386, and the Pentium is dual-issue superscalar; any pixie-like tool would have to take this into account in order to calculate meaningful ideal times, lest a sequence of instructions carefully tuned for the Pentium's two pipelines be unfairly assigned far more time than it actually takes to execute because pixie was using cycle counts for instructions executing on an i386. Along similar lines, [,] look at instruction set utilization and the difference in SPEC92 benchmark results for MIPS- and SPARC-based workstations when compilers were given command-line options to permit them to assume later implementations of those chip architectures.

...table.

qsort() is not sufficiently parameterized; if qsort()'s interface had included passing in a function to perform element swapping, then it would be possible to use qsort() itself for this purpose.
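
A sketch of what such a parameterized interface might look like, under stated assumptions: the sort touches the data only through a comparison callback and a swap callback, both taking element indices. The interface, the names, and the example of keeping two parallel arrays in step are all hypothetical, and the body is a plain insertion sort rather than the algorithm of any real qsort().

```c
#include <stddef.h>

/* Callbacks work on element indices; ctx carries whatever data they need. */
typedef int (*index_cmp)(size_t i, size_t j, void *ctx);   /* <0, 0 or >0 */
typedef void (*index_swap)(size_t i, size_t j, void *ctx);

/* Insertion sort over indices 0..n-1, moving data only via the swap callback. */
void sort_with_swap(size_t n, index_cmp cmp, index_swap swap, void *ctx)
{
    for (size_t i = 1; i < n; i++)
        for (size_t j = i; j > 0 && cmp(j - 1, j, ctx) > 0; j--)
            swap(j - 1, j, ctx);
}

/* Example context: keys to sort by, plus a parallel array that must follow them. */
struct pair_ctx {
    int *keys;
    char *labels;
};

int cmp_keys(size_t i, size_t j, void *ctx)
{
    struct pair_ctx *p = ctx;
    return (p->keys[i] > p->keys[j]) - (p->keys[i] < p->keys[j]);
}

/* Swapping both arrays together is what a fixed memcpy-style swap cannot do. */
void swap_both(size_t i, size_t j, void *ctx)
{
    struct pair_ctx *p = ctx;
    int k = p->keys[i];
    char l = p->labels[i];
    p->keys[i] = p->keys[j];
    p->labels[i] = p->labels[j];
    p->keys[j] = k;
    p->labels[j] = l;
}
```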

Reinventing Computing, MIT AI Lab. Author: pshuang@ai.mit.edu (Ping Huang)