Much of scientific and engineering computation boils down to calculations on linear systems of equations, commonly represented as matrices. Hence there is great interest in developing sophisticated packages with routines carefully tuned for accuracy and performance. Numerical Recipes, a popular resource for those interested in numerical computation, mentions that such packages (e.g., LINPACK, NAG) often provide many versions of particular routines, specialized for particular matrix forms (e.g., symmetric, triangular) and faster than the general routine in those cases. One can imagine writing code that uses predicates to check for special forms and thus takes advantage of the special-form routines, even when one is uncertain in advance which special forms the matrices being dealt with might conform to --- see Figure .
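To make the pattern concrete, a minimal sketch of such a dispatch chain follows; the particular predicates and specialized routines shown (is_symmetric_matrix(), LU_decompose_symmetric(), and so on) are illustrative names, not the interface of any actual library.

\begin{verbatim}
/* Minimal sketch of run-time dispatch over special matrix forms.
 * The predicate and specialized-routine names are hypothetical;
 * only LU_decompose_general() is assumed to be always applicable. */
void LU_decompose(double *A, int n)
{
    if (is_symmetric_matrix(A, n))
        LU_decompose_symmetric(A, n);     /* specialized, faster on this form */
    else if (is_triangular_matrix(A, n))
        LU_decompose_triangular(A, n);
    else if (is_banded_matrix(A, n))
        LU_decompose_banded(A, n);
    else
        LU_decompose_general(A, n);       /* always-correct fallback */
}
\end{verbatim}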
Is coding in this fashion worthwhile? Suppose that, in writing number-crunching applications, the programmer knows absolutely whether matrix A will take on a special form or not. If so, then testing at all is pointless --- the programmer can directly call the correct special-form routine, or the general routine if no special form applies. However, if there is some uncertainty in the programmer's mind, as is likely to be the case for someone writing, say, an application-area-specific library, then which strategy --- Figure , Figure , or just always calling LU_decompose_general() directly --- gives the best performance requires more careful consideration. The writers of libraries such as LINPACK and NAG were presumably reluctant to put additional ``unnecessary'' checking overhead into the general routines themselves, thinking ``Why should we slow down the general library routine with checks for special cases every time, especially if the general-form algorithm will always give the correct answer anyway? Let the users check for special cases and call the specialized routines if they want.'' However, the user may similarly be reluctant to slow down his or her code by writing something like Figure . Even if each of the is_*_matrix() predicate functions is cheap relative to the decomposition itself, string enough tests together and there will be a noticeable slow-down.
On the other hand, in the quasistatic version in Figure , the price of even a single test continues to be paid at equilibrium (i.e., after the initial experimentation phase) only if matrix A often takes one of the special forms; and if matrix A is never any of the special forms being tested for, then no testing for special forms will be performed at all at equilibrium. Furthermore, note that a typical input matrix might simultaneously satisfy several special forms, and in such cases the quasistatic version would automatically pick out the specialized routine that performs best on the statistically likely inputs. However, the quasistatic version has the drawback that if the input matrix is one special form about half the time and another special form the other half of the time (i.e., the entropy of the question ``What special form is this matrix?'' is high), then for Figure the clever compiler must choose the clause for one of the special forms over the other; better performance could be had by paying the price of an extra testing predicate and checking for both equally likely special forms before defaulting to the general form. It would therefore be advantageous to include at least one more clause, one which simply contains the chain of tests in Figure . Also, whereas in Figure the price for considering another special form is an extra predicate evaluation during every execution of this piece of code, the price in the quasistatic case is that there are more versions of the program for the clever compiler to try out, and it will therefore take longer for the clever compiler to converge on a reasonably optimal program.
Figure: Selecting at run-time between all the special-case matrix routines.
Figure: Quasistatically selecting between special-case matrix routines.
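As a purely conceptual sketch (not the actual quasistatic syntax of the figure), each clause in the quasistatic version can be thought of as pairing one cheap predicate with one specialized routine; the clause structure and routine names below are illustrative assumptions only.

\begin{verbatim}
/* Conceptual sketch only; not the actual quasistatic syntax.  The clever
 * compiler is imagined to try each clause experimentally and settle on
 * whichever performs best on the inputs observed at this call site. */
static void clause_symmetric(double *A, int n)
{
    if (is_symmetric_matrix(A, n))
        LU_decompose_symmetric(A, n);
    else
        LU_decompose_general(A, n);
}

static void clause_triangular(double *A, int n)
{
    if (is_triangular_matrix(A, n))
        LU_decompose_triangular(A, n);
    else
        LU_decompose_general(A, n);
}

static void clause_general_only(double *A, int n)
{
    LU_decompose_general(A, n);   /* no testing overhead at all */
}
\end{verbatim}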
A well-known optimization technique is blocking: given multiple levels of memory hierarchy with different latencies, blocking improves the reuse of data loaded into the faster levels of the hierarchy and thus overall program throughput. One application of blocking is to matrix multiplication: Figure contains a simple implementation of blocked matrix multiplication for illustration purposes. In essence, the BlockedMultiply routine takes $N \times N$ matrices and chops them up into blocks of size $B \times B$; at any given time, the routine places less demand on the cache than an unblocked version, because in the innermost loop it only accesses information from $B \times B$-sized sub-matrices (blocks) of the source matrices X and Y and the destination matrix Z. Analyses of blocked algorithms observe that in the worst case (no blocking), on the order of $N^3$ words may have to be accessed from main memory, but if B is chosen such that a block of Y and a row of length B of matrix Z can fit in the cache without interference, then the number of memory accesses drops by roughly a factor of B. Clearly, then, it is desirable to maximize the value of B; the speed of microprocessors has increased in the last few years to the point where the number of main memory accesses (i.e., cache misses), and not the actual computation on the data retrieved, can prove to be the performance bottleneck. The same analyses go on to develop models of cache interference for such blocked algorithms, in order to choose the block size in a more sophisticated way that maximizes reuse; the rationale for such work is that simply choosing B so that a block plus a row of length B just barely fits within the cache (i.e., considering size only) was demonstrated to be inadequate, as data self-interference under certain circumstances greatly reduces the cache hit rate and thereby the effective throughput.
Figure: Blocked matrix multiply implementation.
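A minimal sketch of a blocked multiply of the kind described above is given below; the row-major layout, argument order, and the assumption that Z has been zeroed by the caller are illustrative choices, not necessarily the details of the figure.

\begin{verbatim}
/* Sketch of a blocked matrix multiply: Z += X * Y for row-major N x N
 * matrices, blocking so that a B x B block of Y and a length-B row
 * segment of Z are reused while they remain in the cache. */
void BlockedMultiply(int N, int B, const double *X, const double *Y, double *Z)
{
    for (int jj = 0; jj < N; jj += B)            /* block of columns of Y, Z */
        for (int kk = 0; kk < N; kk += B)        /* block of rows of Y */
            for (int i = 0; i < N; i++)
                for (int k = kk; k < kk + B && k < N; k++) {
                    double x = X[i * N + k];
                    for (int j = jj; j < jj + B && j < N; j++)
                        Z[i * N + j] += x * Y[k * N + j];
                }
}
\end{verbatim}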
The results in Table show (for both integer and double-precision floating point numbers) that as the blocking size increases, the number of CPU cycles elapsed before the computation finishes decreases nearly monotonically, with most of the performance benefit achieved at relatively small values of the block size. Surprisingly, even after I went past the block size at which the cache should have overflowed (for example, the integer B=256 case should have just about overflowed the cache, since a 256x256 block of 32-bit integers takes up 256KB of memory), we see that performance was hardly impacted at all when we moved to B=512. This argues for experimental verification of timings on the actual machines where the code will be run, as this is a result that would not have been expected from the models --- my speculation is that cache pre-fetching was sufficiently effective on this machine that the additional memory demands of too large a blocking size did not create a bottleneck. In any case, the clever compiler, when allowed to choose from a wide range of blocking values as expressed by a qint parameter, can experimentally determine that it is safe to use very large blocks. However, another machine which was also i386-binary compatible, but with different latencies and bandwidths for transfers between the cache and main memory, might well have choked on the larger block values; hence it is desirable to set the block size quasistatically rather than statically.
Table: Multiplying 512x512 matrices with different block sizes.
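A hand-rolled version of the kind of experiment the clever compiler is imagined to perform over a qint-valued block size might look like the following; the candidate block sizes, the use of clock_gettime(), and the driver structure are choices made here for illustration only, not the actual measurement harness behind the tables.

\begin{verbatim}
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* BlockedMultiply as sketched earlier; assumed to be linked in. */
void BlockedMultiply(int N, int B, const double *X, const double *Y, double *Z);

int main(void)
{
    enum { N = 512 };
    static const int candidates[] = { 4, 8, 16, 32, 64, 128, 256, 512 };
    double *X = malloc(sizeof(double) * N * N);
    double *Y = malloc(sizeof(double) * N * N);
    double *Z = malloc(sizeof(double) * N * N);
    if (!X || !Y || !Z) return 1;
    for (int i = 0; i < N * N; i++) { X[i] = 1.0; Y[i] = 2.0; }

    /* Time one multiply per candidate block size to find the knee of the
     * curve on this particular machine. */
    for (size_t c = 0; c < sizeof candidates / sizeof candidates[0]; c++) {
        int B = candidates[c];
        for (int i = 0; i < N * N; i++) Z[i] = 0.0;

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        BlockedMultiply(N, B, X, Y, Z);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
        printf("B = %4d: %.3f s elapsed\n", B, secs);
    }
    free(X); free(Y); free(Z);
    return 0;
}
\end{verbatim}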
Table generally corroborates the conclusion that large block sizes are quite safe on this machine; the details demonstrate that measuring elapsed real time as well as UNIX process user time can be important, because the two sometimes tell different stories. Certain values of the block size (e.g., B=5, B=6, B=7) resulted in drastically poorer elapsed real time than neighboring block sizes (mostly due to page-fault delays), even though their user times continued to drop monotonically; somewhat surprising, though, is the elapsed real time degradation at B=320. (Ironically, B=320 achieved one of the best user times in the table.) A sophisticated model given the exact parameters of the operating system kernel's virtual memory subsystem might or might not have anticipated this last result; experimentation by the clever compiler, however, would have revealed it without the need for such detailed knowledge. Another example of the utility of measuring real time as well as user time is the comparison between B=256 and B=1280: although B=256 took nearly 50 seconds less user time, B=1280 caused sufficiently fewer page faults that it was faster in elapsed time by about 3.25 billion cycles, or about 36 seconds. On this machine B=128 appears to be overall the fastest, combining the three criteria of real time, user time, and page faults; however, on a similar machine outfitted with more or less memory, or on a machine configured to swap over the network, a different value might have proved optimal.
Table: Multiplying 1280x1280 integer matrices with different block sizes.
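A minimal sketch of how elapsed real time, user time, and page-fault counts can be gathered together on a UNIX system, assuming the standard getrusage() and clock_gettime() interfaces (this is not the actual instrumentation used for the tables), is the following:

\begin{verbatim}
#include <stdio.h>
#include <time.h>
#include <sys/resource.h>

/* Hypothetical helper: report elapsed real time, user CPU time, and major
 * (I/O-incurring) page faults for one run of the supplied function. */
void report_run(void (*run)(void))
{
    struct timespec r0, r1;
    struct rusage u0, u1;

    getrusage(RUSAGE_SELF, &u0);
    clock_gettime(CLOCK_MONOTONIC, &r0);
    run();
    clock_gettime(CLOCK_MONOTONIC, &r1);
    getrusage(RUSAGE_SELF, &u1);

    double real = (r1.tv_sec - r0.tv_sec) + (r1.tv_nsec - r0.tv_nsec) * 1e-9;
    double user = (u1.ru_utime.tv_sec - u0.ru_utime.tv_sec)
                + (u1.ru_utime.tv_usec - u0.ru_utime.tv_usec) * 1e-6;
    long faults = u1.ru_majflt - u0.ru_majflt;

    printf("real %.2f s, user %.2f s, major page faults %ld\n",
           real, user, faults);
}
\end{verbatim}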