Load Trace Playback Tool
========================

contact: pdinda@cs.cmu.edu
http://www.cs.cmu.edu/~pdinda/LoadTraces
http://www.cs.cmu.edu/~pdinda/LoadTraces/playback.html

THIS SOFTWARE COMES WITH NO WARRANTY OF ANY KIND

PERMISSION TO USE THIS SOFTWARE FOR NONCOMMERCIAL PURPOSES IS HEREBY
GRANTED TO ALL INTERESTED PARTIES PROVIDED ATTRIBUTION IS MADE.  USE,
DISTRIBUTION, OR REDISTRIBUTION FOR COMMERCIAL PURPOSES REQUIRES
SEPARATE LICENSING FROM THE AUTHOR.

Overview
--------

This document describes playload, a tool for playing a captured load
trace on a different machine.  "Playing a load trace" means generating
work (via dummy loops executed by a gang of processes) on the playback
machine that reflects the load average values sampled in the trace.
As a side effect, the load average on the playback machine will
approximate the load average samples in the trace.  If the machine has
no other load, the load average will approximately track the trace's
load average.

Playload was designed primarily to play 1 Hz load traces captured on
Digital Unix machines on other Digital Unix machines.  However, you
can also use it to play my 1 Hz Digital Unix traces on *any* Unix
machine, and (although this is mostly untested) it should be able to
play traces sampled at any frequency, even non-periodically, captured
on any Unix machine, on any other Unix machine.  We have tested it on
Digital Unix, Linux, FreeBSD, and Solaris machines.

A large number (800 MB) of 1 Hz Digital Unix traces are available from
http://www.cs.cmu.edu/~pdinda/LoadTraces/.  The raw traces are binary
and are in DEC Alpha byte order.  They are also available in a
simplified, combined form in network byte order, and some traces are
also available in ascii.

Operation
---------

To describe the operation of the system, we'll focus on a periodically
sampled trace.  The load average is an exponential average of the
number of processes in the system's ready-to-run queue.  Conceptually,
the length of the ready queue is periodically sampled, and these
samples x[i] flow through an exponential filter

   z[i+1] = exp(-T/tau)*z[i] + (1-exp(-T/tau))*x[i]

where T is the sampling interval.  The application-visible load
average is the z[i] series.

A load trace is gathered by periodically sampling the load average and
recording the time-stamped samples to a file.  The sample rate should
be at least twice as high as the sample rate of the in-kernel process,
so we rewrite the above equation like this:

   trace[i+1] = exp(-T/(2*tau))*trace[i] + (1-exp(-T/(2*tau)))*x'[i]

In load trace playback, we want the measured load average to track the
load average samples in the trace file.  To do this, we treat the
x'[i] in the above equation as the expected run-queue length during
the last T/2 seconds.  To determine the x'[i], we "deconvolve out" the
smoothing function using the method described in

   @InProceedings{WINNER-WS-98,
     author    = "Olaf Arndt and Bernd Freisleben and Thilo Kielmann
                  and Frank Thilo",
     title     = "Dynamic Load Distribution with the Winner System",
     booktitle = "Proceedings of Workshop Anwendungsbezogene
                  Lastverteilung (ALV'98)",
     year      = "1998",
     pages     = "77--88",
     month     = "March",
     note      = "Also available as Technische Universität München
                  Technical Report TUM-I9806",
   }

All that is necessary to do this is knowledge of the smoothing
constant tau for the machine on which the trace was taken.  For
Digital Unix, tau is 5 seconds.  On most other Unix systems, tau is 60
seconds.  A larger tau value means the load average behaves more
smoothly.
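To make the deconvolution concrete, here is a minimal sketch of
recovering the x'[i] series from a periodic trace by inverting the
filter equation above.  This is illustrative code written for this
document, not an excerpt from playload; the function name and array
interface are my own.

   #include <math.h>

   /* Recover the estimated run-queue lengths x'[i] from a
    * periodically sampled load trace by inverting the smoothing
    * filter
    *    trace[i+1] = d*trace[i] + (1-d)*x'[i],  d = exp(-T/(2*tau))
    * where T is the kernel's sampling interval (the trace is
    * sampled at twice the kernel rate, hence T/2) and tau is the
    * smoothing time constant of the traced machine.  Produces n-1
    * estimates from n trace samples.  Note that the estimates can
    * come out negative.
    */
   void deconvolve(const double *trace, double *xprime, int n,
                   double T, double tau)
   {
       double d = exp(-T / (2.0 * tau));
       for (int i = 0; i < n - 1; i++)
           xprime[i] = (trace[i + 1] - d * trace[i]) / (1.0 - d);
   }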
A value x'[i] represents the amount of contention the CPU saw during
the corresponding time interval.  To reproduce this contention, we
split the interval into smaller subintervals, each of which is bigger
than the scheduler quantum, and stochastically assign subprocesses to
either work or sleep during each subinterval.  For example, suppose
x'[i]=1.5.  We would assign subprocess 0 to work during a subinterval
with probability 1, and subprocess 1 to work during a subinterval with
probability 0.5 (see the sketch after the Feedback section below).

After each individual x'[i] value has been played, the load average on
the system is measured.  This measurement, measure[i], is compared
against trace'[i], the load trace *as it would have appeared given the
tau of the playback machine*.  trace'[i] is determined by smoothing
the x'[i] series with an exponential averaging filter having the
playback machine's tau.

Interacting With Other Load
---------------------------

Most likely, there will be other processes on the system, so it is
important to understand how the load that playload generates interacts
with them.  Two modes of operation are possible, time-based and
work-based.

In the time-based mode of operation, the contention probabilities
implied by the x'[i] value last only until the end of x'[i]'s interval
in real time.  Essentially, this means that the other load on the
machine can only amplitude modulate the played back load.  For
example, suppose there is a uniform 1.0 load on the machine and the
load trace dictates a 1.0 load from 0 to 1 second and zero elsewhere.
Then, the measured load will (ideally) be 2.0 from 0 to 1 second and
1.0 elsewhere.

In the work-based mode of operation, the load is interpreted as work
that must always be done, but which may be slowed down by other load.
This means that the other load on the system can also frequency
modulate the played load.  Using the same example as the previous
paragraph, the measured load would be 2.0 from 0 to 2 seconds and 1.0
elsewhere, because the competing load halves the rate at which the
played work completes.

Feedback
--------

By using error feedback, playload can sometimes track the load average
in the trace more accurately.  However, this feedback mechanism will
also try to correct for other load on the system, trying to make the
*total* load track the trace load.  You almost certainly do not want
this.
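To make the stochastic assignment described under Operation concrete,
here is a minimal sketch of how one subprocess might decide to work or
sleep for one subinterval (time-based flavor).  This is illustrative
code written for this document, not playload's actual code; the
function name, the spin loop, and the use of drand48 are my
assumptions.

   #include <stdlib.h>
   #include <unistd.h>
   #include <sys/time.h>

   /* Current time in seconds, as a double. */
   static double now(void)
   {
       struct timeval tv;
       gettimeofday(&tv, 0);
       return tv.tv_sec + tv.tv_usec / 1e6;
   }

   /* Have subprocess 'myrank' play one subinterval of 'secs' seconds
    * for the deconvolved load value xprime.  Subprocesses numbered
    * below floor(xprime) always work; the next one works with
    * probability equal to the fractional part; the rest sleep.  For
    * xprime=1.5, subprocess 0 works with probability 1 and
    * subprocess 1 with probability 0.5, as in the example above.
    */
   void play_subinterval(int myrank, double xprime, double secs)
   {
       int whole = (int)xprime;
       double frac = xprime - whole;
       int work = (myrank < whole) ||
                  (myrank == whole && drand48() < frac);
       double end = now() + secs;

       if (work) {
           volatile double sink = 0.0;   /* dummy work: spin on CPU */
           while (now() < end)
               sink += 1.0;
       } else {
           usleep((useconds_t)(secs * 1e6)); /* off the ready queue */
       }
   }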
Compiling and using playload
----------------------------

To compile playload, you need a C compiler, ideally gcc.  If you have
gcc, just run make.  If not, you may need to edit the Makefile.

playload is run as follows:

 % playload tauoftrace tauofhost feedback time|work Hz|nonperiodic \
            alpha|network|text tracefile

 tauoftrace  - the smoothing time constant of the machine where the
               trace was taken
 tauofhost   - the smoothing time constant of the machine on which
               playload is run
 feedback    - the feedback level (almost certainly negative)
 time        - indicates playback should be time-based
 work        - indicates playback should be work-based
 Hz          - a number giving the sampling rate of the trace or the
               sampling rate you want to use
 nonperiodic - indicates the timestamps in the trace should be used
 alpha       - indicates the trace is in DEC Alpha byte order
 network     - indicates the trace is in network byte order
 text        - indicates the trace is in ascii
 tracefile   - name of the trace file

For example:

 % playload 5.0 5.0 0.0 work 1.0 network axp0.23.97.trace

This plays the network byte order trace in the file axp0.23.97.trace
using the work-based approach.  The tau of the trace is 5.0 and the
tau of the host is 5.0.  The sample rate of the trace is 1.0 Hz.  This
is correct for playing back a Digital Unix trace on another Digital
Unix machine.  The feedback level is zero, which is usually best with
this version of playload.

Another example:

 % playload 5.0 60.0 0.0 time 1.0 text axp0.23.97.txt

This plays back the text trace in axp0.23.97.txt using the time-based
approach.  The other difference is that the host's tau is 60 seconds,
which is reasonable for most other Unix systems, such as Linux.

Trace Format
------------

playload supports three file formats: text, alpha, and network.

In the text format, each line of the file contains two numbers
separated by whitespace.  The first number is the timestamp and the
second number is the load average at that timestamp.  Timestamps are
ignored unless the nonperiodic option is used.

The alpha format is binary.  The file consists of pairs of double
precision (64 bit) floating point values stored in the native byte
order of DEC Alphas (LSB first).  The first number of each pair is the
timestamp, while the second is the load average at that timestamp.
This is the format most of our load traces are available in.

The network format is binary.  The file consists of pairs of double
precision (64 bit) floating point values stored in network byte order
(MSB first).  The first number of each pair is the timestamp, while
the second is the load average at that timestamp.

Output
------

Each sample in the load trace generates a line of output on standard
out.  Each line consists of 8 numbers separated by whitespace.  The
first line of output is special.  It begins with a comment character
(#) and describes the columns.  The columns are:

 Time      - the time immediately after finishing playing the load
             sample
 Desired   - what the load average sample from the trace would have
             been given this machine's tau and external load
 Measured  - the corresponding actual load average on the machine,
             measured immediately after finishing playing the sample
 Target    - the "deconvolved" load that was applied; for various
             reasons this can sometimes be negative
 Request   - MAX(0,Target) - what playload actually tried to play
 DMeasured - the "deconvolved" load that was measured
 Error     - the error, Target - Measured
 Percent   - the percentage error

Additional Tools
----------------

Also included in this distribution is a perl script, plot_playback.pl,
which can be used to compare the load trace, the desired load
measurements during playback (recall that the desired load is
basically the load average measurements we expect given this machine's
tau), and the actual measured load during playback.  In addition to a
statistical summary, plot_playback.pl will produce an encapsulated
postscript plot of the three signals.  You will need gnuplot for this
to work.

A tool called findtau will also be built.  This can be used to
estimate the tau of a host.  It's generally easier to apply the rule
of thumb that Mach-derived OSes have tau=5.0 and others have tau=60.0,
but if you suspect that your machine is different, you should try
findtau.

The routines in the file LoadTrace.c may be useful to you for reading
and writing load traces in various formats.
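To make the binary formats concrete, here is a minimal sketch of a
reader for the network format.  This is not the actual interface in
LoadTrace.c; the function names are mine, and it assumes the host
represents doubles in IEEE 754 with the same endianness as its
integers (true of all the platforms listed above).

   #include <stdio.h>
   #include <stdint.h>
   #include <string.h>

   /* Assemble a 64-bit IEEE double stored MSB-first (network byte
    * order), independent of the host's byte order.
    */
   static double ntoh_double(const unsigned char b[8])
   {
       uint64_t bits = 0;
       for (int i = 0; i < 8; i++)
           bits = (bits << 8) | b[i];   /* MSB first */
       double d;
       memcpy(&d, &bits, sizeof d);
       return d;
   }

   /* Read one (timestamp, load) pair from a network-format trace.
    * Returns 1 on success, 0 on EOF or short read.
    */
   int read_sample(FILE *f, double *timestamp, double *load)
   {
       unsigned char buf[16];
       if (fread(buf, 1, 16, f) != 16)
           return 0;
       *timestamp = ntoh_double(buf);
       *load      = ntoh_double(buf + 8);
       return 1;
   }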
Why Load Playback is Imperfect
------------------------------

You will probably discover that the measured load during playback
tracks the desired load only approximately in some cases.  To see an
extreme example of this, try playing back a load trace with a
continuous value of 0.5 - the mean measured load will be 0.5, but
you'll see quite a bit of variability.  This section explains why.

The sampling process that produces the x[i]s in the above equation is
the OS's scheduler, which sees only integral numbers of processes on
the ready queue.  We treat our approximation of x[i], x'[i], as the
expected value of x[i].  So, if x'[i]=0.5, we have a subprocess spend
50% of its time sleeping and 50% working, and we alternate randomly
between these extremes.  The expected run queue length the scheduler
would see is then E[x[i]] = 0.5*1 + 0.5*0 = 0.5.  However, the
scheduler will really see either 1 or 0.  The effect on the load
average in either case will be drastically different from the expected
value, resulting in error.  Furthermore, because the load average is
an exponential average, the error will persist over some amount of
time (it decays by a factor of 1/e after tau seconds).

Another way of thinking about this is to consider the second moment,
E[x[i]^2].  For this example, E[x[i]^2] = 0.5*1^2 + 0.5*0^2 = 0.5, so
the standard deviation of x[i] is

   sqrt(E[x[i]^2]-E[x[i]]^2) = sqrt(0.5-0.25) = 0.5

which explains the variability we see.

But, WAIT!  What if the sample rate of the trace or of playback is too
low, and the x'[i] value we compute or its corresponding measurement
is actually the result of more than one observation by the scheduler?
Won't that make the measured value closer to the expectation?  Yes, it
will, but there will still be error that gets propagated over time
through the exponential filter.  To extend the example above, suppose
the scheduler takes the average of n observations to determine x[i].
The expected value of the sum of those observations,
E[x_1+x_2+...+x_n], will be n/2, which, divided by n, gives us an
expected load of 0.5 as we desire.  The sum will be distributed
according to the binomial distribution, so the standard deviation of
the sum will be sqrt(n*0.5*(1-0.5)) = 0.5*sqrt(n), and the standard
deviation of the average will be 0.5/sqrt(n).

Another source of error is that an actual process on the ready queue
may not consume its full quantum.
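The variability described above is easy to reproduce in simulation.
The following sketch, written for this document rather than taken from
the distribution, feeds Bernoulli(0.5) ready-queue samples through the
exponential filter with Digital Unix's tau=5; the printed "load
average" wanders widely around 0.5 even though its mean is correct.

   #include <stdio.h>
   #include <stdlib.h>
   #include <math.h>

   /* Simulate the kernel's load average when the expected ready
    * queue length is a constant 0.5: each sample x[i] is 0 or 1 with
    * equal probability, smoothed by z[i+1] = d*z[i] + (1-d)*x[i].
    * Assumes T = 1 second between samples and tau = 5 seconds.
    */
   int main(void)
   {
       double tau = 5.0, T = 1.0;
       double d = exp(-T / tau);
       double z = 0.5;                 /* start at the mean */

       for (int i = 0; i < 60; i++) {
           int x = drand48() < 0.5;    /* scheduler sees 0 or 1 */
           z = d * z + (1.0 - d) * x;
           printf("%d %.3f\n", i, z);  /* time, load average */
       }
       return 0;
   }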
Why Do Real Traces Play More Accurately Than Synthetic Ones?
------------------------------------------------------------

In almost all cases, when a real load trace that has been sampled at a
sufficiently high rate is played, the x'[i]s are close to integers.
Intuitively, this makes sense - the scheduler can only observe
integral numbers of processes on the ready queue, and if our estimates
of its observations (the x'[i]s) are accurate, they should also mostly
be integers.

It's easy to construct synthetic traces that are not sensible in terms
of short term behavior.  One such trace is the continuous 0.5 example
discussed in the previous section.  Deconvolving the trace produces
x'[i]s slightly above 0.5.  Playload reasonably produces a situation
where the expected ready queue length is 0.5 by having a process spend
half of its time working and half sleeping.  However, the observations
the kernel makes (the x[i]s) will be either 0 or 1.  Thus the measured
load trace will vary widely.  The *average* load average will match
the 0.5 we desire, but the short term behavior will not.  It's
important to note that playload is doing what the user probably
*wants* (keeping the CPU 50% busy, which can be verified with vmstat),
but the load average fails to be a good measure of how well the
generated load conforms.

Another way a synthetic trace can fail is to have very abrupt changes.
For example, a square wave will be reproduced with lots of overshoot
and error.  In order for the abrupt swings of a square wave to have
been the output of the exponential smoothing, the input x[i] must have
been much more abrupt, and the estimates x'[i] will also be quite
large.  This means playload has to have many processes contend for the
CPU at once, which raises the variability considerably.

The best way to produce synthetic traces is to create a string of
integer-valued x[i]s and smooth them with the appropriate tau.
Another possibility is to present these x[i]s directly to playload
with a very small tauoftrace value.

Why Don't You Split The Load Evenly Among Your Subprocesses?
------------------------------------------------------------

This has to do with minimizing the second moment, E[x[i]^2], which is
described above.  Suppose we want to reproduce a value L = x'[i] in
the range 1 to 2 using two processes, a and b.  We'll denote process a
being on the ready queue as a=1 and off the ready queue as a=0.
Suppose the probability of process a being on the ready queue is
P[a=1]=L/2, and for b it is P[b=1]=L/2.  Now, let's compute the
expectations:

 a  b  a+b  (a+b)^2  probability
 0  0   0      0     (1-L/2)^2
 0  1   1      1     (L/2)(1-L/2)
 1  0   1      1     (L/2)(1-L/2)
 1  1   2      4     (L/2)(L/2)

The expected value of the scheduler's sample, E[x[i]] = E[a+b] = L, as
we desire.  The second moment, E[x[i]^2] = E[(a+b)^2] = L^2/2 + L.

Now consider what happens if we instead choose P[a=1]=1 and
P[b=1]=L-1, which is what playload does:

 a  b  a+b  (a+b)^2  probability
 0  0   0      0     0
 0  1   1      1     0
 1  0   1      1     1-(L-1)
 1  1   2      4     L-1

Here, E[a+b] = L, as we want, but E[(a+b)^2] = 3L-2.  This is less
than L^2/2 + L in the range [1,2) and equal for L=2.  Thus, by not
sharing the load equally, we reduce the variance of the scheduler's
observations and increase the likelihood that the scheduler will
observe what we want it to observe.
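The algebra above is easy to check numerically.  This short program,
an illustration written for this document, evaluates both second
moments over L in [1,2]:

   #include <stdio.h>

   /* Compare the second moment E[(a+b)^2] of the two assignment
    * schemes for a target load L in [1,2]:
    *   even split: P[a=1] = P[b=1] = L/2    ->  L^2/2 + L
    *   playload's: P[a=1] = 1, P[b=1] = L-1 ->  3L - 2
    */
   int main(void)
   {
       for (double L = 1.0; L <= 2.0; L += 0.25) {
           double even = 2.0 * (L/2) * (1 - L/2) * 1 /* (a+b)^2 = 1 */
                       + (L/2) * (L/2) * 4;          /* (a+b)^2 = 4 */
           double skew = (2.0 - L) * 1               /* a=1, b=0 */
                       + (L - 1.0) * 4;              /* a=1, b=1 */
           printf("L=%.2f  even=%.3f  playload=%.3f\n",
                  L, even, skew);
       }
       return 0;
   }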