Load Trace Playback Tool
========================

contact: pdinda@cs.cmu.edu

http://www.cs.cmu.edu/~pdinda/LoadTraces
http://www.cs.cmu.edu/~pdinda/LoadTraces/playback.html

THIS SOFTWARE COMES WITH NO WARRANTY OF ANY KIND

PERMISSION TO USE THIS SOFTWARE FOR NONCOMMERCIAL PURPOSES IS HEREBY
GRANTED TO ALL INTERESTED PARTIES PROVIDED ATTRIBUTION IS MADE.

USE, DISTRIBUTION, OR REDISTRIBUTION FOR COMMERCIAL PURPOSES REQUIRES
SEPARATE LICENSING FROM THE AUTHOR.


Overview
--------

This document describes playload, a tool for playing a captured load
trace on a different machine.  "Playing a load trace" means generating
work (via dummy loops executed by a gang of processes) on the playback
machine that reflects the load average values sampled in the trace.
As a side effect, the load average on the playback machine will
approximate the load average samples in the trace.  If the machine has
no other load, then the load average will approximately track the
trace's load average.

Playload was designed primarily to play 1 Hz load traces captured on
Digital Unix machines on other Digital Unix machines.  However, you
can also use it to play my 1 Hz Digital Unix traces on *any* Unix
machine, and (although this is mostly untested) it should be able to
play traces sampled at any frequency, even non-periodically, and
captured on any Unix machine, on any other Unix machine.  We have
tested it on Digital Unix, Linux, FreeBSD, and Solaris machines.

A large number (800 MB) of 1 Hz Digital Unix traces are available from
http://www.cs.cmu.edu/~pdinda/LoadTraces/.  The raw traces are binary
and in DEC Alpha byte order.  They are also available in a
simplified, combined form in network byte order, and some traces are
also available in ASCII.


Operation
---------

To describe the operation of the system, we'll focus on a
periodically sampled trace.

The load average is an exponential average of the number of processes
in the system's ready-to-run queue.  Conceptually, the length of the
ready queue is periodically sampled, and these samples x[i] flow
through an exponential filter

            z[i+1] = (exp(-T/tau))*z[i] + (1-exp(-T/tau))*x[i]

where T is the sampling interval and the application-visible load
average is the z[i] series.  A load trace is gathered by periodically
sampling the load average and recording the time-stamped samples to a
file.  The sample rate should be at least twice as high as the sample
rate of the in-kernel process, so we rewrite the above equation like
this:

    trace[i+1] = (exp(-T/(2*tau)))*trace[i] + (1-exp(-T/(2*tau)))*x'[i]

In load trace playback, we want the measured load average to track the
load average samples in the trace file.  To do this, we treat the
x'[i] in the above equation as the expected run-queue length during
the last T/2 seconds.  To determine the x'[i], we "deconvolve out" the
smoothing function using the method described in

@InProceedings{WINNER-WS-98,
  author    = "Olaf Arndt and Bernd Freisleben and Thilo Kielmann
               and Frank Thilo",
  title     = "Dynamic Load Distribution with the Winner System",
  booktitle = "Proceedings of Workshop Anwendungsbezogene
               Lastverteilung (ALV'98)",
  year      = "1998",
  pages     = "77--88",
  month     = "March",
  note      = "Also available as Technische Universität München
               Technical Report TUM-I9806",
}

All that is necessary to do this is knowledge of the smoothing
constant tau for the machine on which the trace was taken.  For
Digital Unix, tau is 5 seconds.  On most other Unix systems, tau is 60
seconds.  A larger tau means that the load average behaves more
smoothly.
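
As a rough illustration, here is one way to invert the rewritten
equation above in C.  This is only a sketch, not the actual playload
source (nor necessarily the exact method of the cited paper), and the
function and variable names are made up:

    /* Recover the implied run-queue lengths x'[] from load-average
       samples trace[] taken every T seconds on a machine with
       smoothing constant tau, by inverting the exponential filter. */
    #include <math.h>

    void deconvolve(const double *trace, double *x, int n,
                    double T, double tau)
    {
        int i;
        double g = exp(-T / (2.0 * tau));  /* gain per half-interval */

        for (i = 0; i + 1 < n; i++) {
            /* invert: trace[i+1] = g*trace[i] + (1-g)*x[i] */
            x[i] = (trace[i + 1] - g * trace[i]) / (1.0 - g);
        }
    }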


A value x'[i] represents the amount of contention the CPU saw during
the time interval.  To reproduce this contention, we split the
interval into smaller subintervals, each of which is bigger than the
scheduler quantum, and stochastically assign subprocesses to either
work or sleep during each subinterval.  For example, suppose
x'[i]=1.5.  We would assign subprocess 0 to work during a subinterval
with probability 1, and subprocess 1 to work during a subinterval
with probability 0.5.
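
To make the stochastic assignment concrete, here is a small sketch
(again, not playload's actual code; should_work is a hypothetical
helper).  Subprocess k works during a subinterval with probability
min(1, max(0, x'[i] - k)):

    #include <stdlib.h>

    /* Decide whether subprocess k should work during the next
       subinterval, given the deconvolved load x for this interval. */
    int should_work(int k, double x)
    {
        double p = x - k;
        if (p >= 1.0) return 1;                  /* always work  */
        if (p <= 0.0) return 0;                  /* always sleep */
        return ((double)rand() / RAND_MAX) < p;  /* work w.p. p  */
    }

For x'[i]=1.5, should_work(0, 1.5) always returns 1 and
should_work(1, 1.5) returns 1 with probability 0.5, matching the
example above.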

After each individual x'[i] value has been played, the load average
on the system is measured.  This measurement, measure[i], is compared
against the load trace *as it would have appeared given the tau of
the playback machine*, trace'[i].  trace'[i] is determined by
smoothing the x'[i] series with an exponential averaging filter
having the playback machine's tau.
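
A minimal sketch of how trace'[i] could be computed, assuming the
same half-interval filter form as above (the names and the idle
starting value are illustrative):

    #include <math.h>

    /* Re-smooth the deconvolved x'[] series with the playback
       machine's tau to get trace'[], the trace as it would have
       appeared on the playback machine.  Assumes an idle start. */
    void resmooth(const double *x, double *traceprime, int n,
                  double T, double tau_host)
    {
        int i;
        double g = exp(-T / (2.0 * tau_host));

        traceprime[0] = 0.0;
        for (i = 0; i + 1 < n; i++)
            traceprime[i + 1] = g * traceprime[i] + (1.0 - g) * x[i];
    }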


Interacting With Other Load
---------------------------

Most likely, there will be other processes on the system, so it is
important to understand how the load that playload generates
interacts with them.  Two modes of operation are possible: time-based
and work-based.

In the time-based mode of operation, the contention probabilities
implied by the x'[i] value last only until the end of x'[i]'s interval
in real time.  Essentially, this means that the other load on the
machine can only amplitude modulate the played back load.  For example,
suppose there is a uniform 1.0 load on the machine and the load trace
dictates a 1.0 load from 0 to 1 second and zero elsewhere.  Then, the
measured load will (ideally) be 2.0 from 0 to 1 second and 1.0
elsewhere. 

In the work-based mode of operation, the load is interpreted as work
that must always be done, but which may be slowed down by other load.
This means that the other load on the system can also frequency
modulate the played load.  Using the same example as in the previous
paragraph, the measured load would be 2.0 from 0 to 2 seconds and 1.0
elsewhere.


Feedback
--------

By using error feedback, playload can sometimes more accurately track
the load average in the trace.  However, this feedback mechanism will
also try to correct for other load on the system, trying to make the
*total* load track the trace load.  You almost certainly do not want
this.



Compiling and using playload
----------------------------

To compile playload, you need a C compiler, ideally gcc.  If you
have gcc, just run make.  If not, you may need to edit the Makefile.

playload is run as follows:

  % playload tauoftrace tauofhost feedback time|work
             Hz|nonperiodic alpha|network|text tracefile

  tauoftrace  - the smoothing time constant from the machine where the
                trace was taken 

  tauofhost   - the smoothing time constant of the machine on which 
                playload is run 

  feedback    - the feedback level (almost certainly negative) 

  time        - indicates playback should be time-based
  work        - indicates playback should be work-based

  Hz          - a number indicating the sampling rate of the trace or
                the sampling rate you want to use
  nonperiodic - indicates the timestamps of the trace should be used 

  alpha       - indicates the trace is in DEC alpha byte order
  network     - indicates the trace is in network byte order
  text        - indicates the trace is in ascii

  tracefile   - name of the trace file

For example:

   %  playload 5.0 5.0 0.0 work 1.0 network axp0.23.97.trace

This plays the network byte order trace in the file axp0.23.97.trace
using the work-based approach.  The tau of the trace is 5.0 and the
tau of the host is 5.0.  The sample rate of the trace is 1.0 Hz.  This
is correct for playing back a Digital Unix trace on another Digital Unix
machine.  The feedback level is zero, which usually works best with
this version of playload.

Another example:

   %  playload 5.0 60.0 0.0 time 1.0 text axp0.23.97.txt

This plays back the text trace in axp0.23.97.txt using the time-based
approach.  The other difference is that the host's tau is 60 seconds,
which is reasonable for most other Unix systems, such as Linux.
   

Trace Format
------------

playload supports three file formats: text, alpha, and network.  

In the text format, each line of the file contains two numbers
separated by whitespace.  The first number is the timestamp and the
second number is the load average at that timestamp.  Timestamps
are ignored unless the nonperiodic option is used.

The alpha format is binary.  The file consists of pairs of double
precision (64 bit) floating point values stored in the native byte
order (LSB first) of DEC Alphas.  The first number of each pair is
the timestamp while the second is the load average for that
timestamp.  This is the format in which most of our load traces are
available.

The network format is binary.  The file consists of pairs of double
precision (64 bit) floating point values stored in network byte order
(MSB).  The first number of each pair is the timestamp while the
second is the load average for that timestamp.  
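
The distribution's LoadTrace.c contains the real reading and writing
routines; the following is only a sketch of reading the network
format, assuming IEEE 754 doubles and using ntohs() to detect the
host's byte order:

    #include <stdio.h>
    #include <arpa/inet.h>

    /* Convert a double from network (MSB first) to host byte order. */
    static double ntoh_double(double d)
    {
        if (ntohs(1) != 1) {   /* little-endian host: swap the bytes */
            unsigned char *b = (unsigned char *)&d;
            unsigned char t;
            int i;
            for (i = 0; i < 4; i++) {
                t = b[i]; b[i] = b[7 - i]; b[7 - i] = t;
            }
        }
        return d;
    }

    /* Read one (timestamp, load) pair; returns 1 on success. */
    int read_sample(FILE *f, double *timestamp, double *load)
    {
        double pair[2];
        if (fread(pair, sizeof(double), 2, f) != 2)
            return 0;
        *timestamp = ntoh_double(pair[0]);
        *load = ntoh_double(pair[1]);
        return 1;
    }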


Output
------

Each sample in the load trace generates a line of output on standard
output.  Each line consists of 8 numbers separated by whitespace.

The first line of output is special.  It begins with a comment
character (#) and describes the columns.  The columns are:

Time      -  the time immediately after finishing playing the load sample
Desired   -  what the load average sample from the trace would have
             been given this machine's tau and external load
Measured  -  the corresponding actual load average on the machine,
             measured immediately after finishing playing the sample
Target    -  the "deconvolved" load that was applied; for various
             reasons this can sometimes be negative
Request   -  MAX(0,Target), what playload actually tried to play
DMeasured -  the "deconvolved" load that was measured
Error     -  the error, Target - Measured
Percent   -  the percentage error
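
If you want to post-process this output yourself, a sketch like the
following (hypothetical, not part of the distribution) reads the
columns and reports the mean absolute Desired-versus-Measured error:

    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        char line[256];
        double t, des, meas, targ, req, dmeas, err, pct;
        double sum = 0.0;
        int n = 0;

        while (fgets(line, sizeof(line), stdin)) {
            if (line[0] == '#')
                continue;   /* skip the header line */
            if (sscanf(line, "%lf %lf %lf %lf %lf %lf %lf %lf",
                       &t, &des, &meas, &targ, &req, &dmeas,
                       &err, &pct) == 8) {
                sum += fabs(des - meas);
                n++;
            }
        }
        if (n > 0)
            printf("mean |Desired - Measured| = %f over %d samples\n",
                   sum / n, n);
        return 0;
    }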


Additional Tools
----------------

Also included in this distribution is a perl script, plot_playback.pl,
which can be used to compare the load trace, the desired load
measurements during playback (recall that the desired load is
basically the load average measurements we expect given this machine's
tau), and the actual measured load during playback.   

In addition to a statistical summary, plot_playback.pl will also
produce an encapsulated postscript plot of the three signals.  You
will need gnuplot for this to work.  

A tool called findtau will also be produced.  This can be used to
estimate the tau of a host.  It's generally easier to apply the rule
of thumb that Mach-derived OSes have tau=5.0 and others have tau=60.0,
but if you suspect that your machine is different, you should try
using findtau.

The routines in the file LoadTrace.c may be useful to you for reading
and writing load traces in various formats.


Why Load Playback is Imperfect
------------------------------

You will probably discover that the measured load during playback
tracks the desired load only approximately in some cases.  To see an
extreme example of this, try playing back a load trace with a
continuous value of 0.5 - the mean measured load will be 0.5, but
you'll see quite a bit of variability.  This section explains why.

The sampling process that produces the x[i]s in the above equation is
the OS's scheduler, which sees only integral numbers of processes on
the ready queue.  We treat our approximation of x[i], x'[i], as the
expected value of x[i].  So, if x'[i]=0.5, we have a subprocess spend
50% of its time sleeping and 50% working, and we alternate randomly
between these extremes.  The expected run queue length the scheduler
would see is then E[x[i]] = 0.5*1 + 0.5*0 = 0.5.  However, the
scheduler will really see either 1 or 0.  The effect on the load
average in either case will be drastically different from the
expected value, resulting in error.  Furthermore, because the load is
an exponential average, the error will persist over some amount of
time (it declines to 1/e, roughly a third, of its initial value after
tau seconds).  Another way of thinking about this is to consider the
second moment, E[x[i]^2].  For this example, E[x[i]^2] = 0.5*1^2 +
0.5*0^2 = 0.5, so the standard deviation of x[i] is
sqrt(E[x[i]^2]-E[x[i]]^2) = sqrt(0.5-0.25) = 0.5, which explains the
variability we see.

But, WAIT!  What if the sample rate of the trace or of playback is too
low and the x'[i] value we compute or its corresponding measurement is
actually the result of more than one observation by the scheduler?
Won't that make the measured value closer to the expectation?  Yes, it
will, but there will still be error that gets propagated over time
through the exponential filter.  To extend the example above, suppose
the scheduler takes the average of n observations to determine x[i].
The expected value of the sum of those observations,
E[x_1+x_2+...+x_n], will be n/2, which, divided by n, gives us an
expected load of 0.5 as we desire.  The sum will be distributed
according to the binomial distribution, so the standard deviation of
the sum will be sqrt(n*0.5*(1-0.5)) = 0.5*sqrt(n), and the standard
deviation of the average will be 0.5/sqrt(n).
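
You can see this variability for yourself with a short simulation (a
sketch, using illustrative Digital Unix-like parameters): feed
Bernoulli(0.5) ready-queue observations through the exponential
filter and watch the output bounce around 0.5:

    #include <stdio.h>
    #include <stdlib.h>
    #include <math.h>

    int main(void)
    {
        double tau = 5.0, T = 1.0;  /* Digital Unix-like values */
        double g = exp(-T / tau);
        double z = 0.0;
        int i;

        for (i = 0; i < 60; i++) {
            int x = rand() % 2;          /* scheduler sees 0 or 1 */
            z = g * z + (1.0 - g) * x;   /* load average update   */
            printf("%d %f\n", i, z);     /* mean ~0.5, but noisy  */
        }
        return 0;
    }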

Another source of error is that an actual process on the ready queue
may not consume its full quantum.  


Why Do Real Traces Play More Accurately Than Synthetic Ones?
------------------------------------------------------------

In almost all cases, when a real load trace that has been sampled at
a sufficiently high rate is played, the x'[i]s are close to integers.
Intuitively, this makes sense - the scheduler can only observe
integral numbers of processes on the ready queue, and if our estimates
of its observations (the x'[i]s) are accurate, they should also mostly
be integers.  

It's easy to construct synthetic traces that are actually not sensible
in terms of short term behavior.  One such trace is the continuous 0.5
example discussed in the previous section.  Deconvolving the trace
produces x'[i]s slightly above 0.5.  Playload reasonably produces a
situation where the expected ready queue length is 0.5 by having a
process spend half of its time working and half sleeping.  However,
the observations the kernel makes (the x[i]s) will be either 0 or 1.
Thus the measured load trace will vary widely.  The *average* load
average will match the 0.5 we desire, but the short term behavior
will not.  It's important to note that playload is doing what the
user probably *wants* (keeping the CPU 50% busy, which can be
verified with vmstat), but the load average fails to be a good
measure of how well the generated load conforms.

Another way a synthetic trace can fail is to have very abrupt changes.
For example, a square wave will be reproduced with lots of overshoot
and error.  In order for the abrupt swings of a square wave to have
been the output of the exponential smoothing, the input x[i] must have
been much more abrupt, and the estimate x'[i] will also be quite
large.  This means playload has to have many processes try to contend
for the CPU at once, which raises the variability considerably.

The best way to produce synthetic traces is to create a string of
integer-valued x[i]s and smooth them with the appropriate tau.
Another possibility is to present these x[i]s directly to playload
with a very small tauoftrace value.
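
For example, here is a sketch of a generator in that spirit (the
parameters and the choice of random integers are illustrative),
writing the text trace format described above:

    #include <stdio.h>
    #include <stdlib.h>
    #include <math.h>

    int main(void)
    {
        double tau = 5.0, T = 1.0;  /* match the target machine */
        double g = exp(-T / tau);
        double z = 0.0;
        int i;

        for (i = 0; i < 600; i++) {
            int x = rand() % 3;           /* integer-valued x[i]s   */
            z = g * z + (1.0 - g) * x;    /* smooth with tau        */
            printf("%f %f\n", i * T, z);  /* "timestamp load" lines */
        }
        return 0;
    }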


Why Don't You Split The Load Evenly Among Your Subprocesses?
------------------------------------------------------------

This has to do with minimizing the second moment, E[x[i]^2], which is
described above.  Suppose we want to reproduce a load L = x'[i] in
the range 1 to 2 using 2 processes, a and b.  We'll write a=1 when
process a is on the ready queue and a=0 when it is not.  The
probability of process a being on the ready queue is P[a=1]=L/2, and
for b it is P[b=1]=L/2.  Now, let's compute the expectations:

a b  a+b  (a+b)^2   probability
0 0   0     0       (1-L/2)^2
0 1   1     1       (L/2)(1-L/2)
1 0   1     1       (L/2)(1-L/2)
1 1   2     4       (L/2)(L/2)

The expected value of the scheduler's sample, E[x[i]] = E[a+b] = L, as
we desire.  The second moment, E[x[i]^2] = E[(a+b)^2] = L^2/2 + L.

Now consider what happens if we instead choose P[a=1]=1 and
P[b=1]=L-1, which is what playload does:

a b  a+b  (a+b)^2   probability
0 0   0     0       0
0 1   1     1       0
1 0   1     1       1-(L-1)
1 1   2     4       L-1

Here, E[a+b] = L, as we want, but E[(a+b)^2] = 3L-2.  This is less
than L^2/2 + L in the range [1,2) and equal at L=2.  Thus, by not
sharing the load equally, we reduce the variance of the scheduler's
observations and increase the likelihood that the scheduler will
observe what we want it to observe.
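
A quick numerical check of the two second moments (a standalone
sketch, not part of the distribution):

    #include <stdio.h>

    int main(void)
    {
        double L;
        for (L = 1.0; L <= 2.0; L += 0.25) {
            double even   = L * L / 2.0 + L;  /* P[a=1] = P[b=1] = L/2    */
            double skewed = 3.0 * L - 2.0;    /* P[a=1] = 1, P[b=1] = L-1 */
            printf("L=%.2f  even=%.3f  skewed=%.3f\n", L, even, skewed);
        }
        return 0;
    }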