PVM Implementations of Fx and Archimedes

Talk Abstract
pdinda@cs.northwestern.edu

Ports by: Peter A. Dinda (Fx) and David R. O'Hallaron (Archimedes)

Introduction

This talk discusses two parallel compiler systems that were ported from the iWarp supercomputer to PVM, and relates our experiences with PVM both as a compiler target and as a vehicle for users. The first of these systems, Fx, compiles a variant of High Performance Fortran (HPF), while the second, Archimedes, compiles finite element method codes.

In general, we found PVM an easy environment to port to, but at a cost in performance. PVM was considerably slower than the native communication system on each of the machines we examined (DEC Alphas with Ethernet, FDDI, and HiPPI; the Intel Paragon; and the Cray T3D). Much of this slowdown is probably due to the extra copying needed to provide PVM's programmer-friendly semantics - semantics that, as compiler writers, we do not need. Although PVM goes a long way toward making parallel programs portable, we found it necessary to make minor (Paragon) to major (T3D) modifications to run PVM programs on MPPs.

The details of running PVM programs are hard to hide from users. Although our toolchain hides the details of compiling and linking for PVM, once an executable is produced, the user is left to deal with hostfiles, daemons, and other details of execution - issues that are nonexistent under the operating systems of MPPs.

Fx

The Fx language is a variant of High Performance Fortran (HPF) which integrates task parallelism into the overall data parallel HPF framework. Data parallelism is expressed by Fortran 90 array assignment statements and parallel loops over distributed arrays. Task parallelism allows the programmer to instantiate several data parallel routines at once and specify how data flows between them. For example, a two-dimensional FFT could be decomposed into a parallel loop over the rows followed by a parallel loop over the columns. With task parallelism, both loops could operate at the same time, forming a pipeline. Fx has been used to build or parallelize a number of real applications, including Air Quality Modeling, Stereo Vision, Synthetic Aperture RADAR, Earthquake Ground Motion Modeling, Magnetic Resonance Imaging, and Narrowband Tracking RADAR.

The Fx compiler translates an Fx program into a SPMD Fortran 77 program that calls on the Fx run-time system to perform communication. The F77 source is compiled using the native Fortran compiler and linked with the run-time system and PVM libraries. Porting to PVM involved minor changes to the compiler, mostly to support program startup and shutdown, and writing a PVM-based run-time system.

On a workstation cluster, a parallel Fx program exhibits the behavior a user would expect from a sequential program. When the executable is run, it spawns the necessary number of copies of itself using PVM. It also spawns a monitor program which gracefully shuts down the application should a problem arise in any task. All I/O (except for parallel file I/O) is performed by the process that was spawned by the user - thus the user can use the Fx program like any other Unix program. For example, the user could include it in a pipeline. On an MPP, program startup varies from machine to machine. For example, on the Paragon, the user must run a "wrapper" program which spawns all the copies of the Fx program.

PVM lets Fx target workstation clusters which, for some applications, prove significantly better than MPPs. For example, the chemical reaction component of the Air Quality Modeling application runs as fast on four DEC Alphas as on 32 nodes of the Intel Paragon.

Archimedes

Archimedes is a compiler system for unstructured finite element codes. Given a problem geometry and the finite element algorithm (written in C) to apply, Archimedes will generate and partition a mesh. The mesh is then mapped onto the nodes of a parallel computer and the appropriate communication is generated. Archimedes is used in our Earthquake Ground Motion Modeling project at CMU.

PVM as a Compiler Target

PVM's API and semantics make it easy for programmers to use it directly. However, as compiler writers, we are prepared to use lower level or more complex interfaces in return for better performance. In fact, we have done just that in the Gigabit Nectar and Credit Net testbeds.

Portability is another concern we have with PVM. Although PVM programs are highly portable among workstations, each MPP's implementation of PVM seems to be different, requiring considerable special casing in order to achieve portability. For us, the T3D's implementation required the most changes, while the Paragon's required the fewest.

PVM for Users

Since Fx and Archimedes users are not directly programming using a message passing system, we want to hide it from them as much as possible. This has proven to be difficult in the case of PVM, mostly because of the PVM daemon and overlapping virtual machines.

We expose starting the PVM daemon to the user because we want a nondefault executable path for each host - something we cannot configure using pvm_addhosts(). Further, in practice, starting daemons on different machines in a network environment as complex as CMU's can be quite painful due to differing security mechanisms and machines equipped with multiple network adaptors. Finally, since we use only task-to-task communication (RouteDirect), and don't need dynamic virtual machines, the daemon seems superfluous.
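Per-host executable paths of the kind described above can be set in the hostfile handed to the master pvmd at daemon startup, using the ep= option (dx= similarly locates the daemon itself, and a leading "*" line sets defaults for subsequent hosts). The hostnames and paths below are made up for illustration.

```
# Illustrative PVM hostfile (hostnames and paths are examples only)
* ep=$HOME/fx/bin:$PVM_ROOT/bin/$PVM_ARCH   # default executable path
alpha1.cs.cmu.edu
alpha2.cs.cmu.edu ep=/usr/local/fx/bin      # per-host override
alpha3.cs.cmu.edu dx=/usr/local/pvm3/lib/pvmd
```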

Because each PVM user establishes his own virtual machine and different virtual machines can contain the same computer, application performance can vary considerably with few clues as to why. Although it is possible to run PVM jobs under a queueing system such as DQS, it seems like job queueing on a single, shared virtual machine would be a natural extension to PVM and eliminate the need for users to have to deal with more than one tool. Such an extension would also make it easier to hide PVM from our users by centralizing PVM configuration under our direct control.

Conclusion

Despite the difficulties discussed above, it was relatively easy to port Fx and Archimedes to PVM, and then target different architectures including MPPs. The downside is that no PVM implementation we used is anywhere near as fast as the native message passing system, so for us PVM is largely a tool for rapid porting, but not for production-level use.