10.0 MPI

MPI (Message Passing Interface) is a standard for writing parallel programs in message-passing environments. For more information please see the web site, <http://www.epm.ornl.gov/~walker/mpi/>.

The current Legion implementation supports a core MPI interface, which includes message passing, data marshaling, and heterogeneous conversion. Legion supports legacy MPI codes and provides an enhanced MPI environment that uses Legion features such as security and placement services. A link-in replacement MPI library uses the primitive services provided by Legion to support the MPI interface. MPI processes map directly to Legion objects.

10.1 Task classes

MPI implementations generally require that MPI executables reside in a given place on disk. Legion's implementation objects serve a similar role. We have provided a tool, legion_mpi_register, to register executables of different architectures for use with Legion MPI.

10.2 Installing Legion MPI

The Legion MPI library can be found in $LEGION/src/ServiceObjects/MPI.

10.3 Compilation

MPI code may be compiled as before and linked against libmpi and the basic Legion libraries. The final link must be performed with a C++ compiler or by passing the appropriate C++ flags to ld. A sample Legion MPI makefile is shown in Figure 9 (below).

Figure 9: Sample Legion MPI makefile

C       =       gcc
CC      =       g++
F77     =       g77
OPT     =       -g
INC_DIR =       /home/appnet/Legion/include/MPI
FFLAGS  =       -I$(INC_DIR) -ffixed-line-length-none
CFLAGS  =       $(OPT) -I$(INC_DIR)

mpitest: mpitest.o
        legion_link -Fortran -mpi mpitest.o -o mpitest

mpitest.o: mpitest.f
        $(F77) -c mpitest.f -I$(INC_DIR) $(FFLAGS)

mpitest_c: mpitest_c.o
        legion_link -mpi mpitest_c.o -o mpitest_c

mpitest_c.o: mpitest_c.c
        $(C) $(CFLAGS) -c mpitest_c.c

10.4 Register compiled tasks

Run legion_mpi_register. Usage of this tool is:

legion_mpi_register <class name> 
	<binary path> <platform type>

The example below registers the binary /myMPIprograms/vdelay under the class name vdelay, using the Linux architecture.

$ legion_mpi_register vdelay /myMPIprograms/vdelay linux

This creates MPI-specific contexts (/mpi, /mpi/programs, and /mpi/instances), creates a class object and an implementation for this program, and registers the class name in Legion context space.

The legion_mpi_register command may be executed several times if you have compiled your program on several architectures.
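For example, assuming you have also compiled vdelay for a second platform and placed that binary at the hypothetical path /myMPIprograms/solaris/vdelay (use whatever platform names your Legion installation recognizes), you would register it under the same class name:

$ legion_mpi_register vdelay /myMPIprograms/solaris/vdelay solaris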

10.5 Running the MPI application

MPI programs are started using the program legion_mpi_run:

legion_mpi_run {-f <options file> [<flags>]} | 
	{-n <number of processes> [<flags>] <command> [<args>]}

Parameters used for this command are:

-f <options file>

Allows users to run multiple MPI binaries with a common MPI_COMM_WORLD

-n <number of processes>

Specify the number of processes to create for the program

Supported <flags> are:

-h <host set context path>

Specify the set of hosts on which the program will run. The context path should not resolve to a particular host but to a context containing the names of one or more hosts. The default setting is the system's default placement: i.e., the system will pick a compatible host and try to run the object there. If that fails, the system will try another compatible host.

<host context name>

Runs the first process (i.e., process zero) on this node

-p <PID context>

Specify a context name for PIDs (default is /mpi/instances/program_name)

-S

Print statistics at exit

-v

Verbose option (up to four of these can be used for increased detail)

-d <Unix path name>

Specify that all children change to the specified directory before they begin to run.

Note that if the -f flag is used, the -n flag, program name, and any arguments will be ignored. Any <flags> used with -f will be treated as defaults and applied to all processes executed by this command, unless otherwise specified in the options file. The legion_mpi_run utility will expect one binary per line in the options file, each line including any necessary arguments as they would appear on the command line. Each line can also contain any of the legion_mpi_run flags except the -f flag.
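For example, a hypothetical options file (call it myrun.opt; the program names, process counts, and host set context are illustrative) might start two different binaries in a common MPI_COMM_WORLD:

/mpi/programs/vdelay -n 2
/mpi/programs/mpitest_c -n 4 -h /mpi-hosts

It could then be run, with -d applied to all processes as a default, with:

$ legion_mpi_run -f myrun.opt -d /tmp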

If the -f flag is not used, the MPI application named in the legion_mpi_run command will then be started with the given flags.

For example:

$ legion_mpi_run -n 2 /mpi/programs/vdelay

Note that the <command> argument here is the Legion context space name for the class created by legion_mpi_register in the previous section. In this example the default context /mpi/programs is used with the class name vdelay.

You can examine the running objects of your application using

$ legion_ls /mpi/instances/program_name

This context will have an entry for each object in this MPI application. If you run multiple versions of the application simultaneously, use the -p flag with legion_mpi_run to specify an alternate context in which to put the list of objects.
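For example, to keep the object lists of two simultaneous runs separate, you might give the second run its own instance context (the context name here is illustrative):

$ legion_mpi_run -p /mpi/instances/vdelay_run2 -n 2 /mpi/programs/vdelay
$ legion_ls /mpi/instances/vdelay_run2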

To view all instances of the program's class, use legion_list_instances:

$ legion_list_instances -c /mpi/programs/program_name

To destroy the program class's instances, use the legion_destroy_instances command. Be aware, though, that this command destroys all of the class's instances.

Note that MPI programs cannot be deactivated; they simply die.

10.6 Example

A sample MPI program, written in C (mpitest_c.c) and Fortran (mpitest.f), is included in a new Legion system. Go to the directory containing the program's binary files:

$ cd $LEGION/src/ServiceObjects/MPI/examples

You must then register whichever version you wish to use with the legion_mpi_register command. The sample output below uses mpitest_c and shows Legion creating three new contexts (mpi, programs, and instances, the last two being sub-contexts of mpi), creating and tagging a task class, and registering the implementation of the mpitest_c program.

$ legion_mpi_register mpitest_c \
  $LEGION/bin/$LEGION_ARCH/mpitest_c linux
"/mpi" does not exist - creating it
"/mpi/programs" does not exist - creating it
"/mpi/instances" does not exist - creating it
Task class "/mpi/programs/mpitest_c" does not exist -
	creating it
"/mpi/instances/mpitest_c" does not exist - creating it
Set the class tag to: mpitest_c
Registering implementation of mpitest_c
$

In order to view the output of the program, you will need to create and set a tty object if you have not already done so (see page 69).

$ legion_tty /context_path/mytty

Having done all of this, you can now run the program with the legion_mpi_run command. You must specify how many processes the program should run with the -n flag.

If you do not specify which hosts the program should run its processes on, Legion will arbitrarily choose the hosts from the host objects that are in your system. Here it runs three processes on three different hosts.

$ legion_mpi_run -n 3 /mpi/programs/mpitest_c
Hello from node 0 of 3 hostname Host2
Hello from node 1 of 3 hostname Host3
Node 0 has a secret, and it's 42
Node 1 thinks the secret is 42
Hello from node 2 of 3 hostname BootstrapHost
Node 2 thinks the secret is 42
Node 0 thinks the secret is 42
Node 0 exits barrier and exits.
Node 1 exits barrier and exits.
Node 2 exits barrier and exits.
$

Another time it might run the three processes on two hosts.

$ legion_mpi_run -n 3 /mpi/programs/mpitest_c
Hello from node 1 of 3 hostname BootstrapHost
Hello from node 0 of 3 hostname Host3
Node 0 has a secret, and it's 42
Node 1 thinks the secret is 42
Hello from node 2 of 3 hostname BootstrapHost
Node 2 thinks the secret is 42
Node 0 thinks the secret is 42
Node 0 exits barrier and exits.
Node 1 exits barrier and exits.
Node 2 exits barrier and exits.
$

10.7 Accessing files in programs using MPI

Normally, when you run MPI, you run on a single machine and all of your input files are stored locally. However, when you run an MPI program in Legion, your processes may run on remote hosts and will need to transparently access files for input or output. Legion has three subroutines that allow your program to read and write files in Legion context space: LIOF_LEGION_TO_TEMPFILE, LIOF_CREATE_TEMPFILE, and LIOF_TEMPFILE_TO_LEGION.

The first, LIOF_LEGION_TO_TEMPFILE, lets you copy a file from Legion context space into a local file. For example,

call LIOF_LEGION_TO_TEMPFILE ('input', INPUT, ierr)
open (10, file = INPUT, status = 'old')

will copy a file with the Legion filename input into a local file and store the name of that local file in the variable INPUT. The second line opens the local copy of the file.

The other two subroutines can be used to create and write to a local file and then copy it back into Legion context space. For example:

call LIOF_CREATE_TEMPFILE (OUTPUT, IERR)
open (11, file = OUTPUT, status = 'new')
call LIOF_TEMPFILE_TO_LEGION (OUTPUT, 'output', IERR)

The first line creates a local file and stores the file's name in the variable OUTPUT. The second line opens the local copy of the file. The third line copies the local file OUTPUT into a Legion file with the name output.

While this approach allows you to run MPI programs and transparently read and write files remotely, it does have one limitation: it does not support heterogeneous conversions of data. If you run this program on several machines which have different formats for an integer, such as Intel PC's (little-endian) and IBM RS/6000's (big-endian), the result of using unformatted I/O will be surprising. If you want to use such a heterogeneous system, you will have to either use formatted I/O (all files are text) or use the "typed binary" I/O calls instead of Fortran READ and WRITE statements. These "typed binary" I/O calls are discussed in section 6.2.4 in the Legion Developer Manual.

10.8 Scheduling MPI processes

As noted above, Legion may not choose the same number of host objects as the number of processes specified with the -n parameter. If you specify three processes, Legion may run them on one, two, or three hosts.

If you wish to specify which hosts are used to run your processes, use the -h flag. In the example below, we will run mpitest_c on four hosts, which we assume are already part of the system (see Host and vault objects for information about adding hosts to a Legion system) and which, in this example, are named in a single context called mpi-hosts.

First, you must create the mpi-hosts context with the legion_context_create command:

$ legion_context_create mpi-hosts
Creating context "mpi-hosts" in parent ".".
New context LOID = "1.01.05.44b53908.000001fc0940fbf7

We can now assign new context names to the four hosts that we wish to use for this program, using the legion_ln command. These new names map to the host objects' LOIDs, and provide a convenient way to specify different hosts for different MPI programs or to execute the same program on different groups of hosts.

$ legion_ln /hosts/Host1 /mpi-hosts/mpitest_Host1
$ legion_ln /hosts/Host2 /mpi-hosts/mpitest_Host2
$ legion_ln /hosts/Host3 /mpi-hosts/mpitest_Host3
$ legion_ln /hosts/Host4 /mpi-hosts/mpitest_Host4

The program can now be run with legion_mpi_run, using the -h flag to specify that the four processes should be run on the hosts in mpi-hosts. Note that the nodes are placed on the hosts in order.

$ legion_mpi_run -h /mpi-hosts -n 4 /mpi/programs/mpitest_c
Hello from node 0 of 4 hostname Host1.xxx.xxx
Hello from node 1 of 4 hostname Host2.xxx.xxx
Node 0 has a secret, and it's 42
Hello from node 2 of 4 hostname Host3.xxx.xxx
Node 1 thinks the secret is 42
Node 2 thinks the secret is 42
Hello from node 3 of 4 hostname Host4.xxx.xxx
Node 3 thinks the secret is 42
Node 0 thinks the secret is 42
Node 0 exits barrier and exits.
Node 1 exits barrier and exits.
Node 2 exits barrier and exits.
Node 3 exits barrier and exits.
$

Note also that the -n parameter is still used, even though there are only four host objects named in the mpi-hosts context. The number of processes and the number of host objects do not have to match: if you wanted to run only two processes on the hosts named in mpi-hosts you could use -n 2. The program would then run on the first two hosts listed in the mpi-hosts context.

$ legion_mpi_run -h /mpi-hosts -n 2 /mpi/programs/mpitest_c
Hello from node 0 of 2 hostname Host1.xxx.xxx
Node 0 has a secret, and it's 42
Hello from node 1 of 2 hostname Host2.xxx.xxx
Node 1 thinks the secret is 42
Node 0 thinks the secret is 42
Node 0 exits barrier and exits.
Node 1 exits barrier and exits.
$

10.9 Debugging support

A utility program, legion_mpi_debug, allows the user to examine the state of MPI objects and print out their message queues. This is a handy way to debug deadlock. Usage is:

legion_mpi_debug [-q] 
	{-c <instance context path>}

The -q flag will list the contents of the queues. For example:

$ legion_mpi_debug -q -c /mpi/instances/vdelay
Process 0 state: spinning in MPI_Recv source 1 tag 0 comm?
Process 0 queue:

(queue empty)

Process 1 state: spinning in MPI_Recv source 0 tag 0 comm?
Process 1 queue:

MPI_message source 0 tag 0 len 8
$

Do not be alarmed that process 1 hasn't picked up the matching message in its queue: it will be picked up after the command has finished running.

There are a few limitations in legion_mpi_debug: if an MPI object doesn't respond, the tool will hang and won't go on to query additional objects. An MPI object can only respond when it enters the MPI library; if it is in an endless computational loop, it will never reply.

Output from a Legion MPI object to standard output or Fortran unit 6 is automatically directed to a tty object (see page 69). Note, though, that this output will be mixed with other output in your tty object, so you may wish to segregate MPI output. Legion MPI supports a special set of library routines for printing on the Legion tty (created with the legion_tty command); these cause the MPI object's output to be printed in segregated blocks.

LMPI_Console_Output (string)
Buffers a string to be output later.

LMPI_Console_Output_Flush ()
Flushes all buffered strings to the tty.

LMPI_Console_Output_And_Flush (string)
Buffers and flushes in one call.

The Fortran version of these routines adds a carriage return at the end of each string, and omits trailing spaces. The C version does not.
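As an illustration, here is a minimal C sketch of how these routines might be used. The extern prototypes and the function and variable names are assumptions for the sketch; use the declarations provided by your Legion MPI headers.

#include <stdio.h>

/* Sketch only: prototypes are assumed, not taken from the Legion headers. */
extern void LMPI_Console_Output(char *string);
extern void LMPI_Console_Output_Flush(void);

void report_progress(int rank, int iteration)
{
        char line[128];

        /* Buffer two lines, then flush them as one segregated block on
           the Legion tty. The C routines do not append a newline, so we
           add one ourselves. */
        sprintf(line, "node %d: starting iteration %d\n", rank, iteration);
        LMPI_Console_Output(line);

        sprintf(line, "node %d: iteration %d complete\n", rank, iteration);
        LMPI_Console_Output(line);

        LMPI_Console_Output_Flush();
}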

10.10 Checkpointing support

As you increase the number of objects or the length of time for which an MPI application runs, you increase the probability that the application will crash due to a host or network failure.

To tolerate such failures, we provide a checkpointing library for SPMD-style (Single Program Multiple Data) applications. SPMD-style applications are characterized by a regular communication pattern. Typically, each task is responsible for a region of data and periodically exchanges boundary information with neighboring tasks.

We exploit the regularity of SPMD applications to provide an easy-to-use checkpointing library. Users are responsible for (1) deciding when to take a checkpoint and (2) writing functions to save and restore checkpoint state. The library provides checkpoint management support, i.e. fault detection, checkpoint storage management, and recovery of applications. Furthermore, users are responsible for taking checkpoints at a consistent point in the program. In the general case, this requirement would overwhelm most programmers. However, in SPMD applications, there are natural places to take checkpoints, i.e. at the top or bottom of the main loop [2].

10.10.1 Example

To use the checkpoint library, users are responsible for the following tasks: (1) recovery of checkpoint data, (2) saving the checkpoint data, and (3) deciding how often to checkpoint.

The following example is included in the standard Legion distribution ($LEGION/src/ServiceObjects/MPI/ft_examples) and consists of tasks that perform some work and exchange boundary information at each iteration.

Example: Sample uses of the checkpoint library

//
// Save state. State consists of the iteration count.
//
do_checkpoint(int iteration) {
        // pack state into a buffer
        MPI_Pack(&iteration, 1, MPI_INT, user_buffer,
                 1000, &position, MPI_COMM_WORLD);

        // save buffer into checkpoint
        MPI_FT_Save(user_buffer, 20);

        // done with the checkpoint
        MPI_FT_Save_Done();
}

//
// State consists of the iteration count
//
do_recovery(int rank, int &start_iteration) {
        // Restore buffer from checkpoint
        size = MPI_FT_Restore(user_buffer, 1000);

        // Extract state from buffer
        MPI_Unpack(user_buffer, 64, &position, &start_iteration,
                   1, MPI_INT, MPI_COMM_WORLD);
}

//
// C pseudo-code
// Simple SPMD example
//
// mpi_stencil_demo.c
//
main(int argc, char **argv) {
        // declaration of variables omitted...

        // MPI Initialization
        MPI_Init (&argc, &argv);
        MPI_Comm_rank (MPI_COMM_WORLD, &nodenum);
        MPI_Comm_size (MPI_COMM_WORLD, &nprocs);

        // Determine if we are in recovery mode
        recovery = MPI_FT_Init(nodenum, start_iteration);
        if (recovery)
            do_recovery(nodenum, start_iteration);
        else
            start_iteration = 0;

        for (iteration = start_iteration;
             iteration < NUM_ITERATIONS; ++iteration) {

                exchange_boundary();

                // ...do some work here...

                // take a checkpoint every 10th iteration
                if (iteration % 10 == 0)
                        do_checkpoint(iteration);
        }
}

The first function called is MPI_FT_Init(), which initializes the checkpointing library and returns TRUE if the object is in recovery mode. If in recovery mode, we restore the last consistent checkpoint. Otherwise, we proceed normally and periodically take a checkpoint.

In do_checkpoint() we save state by packing variables into a buffer using MPI_Pack() and then calling MPI_FT_Save(). To save more than one data item, we can pack several items into a buffer (by repeatedly calling MPI_Pack()) and then call MPI_FT_Save(). When the state to be saved is very large, we can break it down into smaller chunks and call MPI_FT_Save() for each chunk. When all data items for the checkpoint have been saved, we call MPI_FT_Save_Done().

In do_recovery() we recover the state. Note that the sequences of MPI_Unpack() and MPI_FT_Restore() must be mirror images of the sequences of MPI_Pack() and MPI_FT_Save() in do_checkpoint().
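As a further sketch of this mirror-image requirement, suppose the state consists of the iteration count plus a small array of doubles. The buffer size and the variable names BUF_SIZE, user_buffer, local_data, and NLOCAL are illustrative, not part of the distributed example.

        /* Checkpoint: pack the items in a fixed order, then save. */
        position = 0;
        MPI_Pack(&iteration, 1, MPI_INT, user_buffer, BUF_SIZE,
                 &position, MPI_COMM_WORLD);
        MPI_Pack(local_data, NLOCAL, MPI_DOUBLE, user_buffer, BUF_SIZE,
                 &position, MPI_COMM_WORLD);
        MPI_FT_Save(user_buffer, position);
        MPI_FT_Save_Done();

        /* Recovery: restore the buffer, then unpack in exactly the
           same order. */
        size = MPI_FT_Restore(user_buffer, BUF_SIZE);
        position = 0;
        MPI_Unpack(user_buffer, size, &position, &iteration, 1,
                   MPI_INT, MPI_COMM_WORLD);
        MPI_Unpack(user_buffer, size, &position, local_data, NLOCAL,
                   MPI_DOUBLE, MPI_COMM_WORLD);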

10.10.2 API (C & Fortran)

C API:        int MPI_FT_Init(int rank)
Fortran API:  MPI_FT_INIT(RANK, RECOVERY)
              INTEGER RANK, RECOVERY
Description:  Initializes the checkpointing library. Returns TRUE if in
              recovery mode.

C API:        int MPI_FT_ON()
Fortran API:  MPI_FT_ON(FT_IS_ON)
              INTEGER FT_IS_ON
Description:  Returns TRUE if this application is running with
              checkpointing turned on.

C API:        void MPI_FT_Save(char *buffer, int size)
Fortran API:  MPI_FT_SAVE(BUFFER, SIZE, IERR)
              <type> BUFFER(*)
              INTEGER SIZE, IERR
Description:  Saves checkpoint data onto the checkpoint server.

C API:        void MPI_FT_Save_Done()
Fortran API:  MPI_FT_SAVE_DONE(IERR)
              INTEGER IERR
Description:  Marks the current checkpoint as complete.

C API:        int MPI_FT_Restore(char *buffer, int size)
Fortran API:  MPI_FT_RESTORE(BUFFER, SIZE, RET_SIZE)
              <type> BUFFER(*)
              INTEGER SIZE, RET_SIZE
Description:  Restores data from the current checkpoint.

10.10.3 Running the above example

First, run the legion_ft_initialize script on the command line. You should see the following output:

$ legion_ft_initialize
legion_ft_initialize: Could not find a CheckpointStorage object ...
Attempting to register one
Stateful class "/CheckpointStorage" does not exist - creating it.
Registering implementation for class "/CheckpointStorage"
legion_ft_initialize: success initializing FT SPMD Checkpointing support
$

Note that you only need to run this command once.

Compile the application and register the program with Legion:

$ cd $LEGION/src/ServiceObjects/MPI/ft_examples
$ make reg_stencil

Create an instance of the CheckpointStorage server and place it in context space:

$ legion_create_object -c /CheckpointStorage \ 
  /home/username/ss

Run the application and specify a ping interval and reconfiguration time:

$ legion_mpi_run -n 2 -ft -s /home/username/ss -g 200 \
   -r 500 mpi/programs/mpi_stencil_demo

There is a set of special legion_mpi_run flags that can be used when running in fault tolerant mode. These flags are used in specific combinations: you must use the -ft flag in all cases, and you must use either -s or -R. The -g and -r flags are optional.

-ft

Turn on fault tolerance.

-s <checkpoint server>

Specifies the checkpoint server to use.

-g <pingInterval>

Ping interval. Default is 90 seconds.

-r <reconfigurationInterval>

When a failure is detected (i.e., an object has not responded in the last <reconfigurationInterval> seconds), restart the application from the last consistent checkpoint. The default value is 360 seconds.

-R <checkpoint server>

Recovery mode. Restarts the application from the last consistent checkpoint stored on the specified checkpoint server.

10.10.4 Recovering from failure

If an object fails (i.e., it has not responded in the last <reconfigurationInterval> seconds), the application automatically restarts itself.

10.10.5 Restarting application

Under catastrophic circumstances (e.g., prolonged network outages), the recovery protocol described above may not work. In this case users can restart (i.e., rerun) the entire application with the -R flag. Continuing the above example:

$ legion_mpi_run -n 2 -ft -R /home/username/ss -g 200 \ 
   -r 500 mpi/programs/mpi_stencil_demo

The application will restart from the last consistent checkpoint. Note that the number of processes and the checkpoint server name must match those of the original run.

Once the application has successfully completed you should delete the Checkpoint Server by typing:

$ legion_rm -f /home/username/ss

If you would like to test the restart mechanism, you can simulate a failure and then restart the application. Kill the application by typing ^C (Control-C).

10.10.6 Compiling/makefile

To compile C applications with the checkpointing library, link your application with the -lmpift and -lLegionFT libraries. For Fortran applications, also link with the -lmpiftf library.

A sample makefile and application is provided in the standard Legion distribution package, in the $LEGION/src/ServiceObjects/MPI/ft_examples directory.
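As a rough sketch only (the actual makefile ships in the directory above), a link rule in the style of Figure 9 might look like the following; it assumes legion_link passes extra -l flags through to the underlying linker, and the target name is illustrative.

# Hypothetical rule; see $LEGION/src/ServiceObjects/MPI/ft_examples
# for the real makefile.
mpi_stencil_demo: mpi_stencil_demo.o
        legion_link -mpi mpi_stencil_demo.o -o mpi_stencil_demo \
            -lmpift -lLegionFT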

10.10.7 Another example

For a more complicated example, please look at the application in $LEGION/src/Applications/ocean. This Fortran code contains the skeleton for an ocean modeling program.

10.10.8 Limitations

We use pings and timeout values to determine whether an object is alive or dead. If the timeout values are set too low, we may falsely declare an object dead. We recommend setting the ping interval and reconfiguration time to conservatively high values.

We do not handle file recovery. When running applications that write data to files, users are responsible for recovering the file pointer.

10.11 Functions supported

Please note that data packed/unpacked using the MPI_Pack and MPI_Unpack functions will not be properly converted in heterogeneous networks.
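If your application must exchange data between hosts with different byte orders, one workaround is to send the values with their MPI datatypes rather than as packed bytes, so that the heterogeneous conversion described in section 10.0 can marshal them. A minimal sketch, with illustrative variable names:

        /* Typed sends/receives instead of MPI_Pack/MPI_Unpack, so the
           data can be converted between architectures. */
        int    niter     = 100;
        double bounds[2] = { 0.0, 1.0 };
        MPI_Status status;

        if (rank == 0) {
            MPI_Send(&niter, 1, MPI_INT,    1, 0, MPI_COMM_WORLD);
            MPI_Send(bounds, 2, MPI_DOUBLE, 1, 1, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&niter, 1, MPI_INT,    0, 0, MPI_COMM_WORLD, &status);
            MPI_Recv(bounds, 2, MPI_DOUBLE, 0, 1, MPI_COMM_WORLD, &status);
        }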

All MPI 1.1 functions are currently supported in Legion MPI. Some MPI 2.0 functions are also supported; please contact us for details.