$ . ~legion/setup.sh
or
$ source ~legion/setup.csh
The following style conventions are used in these tutorials:
Legion's remote execution allows you to run programs on remote hosts from the Legion command line. Your workload can be spread out over several processors, improving job turn-around time and performance. You can execute a single instance of a program, multiple instances of a single program, or several different programs in parallel.
What are the prerequisites for running a program in Legion?
legion_register_runnable <program class> <executable path> <architecture>
The currently available values for <architecture> are:
|
The <program class> parameter is a context path name for the Legion objects that will handle a particular program. Choose a context path that best suits you, though we suggest that the path should contain the program's name, so as to make it easier to remember. For example, to register a program called "Doris" you might use the class name "doris":
$ legion_register_runnable doris /bin/programs/Doris linux
Program class "doris" does not exist.
Creating class "doris".
Registering implementation for class "doris"
$
$ legion_register_runnable doris /bin/programs/Doris solaris
Registering implementation for class "doris"
$
The output of the second command shows Legion registering a new implementation for the program class that the first command created, so there's no need to create another class. When we run Doris, we'll be able to run it on linux or solaris. If you reregister a program several times using the same program class and architecture, the most recently registered version will be used.
Please note that when Legion registers a program, it makes a copy of the binary executable and uses that copy when you execute the program with legion_run or legion_run_multi. If you change the program, you'll need to reregister it.
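Registering one program for several architectures can be scripted. The following is a minimal sketch, not part of Legion: the helper name print_registrations is hypothetical, and it only prints the legion_register_runnable commands (a dry run) so that it can be tried on a machine without Legion.

```shell
#!/bin/sh
# Dry run: print one registration command per architecture.
# print_registrations is a hypothetical helper, not a Legion command.
# Usage: print_registrations <program class> <executable path> <arch>...
print_registrations() {
    program_class=$1
    executable=$2
    shift 2
    for arch in "$@"; do
        # Remove "echo" to actually run the command on a Legion system.
        echo legion_register_runnable "$program_class" "$executable" "$arch"
    done
}

# The class name, path, and architectures come from the example above.
print_registrations doris /bin/programs/Doris linux solaris
```

Because Legion copies the binary at registration time, a script like this would need to be rerun whenever the program changes.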
Independent programs
Independent programs are registered with the legion_register_program command. The syntax is:
legion_register_program <program class> <executable path> <legion arch>
The currently available values for <legion arch> are the same architecture values listed above.
The <program class> parameter is a context path name for the class that will manage Legion objects related to a particular program. An example of this command might look like this:
$ legion_register_program doris /bin/programs/Doris linux
Program class "doris" does not exist.
Creating class "doris".
Registering implementation for class "doris"
$
$ legion_register_program doris /bin/programs/Doris solaris
Registering implementation for class "doris"
$
The output of the second command shows Legion registering a new implementation for the program class that the first command created, so there's no need to create another class. When we run Doris, we'll be able to run it on linux or solaris. If you reregister a program several times using the same program class and architecture, the most recently registered version will be used.
Please note that when Legion registers a program, it makes a copy of the binary executable and uses that copy when you execute the program with legion_run or legion_run_multi. If you change the program, you'll need to reregister it.
|
You can use these flags with legion_run, when you first start the job, or with legion_probe_run, after the job has started.
Note that you can ask for multiple input and output files when you execute a program by repeating the -in/-out and -IN/-OUT options.
Running a single instance remotely
There are many options associated with the legion_run command, but we will not consider all of them here. In this section of the tutorial, we will look at the steps in two different runs of the fictional program Doris. We will assume that this program was previously registered with the program class name doris and linux architecture and is now ready to run.
You can see the Reference Manual for an explanation of the options and "Executing programs remotely" in the Basic User Manual for more information on using this command.
In both cases, our first decision is whether to run the program in blocking or nonblocking mode.
Let's assume that Doris only takes a few minutes to run, so first we want to run the command in its default (blocking) mode. To be even-handed, though, we'll then run it in nonblocking mode. Let us further suppose that Doris needs two input files, "input1" and "input2", and produces two output files "output1" and "output2".
Before we start, we want to be sure that we have a tty object set for the current shell:
$ legion_tty /tty_objects/mytty
Finally, we want to decide whether or not to start a probe file.
We don't have to start one, but since we want to check Doris's status as it is running, we will use the -p option to start a probe file called "dorisProbe".
Blocking mode
When you run a program in blocking mode, the legion_run command will continue to run on your command line, blocking any new input until the remote job has finished. Once the job is done, all output files are collected, the job's working directory on the remote host is cleaned up and deleted, and the command exits. Since our program, Doris, doesn't take long to run, the blocking won't be much of an issue.
$ legion_run -IN input1 -IN input2 -OUT output1 -OUT output2 -p dorisProbe doris
We used the -IN flag to tell Legion to copy the input files from our local host to the remote host before the job started, although you can pass these names along after you start a job.
Note that we didn't need to tell Legion to run in blocking mode, since that's the default setting.
Now that the job has started, we go to another command line and use legion_probe_run's -hostname and -statjob flags to check on its status and see what host it is running on (being sure to set the tty object in that shell first).
$ legion_probe_run -hostname -statjob -p dorisProbe doris
gander.cs.virginia.edu
Running
The output shows that the job was placed on gander.cs.virginia.edu and that it is still running. If we had wished to place the job on a specific host, we could have used legion_run's -h flag. We also could have used its -d flag to place it in a specific working directory on the remote host.
If you need to kill the job midway through, you can use legion_probe_run's -kill flag. Note, though, that legion_run will continue to block at your command line so you will need to kill it by hand.
Once the program has finished, legion_run copies the -OUT files to our local host, deletes the probe files, and cleans up the remote host.
Nonblocking mode
When you run a program in nonblocking mode, the legion_run command starts the job on the remote host, waits ten seconds, verifies that the job can start, and then exits. When the job finishes, it copies any files marked with the -out flag into context space but ignores any files marked with the -OUT flag. The remote host can hold on to the job's working directory for six hours, but if you do not take steps to pick up the remaining output files and clean up the remote host during that period, the remote host will tar and compress the job's working directory and put it in your context scratch space.
$ legion_run -nonblock -IN input1 -IN input2 -OUT output1 -OUT output2 -p dorisProbe doris
We used the -nonblock flag to start the job in nonblocking mode and we started a probe file. We strongly suggest that you always use a probe file with nonblocking runs. As above, we used the -IN flag to tell Legion to copy the input files from our local host to the remote host before the job started.
We can use the legion_probe_run command to check on its status and see what host it is running on. We'll also use the -pwd flag to get the name of the job's current working directory.
$ legion_probe_run -hostname -pwd -statjob -p dorisProbe doris
gander.cs.virginia.edu
/myDir/OPR/BootstrapVaultOPR/ReservedOPR-1.80
Running
The output shows that the job was placed on gander.cs.virginia.edu and that it is still running. The -statjob flag is especially useful in nonblocking runs, since you have no other way to find out if the job has finished or hit an error.
Once the program has finished, the remote host will hold on to the /myDir/OPR/BootstrapVaultOPR/ReservedOPR-1.80 directory for six hours. Even though we told legion_run to pick up the two output files, the command has exited by now.* If we do nothing, gander.cs.virginia.edu will tar up the directory and put it in our context scratch space. We will then have to retrieve it, untar it, and get the output files.
To avoid this hassle, use legion_probe_run's -OUT and -kill options. First, we pick up the output files:
$ legion_probe_run -OUT output1 -OUT output2 -p dorisProbe doris
Then we clean up the remote directory:
$ legion_probe_run -kill -p dorisProbe doris
A word of caution about the -kill flag: if the job is still running, it will terminate the job (hence the name). You will lose all data from the job, including all output files.
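The check, pick-up, and clean-up steps for a nonblocking run can be combined into one helper. This is a minimal sketch under stated assumptions: collect_when_done is a hypothetical function name, the 60-second polling interval is arbitrary, and we assume that -statjob prints "Running" while the job is active (as in the output above) and treat any other status as "no longer running". It does not tell a finished job from a failed one, and a failed status check also ends the wait, so check the status by hand before relying on it.

```shell
#!/bin/sh
# Sketch: wait for a nonblocking job, then collect output and clean up.
# collect_when_done is a hypothetical helper, not a Legion command.
# Assumption: "-statjob" prints "Running" while the job is active;
# any other status is treated as "no longer running".
collect_when_done() {
    probe=$1
    class=$2
    interval=${3:-60}   # seconds between checks (arbitrary default)

    while legion_probe_run -statjob -p "$probe" "$class" | grep -q Running; do
        sleep "$interval"
    done

    # Pick up the output files first; only clean up the remote directory
    # if the pick-up succeeded, because -kill discards whatever is left.
    legion_probe_run -OUT output1 -OUT output2 -p "$probe" "$class" &&
        legion_probe_run -kill -p "$probe" "$class"
}

# Usage, on a system with Legion installed:
# collect_when_done dorisProbe doris 60
```

The order matters: -kill is only issued after the job has stopped reporting "Running" and the -OUT files have been retrieved.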
Option file
If you are trying to keep track of several options when you start your program or are running the program over and over again, you may wish to use an option file. This is a text file that contains a list of the flags and settings you want to use when running your program. You can include any legion_run flags, but you must put the program class name and any arbitrary command-line arguments on the command line. You must also put the -f flag and the option file name on the command line. For example, if we used an option file for the nonblocking Doris run above, it would look like this:
-nonblock
-IN input1
-IN input2
-OUT output1
-OUT output2
-p dorisProbe
In this example we separated the flags by new lines, but you can also use tabs or blank spaces. Doris has no arguments, so we can start it by entering:
$ legion_run -f dorisOptionFile doris
This may be helpful if several people want to run the program at different points, since they can share the option file.
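As a small illustration, the option file for the nonblocking Doris run can be written from a script. This sketch uses only the flags and file name from the example above; the check for legion_run is our own addition so that the script can be tried on a machine without Legion, where it prints the command instead of running it.

```shell
#!/bin/sh
# Write the option file from the example above, one flag per line.
cat > dorisOptionFile <<'EOF'
-nonblock
-IN input1
-IN input2
-OUT output1
-OUT output2
-p dorisProbe
EOF

# Start the run only if Legion is available; otherwise show the command.
if command -v legion_run >/dev/null 2>&1; then
    legion_run -f dorisOptionFile doris
else
    echo "would run: legion_run -f dorisOptionFile doris"
fi
```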
Running multiple instances remotely
The legion_run_multi command is complex but more flexible than legion_run. There is a detailed FAQ about this command, as well as documentation on-line and in the Reference Manual. Rather than repeating information provided elsewhere, this section will go through the steps of starting the fictional program Gladys. We'll assume that you've already registered it with the class name gladys and a Linux architecture.
First of all, we want to be sure that we have a tty object set for the current shell:
$ legion_tty /tty_objects/mytty
Scheduling the jobs
You must tell Legion the maximum number of processes that can run at a time and/or the name of a schedule file. You name the number of processes with the -n flag and the schedule file with the -s flag. In this case we'll just use a schedule file, called GladysSchedule.txt. We want Gladys to run up to a given number of jobs on the hosts listed in this file:
/hosts/Abba 5
/hosts/Dabba 8
/hosts/Doo 3
Please be sure that the hosts listed in your schedule file are all of appropriate architecture: remember that Gladys is registered to run on linux, so we don't want to try to run it on an RS6000.
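The schedule file can be created and sanity-checked from the shell. This is a minimal sketch, assuming the layout shown above (one host context path per line, followed by that host's job limit); the awk check is our own addition, not a Legion feature.

```shell
#!/bin/sh
# Create the schedule file from the example above: one host per line,
# followed by the number of jobs allowed on that host.
cat > GladysSchedule.txt <<'EOF'
/hosts/Abba 5
/hosts/Dabba 8
/hosts/Doo 3
EOF

# Sum the second column to see how many jobs the three hosts allow in total.
total=$(awk '{ sum += $2 } END { print sum }' GladysSchedule.txt)
hosts=$(wc -l < GladysSchedule.txt | tr -d ' ')
echo "hosts: $hosts, total jobs: $total"
```

With the values above the total is 5 + 8 + 3 = 16 jobs across three hosts.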
Tracking jobs, moving files
Every copy of Gladys will expect to find files called "Gladys1", "Gladys2", "Gladys3", and "password" in its local execution environment. Since several copies will be running on three different hosts, the expected input files must be passed to each copy's working directory. Each copy will also produce output files called "GladysDone" and "GladysAlt". If we were running a single instance there would be no problem, but when running multiple instances we have to consider the prospect of sixteen different "GladysDone" files. Furthermore, how is legion_run_multi (which runs in blocking mode, remember) going to know when all of the jobs are finished or if some have failed?
The solution to this is a specification file.
This file controls the distribution and collection of input and output files, constants, and standard input, output, and error files on the different hosts. It also keeps track of each job's status, so that if a job fails it can be moved to another host and restarted. Each line contains three fields: a keyword, a filename, and a pattern. These fields are used to compile a list of jobs to run. For example, our specification file is called GladysSpecification.txt and looks like this:
IN Gladys1 Gladinput*txt
in Gladys2 /myContext/Gladys/Gladnew*.ent
in Gladys3 /myContext/Gladys/Glad*alt.txt
OUT GladysDone Gladoutput*.txt
OUT GladysAlt /myContext/Gladys/Gladdone*.txt
CONSTANT password /crypt/Gladpasswd
Legion compiles a list of input files that fit the pattern field and looks for jobs to be run. If it finds GladnewFoo.ent and GladFooalt.txt in /myContext/Gladys/, that is a potential job that can be called job Foo. It then looks for files that match the output files. If it finds "GladoutputFoo.txt" on the local host and /myContext/Gladys/GladdoneFoo.txt in context space, it knows that job Foo has already been run and that these input files can be ignored. If it doesn't find those two Foo output files, it knows that job Foo needs to be run.
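To make the matching concrete, here is an illustrative sketch of how a job name such as Foo can be derived from a file name and a pattern with a single "*" wildcard. It is not Legion's implementation: job_name is a hypothetical helper, and it only handles bare file names, not context paths.

```shell
#!/bin/sh
# Illustrative only: derive a job name from a file name and a pattern
# containing one "*" wildcard. job_name is a hypothetical helper and is
# not how Legion itself is implemented.
job_name() {
    pattern=$1
    file=$2
    prefix=${pattern%%\**}   # text before the "*"
    suffix=${pattern#*\*}    # text after the "*"
    case $file in
        "$prefix"*"$suffix")
            name=${file#"$prefix"}     # strip the leading fixed text
            name=${name%"$suffix"}     # strip the trailing fixed text
            printf '%s\n' "$name"
            ;;
        *)
            return 1                   # file does not fit the pattern
            ;;
    esac
}

job_name 'Gladnew*.ent' 'GladnewFoo.ent'
job_name 'Glad*alt.txt' 'GladFooalt.txt'
```

Both calls print Foo, which is why the two input files are treated as belonging to the same job.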
If necessary, we can fine-tune the specification file with an exception file.
This file provides extra information for some of your program's jobs. For example, if we wanted to be sure that job Foo runs on the Abba host, we could use an exception file.
Running the jobs
We can now start the program:
$ legion_run_multi -v -s GladysSchedule.txt -f GladysSpecification.txt gladys
We suggest that you always use the -v flag so that you can see what is happening, especially if the program will take more than a few seconds to run.
We are starting multiple instances of legion_run. Each copy of legion_run starts one of the Gladys jobs: Foo, Bar, and Beowulf. Each copy also runs in nonblocking mode and automatically creates a probe file for each job. These files are called "legion_probe_run_<job name>" and are in the ./.legion_run_multi_<random number>_ directory. In our example, the probe files would be called:
legion_probe_run_Foo
legion_probe_run_Bar
legion_probe_run_Beowulf
You can check any of these jobs from the command line by running legion_probe_run.
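Checking several jobs by hand gets tedious, so the status commands can be generated from the probe files. This is a sketch under assumptions: list_probe_commands is a hypothetical helper, and we assume that legion_probe_run's -p flag accepts the path to a probe file inside the ./.legion_run_multi_<random number>_ directory. It only prints the commands, so nothing is run until you choose to.

```shell
#!/bin/sh
# Sketch: print a status command for each probe file left by legion_run_multi.
# list_probe_commands is a hypothetical helper, not a Legion command.
# Assumption: -p accepts the path to a probe file in the
# ./.legion_run_multi_<random number>_ directory.
list_probe_commands() {
    class=$1
    for probe in ./.legion_run_multi_*_/legion_probe_run_*; do
        [ -e "$probe" ] || continue          # no probe files found
        job=${probe##*/legion_probe_run_}    # e.g. Foo, Bar, Beowulf
        echo "# job $job"
        echo "legion_probe_run -statjob -p $probe $class"
    done
}

list_probe_commands gladys
```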
-lLegionRun -lLegion1 -lLegion2
* The exception to this rule occurs if your program runs in less than ten seconds. Since legion_run always blocks for ten seconds to be sure that the remote job can start, if your job finishes in that period the -OUT files will be picked up.
Last modified: Tue Sep 5 17:23:55 2000
legion@Virginia.edu