GAMS [ Home | Downloads | Documentation | Solvers | APIs | Tools | Model Libraries | Resources | Sales | Support | Contact Us | Search ]

The Grid and Multi-Threading Solve Facility

Introduction

As systems with multiple CPUs and High Performance Computing Grids are becoming available more widely, the GAMS language has been extended to take advantage of these new environments. New language features facilitate the management of asynchronous submission and collection of model solution tasks in a platform independent fashion. A simple architecture, relying on existing operating system functionality allows for rapid introduction of new environments and provides for an open research architecture.

A typical application uses a coarse grain approach involving hundreds or thousands of model solutions tasks which can be carried out in parallel. For example:

  • Scenario Analysis
  • Monte Carlo Simulations
  • Lagrangian Relaxation
  • Decomposition Algorithms
  • Advanced Solution Approaches

The grid features work on all GAMS platforms and have been tailored to many different environments, like the Condor Resource Manager, a system for high throughput computing from the University of Wisconsin or the Sun Grid Engine. Researchers using Condor reported a delivery of 5000 CPU hours in 20 hours wall clock time.

Disclaimer. The use of the term grid computing may be offensive to some purists in the computer science world. We use it very loosely to refer to a collection of computing components that allow us to provide high throughput to certain applications. One may also think of it as a resurrection of the commercial service bureau concept of some 30 years ago.

Caution. Although these features have been tested on all platforms and are part of our standard release we may change the approach and introduce alternative mechanisms in the future.

Acknowledgments. Prof. Monique Guignard-Spielberg from the Wharton School at U Penn introduced us to parallel Lagrangian Relaxation on the SUN Grid Environment. Prof. Michael Ferris from the University of Wisconsin at Madison adopted our original GAMS grid approach to the high throughput system Condor and helped to make this approach a practical proposition.

Basic Concepts

The grid facility separates the solution into several steps which then can be controlled separately. We will first review what happens during the synchronous solution step and then introduce the asynchronous or parallel solution steps.

When GAMS encounters a solve statement during execution it proceeds in three basic steps:

  1. Generation. The symbolic equations of the model are used to instantiate the model using the current state of the GAMS data base. This instance contains all information and services needed by a solution method to attempt a solution. This representation is independent of the solution subsystem and computing platform.
  2. Solution. The model instance is handed over to a solution subsystem and GAMS will wait until the solver subsystem terminates.
  3. Update. The detailed solution and statistics are used to update the GAMS data base.

In most cases, the time taken to generate the model and update the data base with the solution will be much smaller than the actual time spent in a specific solution subsystem. The model generation takes a few seconds, whereas the time to obtain an optimal solution may take a few minutes to several hours. If sequential model solutions do not depend on each other, we can solve in parallel and update the data base in random order. All we need is a facility to generate models, submit them for solution and continue. At a convenient point in our GAMS program we will then look for the completed solution and update the data base accordingly. We will term this first phase the submission loop and the subsequent phase the collection loop:

Submission Loop.

In this phase we will generate and submit models for solutions that can be solved independently.

Collection Loop.

The solutions of the previously submitted models are collected as soon a solution is available. It may be necessary to wait for some solutions to complete by putting the GAMS program to 'sleep'.

Note that we have assumed that there will be no errors in any of those steps. This, of course, will not always be the case and elaborate mechanisms are in place to make the operation fail-safe.

A First Example

The model [QMEANVAR] from the GAMS Model Library will be used to illustrate the use of the basic grid facility. This model traces an efficiency frontier for restructuring an investment portfolio. Each point on the frontier requires the solution of independent quadratic mixed integer models. The original solution loop is shown below:

   Loop(p(pp),
      ret.fx = rmin + (rmax-rmin)/(card(pp)+1)*ord(pp) ;
      Solve minvar min var using miqcp ;
      xres(i,p)         = x.l(i);
      report(p,i,'inc') = xi.l(i);
      report(p,i,'dec') = xd.l(i) );

This loop will save the solutions to the model MINVAR for different returns RET. Since the solutions do not depend on the order in which they are carried out, we can rewrite this loop to operate in parallel. The first step is to write the submit loop:

   parameter h(pp) model handles;
   minvar.solvelink=3;
   Loop(p(pp),
      ret.fx = rmin + (rmax-rmin)/(card(pp)+1)*ord(pp) ;
      Solve minvar min var using miqcp;
      h(pp) = minvar.handle );

The model attribute .solvelink controls the behavior of the solve statement. A value of '3' tells GAMSto generate and submit the model for solution and continue without waiting for the completion of the solution step. There is a new model attribute .handle which provides a unique identification of the submitted solution request. We need to store those handle values, in this case in the parameter h, to be used later to collect the solutions once completed. This is then done with a collection loop:

   loop(pp$handlecollect(h(pp)),
      xres(i,pp)         = x.l(i);
      report(pp,i,'inc') = xi.l(i);
      report(pp,i,'dec') = xd.l(i) );

The function handlecollect interrogates the solution process. If the solution process has been completed the results will be retrieved and the function returns a value of 1. If the solution is not ready to be retrieved the value zero will be returned.

The above collection loop has one big flaw. If a solution was not ready it will not be retrieved. We need to call this loop several times until all solutions have been retrieved or we get tired of it and quit. We will use a repeat-until construct and the handle parameter h to control the loop to look only for the not yet loaded solutions as shown below:

Repeat
   loop(pp$handlecollect(h(pp)),
         xres(i,pp)         = x.l(i);
         report(pp,i,'inc') = xi.l(i);
         report(pp,i,'dec') = xd.l(i);
         display$handledelete(h(pp)) 'trouble deleting handles' ;
         h(pp) = 0 ) ;
   display$sleep(card(h)*0.2) 'sleep some time';
until card(h) = 0 or timeelapsed > 100;
xres(i,pp)$h(pp) = na;

Once we have extracted a solution we will set the handle parameter to zero. In addition, we want to remove the instance from the system by calling the function handledelete which returns zero if successful (see definition). No harm is done if it fails but we want to be notified via the conditional display statement. Before running the collection loop again, we may want to wait a while to give the system time to complete more solution steps. This is done with the sleep command that sleeps some time. The final wrinkle is to terminate after 100 seconds elapsed time, even if we did not get all solutions. This is important, because if one of the solution steps fails our program would never terminate. The last statement sets the results of the missed solves to NA to signal the failed solve. The parameter h will now contain the handles of the failed solvers for later analysis.

Alternatively, we could have used the function handlestatus and collect the solution which is stored in a GDX file. For example we could write:

loop(pp$(handlestatus(h(pp))=2),
   minvar.handle = h(pp);
   execute_loadhandle minvar;
   xres(i,pp)         = x.l(i);
   report(pp,i,'inc') = xi.l(i);
   report(pp,i,'dec') = xd.l(i) );

The function handlestatus interrogates the solution process and returns '2' if the solution process has been completed and the results can be retrieved. The solution is stored in a GDX file which can be loaded in a way similar to other gdx solution points. First we need to tell GAMS what solution to retrieve by setting the minvar.handle to the appropriate value. Then we can use execute_loadhandle to load the solution for model minvar back into the GAMS data base. Using handlestatus and loadhandle instead of the simpler handlecollect adds one more layer of control to the final collection loop. We now need one additional if statement inside the above collection loop:

Repeat
   loop(pp$h(pp),
      if(handlestatus(h(pp))=2,
         minvar.handle = h(pp);
         execute_loadhandle minvar;
         xres(i,pp)         = x.l(i);
         report(pp,i,'inc') = xi.l(i);
         report(pp,i,'dec') = xd.l(i);
         display$handledelete(h(pp)) 'trouble deleting handles' ;
         h(pp) = 0 ) ) ;
   display$sleep(card(h)*0.2) 'sleep some time';
until card(h) = 0 or timeelapsed > 100;
xres(i,pp)$h(pp) = na;

Now we are ready to run the modified model. The execution log will contain some new information that may be useful on more advanced applications:

     --- LOOPS pp = p1
     ---   46 rows 37 columns 119 non-zeroes
     ---   311 nl-code 7 nl-non-zeroes
     ---   14 discrete-columns
     --- Submitting model minvar with handle grid137000002
     --- Executing after solve
     ...
     --- GDXin=C:\answerv5\gams_srcdev\225j\grid137000003\gmsgrid.gdx
     --- Removing handle grid137000003

The log will now contain some additional information about the submission, retrieval and removal of the solution instance. In the following sections we will make use of this additional information. You can find a complete example of a grid enabled transport model in the GAMS Model Library.

At a final note, we have made no assumptions about what kind of solvers and what kind of computing environment we will operate. The above example is completely platform and solver independent and it runs on your Windows laptop or on a massive grid network like the Condor system without any changes in the GAMS source code.

Advanced Use of Grid Features

In this section we will describe a few special application requirements and show how this can be handled with the current system. Some of those applications may involve thousands of model instances with solution times of many hours each. Some may fail and require resubmission. More complex examples require communication and the use of GAMS facilities like the BCH (Branch & Cut & Heuristic) which submit other models from within a running solver.

Very Long Job Durations

Imagine a situation with thousands of model instances each taking between minutes and many hours to solve. We will break the master program into a submitting program, an inquire program and a final collection program. We will again use the previous example to demonstrate the principle. We will split the GAMS code of the modified [QMEANVAR] GAMS code into three components: qsubmit, qcheck and qreport.

The file qsubmit.gms file will include everything up to and including the new submit loop. To save the instances we will need a unique Grid Directory and to restart the problem we will have to create a save file. The first job will then look a follows.

> gams qsubmit s=submit gdir=c:\test\grid

The solution of all the model instances may take hours. From time to time I can then run a quick inquiry job to learn about the stats. The following program qcheck.gms will list the current status:

   parameter status(pp,*); scalar handle;
   acronym BadHandle,Waiting,Ready;
   loop(pp,
      handle := handlestatus(h(pp));
      if(handle=0,
         handle := BadHandle
      elseif handle=2,
         handle := Ready;
         minvar.handle = h(pp);
         execute_loadhandle minvar;
         status(pp,'solvestat') = minvar.solvestat;
         status(pp,'modelstat') = minvar.modelstat;
         status(pp,'seconds') = minvar.resusd;
      else
         handle := Waiting );
      status(pp,'status') = handle );
   display status;

To run the above program we will restart from the previous save file by using the restart or r parameter.

> gams qcheck r=submit gdir=c:\test\grid

The output may then look like:

   ----   173 PARAMETER status

       solvestat   modelstat   seconds     status

   p1      1.000       1.000     0.328      Ready
   p2      1.000       1.000     0.171      Ready
   p3                                     Waiting
   p4                                     Waiting
   p5      1.000       1.000     0.046      Ready

You may want to do some more detailed analysis on one of the solved model instances. Then we may have a qanalyze.gms program that may look like and be called using the double dash option, which sets a GAMS environment variable:

    $if not set instance $abort --instance is missing
    if(not handlestatus(h('%instance%')),
        abort$yes 'model instance %instance% not ready');
    minvar.handle = h('%instance%');
    execute_loadhandle minvar;
    display x.l,xi.l,xd.l;
    . . .

> gams qanalyze r=submit gdir=c:\test\grid --instance=p4

Once all jobs are completed we can continue with the second part which will contain the collection loop, for simplicity without the repeat loop because we would not run the final collection program unless we are satisfied that we got most of what we wanted. Then the qreport.gms file could look like:

   loop(pp$handlestatus(h(pp)),
      minvar.handle = h(pp);
      execute_loadhandle minvar;
      xres(i,pp)         = x.l(i);
      report(pp,i,'inc') = xi.l(i);
      report(pp,i,'dec') = xd.l(i);
      display$handledelete(h(pp)) 'trouble deleting handles' ;
      h(pp) = 0 );
   xres(i,pp)$h(pp) = na;
   . . .

We would restart the above program from the save file that was created by the submitting job like:

> gams qreport r=submit gdir=c:\test\grid

Note that it would not be necessary to run the job out of the same directory we did the initial submission. We don't even have to run the same operating system.

Summary of Grid Features

To facilitate the asynchronous or parallel execution of the solve solution steps we have introduced three new functions, a new model attribute, a new gdx load procedure and a new GAMS option GridDir.

When introducing the multi-threading facility we introduced another function and the GAMS option ThreadsAsync

Grid Handle Functions

HandleCollect(handle) collects (loads) the solution if ready.

0    the model instance was not ready or could not be loaded

>0    the model instance solution has been loaded (When using the Grid facility, this is always 1. When using the Multi-Threading option, this returns the thread ID used.)

Note: HandleCollect ignores the setting of the option SolveOpt, but uses the default value (merge) always.

HandleStatus(handle) returns the status of the solve identified by handle. An execution error is triggered if GAMS cannot retrieve the status of the handle.

0    the model instance is not known to the system

1    the model instance exists but no solution process is complete

2    the solution process has terminated and the solution is ready for retrieval

3    the solution process signaled completion but the solution cannot be retrieved

HandleDelete(handle) returns the status of the deletion of the handle model instance. In case of a nonzero return an execution error is triggered.

0    the model instance has been removed

1 the argument is not a legal handle

2 the model instance is not known to the system

3 the deletion of the model instance encountered errors

HandleSubmit(handle) resubmits a previously created instance for solution. In case of a nonzero return an execution error is triggered.

0    the model instance has been resubmitted for solution

1    the argument is not a legal handle

2    the model instance is not known to the system

3    the completion signal could not be removed

4    the resubmit procedure could not be found

5    the resubmit process could not be started

In addition, GAMS will issue execution errors which will give additional information that may help to identify the source of problems. The property execerror can be used to get and set the number of execution errors.

ReadyCollect(handles[, maxWait]) waits until a model is ready to be collected. Handles must be either a scalar/parameter containing one or more model handles or a model with its handle attribute. MaxWait defines the maximum time to wait. It is set to +INF if it is omitted.

0    (one of) the requested job(s) is ready

1    there is no active job to wait for

2    the handle symbol is empty

3    the argument is not a legal handle

4    user specified time-out (using a SolveLink = 6 handle)

5    user specified time-out (using a SolveLink = 3 handle)

8    unknown error (should not happen)

Grid Model Attributes

mymodel.solvelink specifies the solver linking conventions

0    automatic save/restart, wait for completion, the default

1    start the solution via a shell and wait

2    start the solution via spawn and wait

3    start the solution and continue (separate process space)

4    start the solution and wait (same submission process as 3)

5    start the solution via shared library and wait

6    start the solution and continue (separate thread)

7    start the solution and wait (same submission process as 6)

mymodel.handle specifies the current instance handle

This is used to identify a specific model instance and to provide additional information needed for the process signal management.

mymodel.number specifies the current instance number

Any time a solve is attempted for mymodel, the instance number is incremented by one and the handle is update accordingly. The instance number can be reset by the user which then resyncs the handle.

Grid Solution Retrieval

Execute_loadhandle mymodel;

This will update the GAMS data base with the status and solution for the current instance of mymodel. The underlying mechanism is a gdx file and operates otherwise like the execute_loadpoint procedure. Additional arguments can be used to retrieve information from the gdx solution file.

Grid Directory

The instantiated (generated) models and their corresponding solution are kept in unique directories, reachable from your submitting system. Each GAMS job can have only one Grid Directory. By default, the grid directory is assumed to be the scratch directory. This can be overwritten by using the GAMS parameter GridDir, or short GDir. For example:

> gams myprogram ... GDir=gridpath

If gridpath is not a fully qualified name, the name will be completed using the current directory. If the grid path does not exist, an error will be issued and the GAMSjob will be terminated. A related GAMS parameter is the ScrDir (SDir for short). Recall the following default mechanism:

When a GAMS job starts, a unique process directory is created in the current (job submitting), directory. These directories are named 225a to 225z. When a GAMS job terminates it will remove the process directory at the completion of a GAMS job. Any file that has not been created by the GAMS core system will be flagged.

Using the program gamskeep instead of gams will call another exit script which (the default script) will do nothing and the process directory will not be removed.

If we do not specify a scratch directory, the scratch directory will be the same as the process directory. If we do not specify a grid directory, the grid directory will be the same as the scratch directory.

If there is a danger that some of the model instances may fail or we want to break the GAMS program into several pieces to run as separate jobs, we need to be careful not to remove the model instance we have not completely processed. In such cases, we have to use the GridDir option in order to be able to access previously created model instances.

Architecture and Customization

The current Grid Facility relies on very basic operating system features and does not attempt to offer real and direct job or process control. The file system is used to signal the completion of a submitted task and GAMS has currently no other way to interact with the submitted process directly, like forcing termination or change the priority of a submitted task. This approach has its obvious advantages and disadvantages. There are a number of attempts to use grid computing to provide value added commercial remote computing services, notably is SUN's recent commercial entry. Commercial services require transparent and reliable charge approaches and related accounting and billing features which are still missing or inadequate.

When GAMS executes a solve under solvelink=3 it will perform the following steps:

  1. Create a subdirectory in the GridDir with the name gridnnn. Where nnn stands for the numeric value of the handle. The handle value is the internal symbol ID number x 1e6 + the model instance number. For example, in the [QMEANVAR] example the first grid subdirectory was grid137000002.
  2. Remove the completion signal in case the file already exists. Currently the signal is a file called finished. For example, grid137000002/finished.
  3. Create or replace a gdx file called gmsgrid.gdx which will contain a dummy solution with failed model and solver status. This file will be overwritten by the final step of the solution process and will be read when calling execute_loadhandle.
  4. Place all standard GAMSsolver interface files into the above instance directory.
  5. Execute the submission wrapper called gmsgrid.cmd under Windows or gmsgrid.run under Unix. These submission scripts are usually located in the GAMSsystem directory but are found via the current path if not found in the GAMSsystem directory.

The grid submission script gmsgrid.cmd or gmsgrid.run is called with three arguments needed to make a standard GAMSsolver call:

  1. The solver executable file name
  2. The solver control file name
  3. The solver scratch directory

The submission script then does the final submission to the operating system. This final script will perform the following steps:

   call the solver
   call a utility that will create the final gdx file gmsgrid.gdx
   set the completion signal finished

If we want to use the function handlesubmit() we also have to create the gmsrerun.cmd or gmsrerun.run script which could later be used to resubmit the job.

For example, the default submission script for Windows is shown below:

   @echo off
   : gams grid submission script
   :
   : arg1 solver executable
   :    2 control file
   :    3 scratch directory
   :
   : gmscr_nx.exe processes the solution and produces 'gmsgrid.gdx'
   :
   : note: %3 will be the short name because START cannot
   :       handle spaces and/or "...". We could use the original
   :       and use %~s3 which will strip ".." and makes the name :
   :       short
   : gmsrerun.cmd will resubmit runit.cmd
   echo @echo off                 > %3runit.cmd
   echo %1 %2                    >> %3runit.cmd
   echo gmscr_nx.exe %2   >> %3runit.cmd
   echo echo OK ^> %3finished    >> %3runit.cmd
   echo exit                     >> %3runit.cmd
   echo @start /b %3runit.cmd ^> nul > %3gmsrerun.cmd
   start /b %3runit.cmd > nul
   exit

Grid Submission Testing

The grid submission process can be tested on any GAMS program without having to change the source text. The solvelink=4 option instructs the solve statement to use the grid submission process and then wait until the results are available and then loads the solution into the GAMS data base. The solvelink option can be set via a GAMS command line parameter or via assignment to a the model attribute. Once the model instance has been submitted for solution, GAMS will check if the job has been completed. It will keep checking twice the reslim seconds allocated for this optimization job and report a failure if this limit has been exceed. After successful or failed retrieval of the solution GAMS will remove the grid directory, unless we have used gamskeep or have set the GAMS keep parameter.

Multi-Threading

Note: This feature is in beta status.

As described above, using the Grid facility, each solve is handled in its own process space. When setting the SolveLink option (or model attribute) to 6 instead (compile time constant %solveLink.Async Threads%) a separate thread is used, which allows efficient in-memory communication between GAMS and the solver (like it is done with SolveLink = %solveLink.Load Library% =5). Other than that, the multi-threading facility works in the same way as the Grid facility: After a solve statement, which generates the model and passes it to the solver in a separate thread, one can store a handle of the model instance (using the model attribute mymodel.handle) and use the same functions that are used for the Grid Facility to collect the solution and deal with the model instance: HandleCollect, HandleDelete, HandleStatus, and ReadyCollect.

The option ThreadsAsync (available on the command line and with the option statement) sets the maximum number of threads that should be used for the asynchronous solves.

The following matrix shows which solvers can be used with SolveLink = 6 on which platform:

Solver x86 32bit
MS Windows
x86 64bit
MS Windows
x86 64bit
Linux
x86 64bit
Mac OS X
x86 64bit
SOLARIS
Sparc 64bit
SOLARIS
IBM Power 64bit
AIX
GUROBI × × × × ×
SCIP × × × × ×
SNOPT ×
XPRESS × × × × × × ×
MOSEK × × × ×
CONOPTD × × × × × ×
CPLEXD × × × × × × ×
OsiCplex × × × × ×
OsiGurobi × × × ×

If a solver is selected for which SolveLink = 6 is not supported on the corresponding platform, SolveLink = 3 will be used instead (which is noted in the log).

Multi-threading Submission Testing

The multi-threading submission process can be tested on any GAMS program without having to change the source text. The solvelink=7 option (compile time constant %solveLink.Threads Simulate%) instructs the solve statement to use the multi-threading submission process and then wait until the results are available and then loads the solution into the GAMS data base. The solvelink option can be set via a GAMS command line parameter or via assignment to a the model attribute. Once the model instance has been submitted for solution, GAMS will check if the job has been completed. It will keep checking twice the reslim seconds allocated for this optimization job and report a failure if this limit has been exceed. After successful or failed retrieval of the solution GAMS will remove the thread handle.

Glossary and Definitions

BCH          Branch Cut Heuristic

Condor       High throughput computing system

GAMS        General Algebraic Modeling System

GDX          GAMS Data Exchange

HPC          High Performance Computing

SUN Grid Compute Utility