
Parallel Execution of ChemShell

Introduction

This section outlines the procedures for running ChemShell on parallel computing platforms using MPI. It should be emphasised that, because operating systems vary between platform types (and indeed between different installations of the same machines), these instructions may need to be modified to reflect local conditions.

External codes

It is possible in principle to launch parallel QM calculations without building or running ChemShell in parallel. Depending on the QM code, specific environment variables can be set to invoke it in parallel. In some cases the nproc= argument must also be passed to the QM code. Consult the manual page of the relevant code for more details.
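
As an illustration only (the theory name qmcode, the basis and fragment names, and whether a given interface accepts an nproc= option are assumptions to be checked against the relevant interface documentation), the option would typically be passed along with the other theory options:

# Hypothetical fragment: ask the QM interface to run on 4 processes.
# "qmcode", the basis and the fragment file are placeholders, not real settings.
energy coords=my_molecule.c theory=qmcode : [ list basis=dzp nproc=4 ]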

This approach generally works well only on workstations or locally-managed clusters. Other parallel environments, particularly HPC systems, may not allow parallel binaries to be launched from another binary (e.g. if mpirun is not available on the compute nodes). In this case, you must use a code that can be directly linked into a parallel build of ChemShell (see below).

Parallel ChemShell

ChemShell can be executed in parallel if it has been compiled using the --with-mpi option (see the installation guide). The advantages of running ChemShell in parallel include execution of some internal modules such as dl_poly in parallel, and the possibility to directly link external codes (see below).

The chemsh script contains the mpirun (or equivalent) command to launch the ChemShell binary (chemsh.x) in parallel. Do not pass the chemsh script itself as an argument to mpirun, as this will result in an error. The -p argument is used to specify the number of processes required.
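
For example, to run ChemShell on 16 processes:

chemsh -p 16 input.chm > output.log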

As parallel execution is very platform-dependent, you may need to modify the chemsh script for your platform, particularly if mpirun is not used. Alternatively, it may be more convenient to call the ChemShell binary directly. In this case, the environment variables that are normally set by the chemsh script must be set manually. For example:

ROOT=/home/[your path]/chemsh
export TCLLIBPATH=${ROOT}/tcl
export TCL_LIBRARY=/home/[your path to tcl]/lib/tcl8.5
export PATH=${PATH}:${ROOT}/scripts

mpirun -np [no. of processes] chemsh.x input.chm > output.log

Direct linking of external codes

Some external codes (such as GAMESS-UK and GULP) can be linked directly into ChemShell as libraries, so that the package can be executed in parallel as a single binary. This is more efficient than launching the external codes separately, and is essential on platforms where it is not possible to launch one parallel binary from another (e.g. most HPC facilities do not allow this).

The supported external codes must usually be compiled with special flags to produce a library. Once this is done, direct linking must be specified when ChemShell is configured, using arguments such as --with-gamess-uk. For further information, see the installation guide.
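
For illustration only (the name of the configure script and any additional options required are platform-specific and are covered in the installation guide), a direct-linked parallel build might be configured with something like:

./configure --with-mpi --with-gamess-uk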

Once compiled, ChemShell is executed in parallel as described above. For GAMESS-UK and GULP the directly-linked library is automatically used, but for other codes it may be necessary to set the variable linked to 1 in the Tcl interface file.
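
For example (the interface file concerned, and where in it the variable lives, are code-specific; this simply illustrates the Tcl assignment):

set linked 1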

Hints

Caching ChemShell Objects

When running on a parallel machine it is particularly important to reduce the I/O requirements of the job by caching ChemShell objects in memory. See the section on object declarations.

Executing shell commands on all processors

The command pexec_module will run a command on all processors of the parallel machine. It takes three arguments: a name (for listing purposes only), the command to be executed, and a file to be used as stdin (NULL denotes no standard input). For example, the command

pexec_module hostname hostname NULL

will list the hostnames of all processors available to the parallel job. More complex commands can be useful for housekeeping tasks such as creating and removing scratch directories, as in the sketch below.
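
For instance, a scratch directory could be created on every node before a run (the directory path and the name make_scratch are purely illustrative):

pexec_module make_scratch "mkdir -p /tmp/chemsh_scratch" NULL

A matching command with rm -rf in place of mkdir -p could be used to clean up afterwards.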

Task-farming parallelism

MPI parallel builds of ChemShell can run in a task-farm parallel mode. The standard parallel mode is a master/slave configuration in which a Tcl interpreter runs on the master node and the slave nodes remain idle until a parallel input command is executed. All nodes are grouped together in the default MPI World communicator. In the task-farming parallel mode the MPI World communicator is split to form smaller sets of processors called workgroups.

The number of workgroups is fixed for the lifetime of the ChemShell run and is specified using the command-line argument -nworkgroups, which is passed to ChemShell when it is executed. For example, the following command would run a 64-processor job using 4 workgroups of 16 processors each:

chemsh -p 64 input.chm -nworkgroups 4 > output.log

In each workgroup one node acts as a master and the rest are slaves. Each master runs a copy of the Tcl interpreter and independently executes the same ChemShell input script.

The following Tcl commands can be used to report workgroup information:

  • workgroupid - unique workgroup number, counting from 0
  • nworkgroups - total number of workgroups in the task farm

These commands may be used to make parts of the input script conditional on the workgroup ID, which provides a mechanism for distributing tasks between workgroups (see the sketch below).
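
As a minimal sketch (the job list, the round-robin assignment and the reporting are illustrative assumptions, not part of the manual), a set of independent tasks could be divided between workgroups as follows:

# Round-robin distribution of a list of jobs over the available workgroups.
# Each workgroup executes only the loop iterations assigned to it.
set jobs  { job_a job_b job_c job_d }
set ngrp  [ nworkgroups ]
set mygrp [ workgroupid ]
for { set i 0 } { $i < [ llength $jobs ] } { incr i } {
  if { $i % $ngrp == $mygrp } {
    puts "Workgroup $mygrp handling [ lindex $jobs $i ]"
    # ... run the actual calculation for this job here ...
  }
}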

To prevent file conflicts, a scratch working directory is created for each workgroup (workgroup0, workgroup1, ...). By default ChemShell objects are loaded from the workgroup's working directory, but if an object is not present there the common parent directory is also searched. Objects in these directories may be thought of as local and global respectively. Global objects may not be created directly, but local objects may be 'globalised' using the command:

taskfarm_globalise_objects object1 object2 ...

Essentially this moves the objects from the workgroup directory they are in to the parent directory. If an object exists in more than one workgroup directory, priority is given to the copy in workgroup0/, followed by workgroup1/, workgroup2/, and so on. The workgroups are synchronised with an MPI barrier before and after the operation, again to avoid file conflicts.

The workgroups may be manually synchronised using the command synch_workgroups.
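
For example (an illustrative pattern, not taken from the manual), a barrier followed by a workgroup-ID test ensures that all workgroups have completed their tasks before workgroup 0 carries out any post-processing:

synch_workgroups
if { [ workgroupid ] == 0 } {
  puts "All workgroups have finished; post-processing can start"
}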

The task-farm framework is particularly useful in combination with DL-FIND. NEB optimisations, finite-difference Hessian calculations and global minimisations (genetic algorithm/stochastic search) may be run in task-farmed mode by calling the dl-find command as normal.

Note when restarting a task-farmed DL-FIND calculation: DL-FIND checkpoint files are only written by workgroup 0 and only after data has been shared (e.g. after a full NEB cycle or Hessian evaluation). On restart they therefore need to be copied manually into the other workgroup scratch directories using the following code:

# Copy the checkpoint files written by workgroup 0 into this workgroup's
# scratch directory before restarting.
if { [ workgroupid ] != 0 } {
  set chklist [glob ../workgroup0/*.chk]
  foreach chk $chklist {
    file copy -force $chk .
  }
}

The force module may also be run in task-farmed mode. For more information see the force documentation.




