11. Specification for Parallel Execution

Parallel computation includes distributed-memory parallelization and shared-memory parallelization, both of which are supported in PHITS. A hybrid mode combining these two approaches is also available.

For distributed-memory parallelization, MPI must be installed on your computer. For shared-memory parallelization, no additional protocols or software need to be installed. However, when using the same number of CPU cores, distributed-memory parallelization often results in shorter computation times.

Switching between single and parallel execution is controlled by compiler options, and separate executable files must be created for each mode. For details, see Section 10.

In distributed-memory parallelization, jobs are assigned to each CPU core in units of batches, and the main core collects the results after all cores have completed their batch calculations. Since every core reads the geometry and tally data independently, the required memory is approximately that of single execution multiplied by the number of cores. This approach is therefore not suitable for calculations requiring large amounts of memory, such as voxel phantom simulations. In addition, because result collection waits for all cores to finish, computation time may become unnecessarily long if processing time is strongly imbalanced among cores, for example when the number of histories per batch (maxcas) is small.

In shared-memory parallelization, jobs are assigned to each core in units of histories, and all cores share geometry and tally data during the computation. Memory usage is therefore comparable to that of single execution. However, because simultaneous writes to shared memory can cause contention among cores, computation time may become unnecessarily long for calculations that frequently write results, such as those using [t-sed].

11.1. Distributed-Memory Parallelization

11.1.1. Setup

To perform distributed-memory (MPI) parallel computation, an MPI implementation must be installed.

On Windows, when running parallel computations on a single PC, use the MPI included in the Intel oneAPI package. For installation, refer to phits/document/Install-IntelFortran-OneAPI-en.pdf and Section 10.1.

When running parallel computations across multiple Windows PCs, refer to Windows-MPI-setup-jp.docx in the phits/document/mpi folder.

For macOS and Linux, install an MPI implementation by following appropriate instructions available on the internet.

11.1.2. Execution Using Batch Files or Shell Scripts

To run the MPI parallel version of PHITS on a single PC, add $MPI=M (M is the number of parallel processes) before the first section of the PHITS input file.

For example, to efficiently use a PC with four CPU cores:

$MPI = 4

Note that the number of processing elements (PEs) that actually perform the calculation is M − 1, since one PE is used for control.

After saving the input file in this form, PHITS can be executed in MPI parallel mode using the standard execution procedure.

On Windows, when running for the first time, you will be prompted to enter a user name and password.

With this execution method, hybrid parallelization with OpenMP is currently not supported. If both $OMP (for OpenMP) and $MPI (for MPI) are specified in the input file, the one written later takes precedence.

11.1.3. Execution from the Command Line

When running from the command-line interface provided on each OS, the execution command is, for example:

mpirun -np 5 phits_LinIfort_MPI

Here, mpirun is the executable of the installed MPI implementation, the number after -np specifies the number of processing elements (PEs), and phits_LinIfort_MPI is the PHITS executable.

Submit this command using a system-specific method such as qsub.

In distributed-memory parallel mode, PHITS reads the name of the input file from a file named phits.in. The file name phits.in is fixed. Input redirection from the shell is therefore not supported. This restriction applies only to distributed-memory parallel mode.

Write the input file name in the first line of phits.in as follows:

file = input_file_name

Alternatively, if you write file=phits.in in the first line, you can place the entire input data below it in the same file and execute PHITS.
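
The procedure in this subsection can be sketched as a short shell session. The input file name example.inp is hypothetical, and the mpirun line is attempted only when both the launcher and the executable named above are found on the PATH; on a system with a job scheduler the same command would go into the job script instead.

```shell
#!/bin/sh
# Sketch: prepare phits.in for a distributed-memory run.
# "example.inp" is a hypothetical input file name used only for illustration.
input_name="example.inp"

# The first line of phits.in names the actual PHITS input file.
printf 'file = %s\n' "$input_name" > phits.in

# Launch 5 PEs (1 for control, 4 for computation) only if the tools exist.
if command -v mpirun >/dev/null 2>&1 &&
   command -v phits_LinIfort_MPI >/dev/null 2>&1; then
    mpirun -np 5 phits_LinIfort_MPI
else
    echo "mpirun or phits_LinIfort_MPI not found; phits.in written, nothing run"
fi
```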

11.1.4. Specification of maxcas and maxbch

In distributed-memory parallel computation, PHITS parallelizes the calculation in units of batches.

Therefore, the number of batches (maxbch) should be specified as a multiple of the number of processing elements used for computation (total PEs minus one, since one PE is used for control).

If this condition is not satisfied, the program automatically adjusts the values so that maxbch becomes a multiple and the total number of histories remains approximately unchanged. In such cases, a comment is printed at the end of the input echo in the output.

In distributed-memory parallel mode, batch-wise information is output once every (PE − 1) batches, where PE is the total number of PEs. Intermediate termination is likewise performed in units of (PE − 1) batches.

For restart calculations (istdev < 0), maxcas is automatically set to match the previous results, so no adjustment is made to preserve the total number of histories. Only maxbch is adjusted to be a multiple of (PE − 1).
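
As a rough illustration of the adjustment for a new (non-restart) calculation, the sketch below rounds maxbch up to the next multiple of the number of computing PEs and rescales maxcas so that maxcas × maxbch stays close to the original total. The rounding rule and the helper name adjust_batches are assumptions made for this example; PHITS may choose the values differently, so check the comment at the end of the input echo for the values actually used.

```shell
#!/bin/sh
# adjust_batches MAXCAS MAXBCH TOTAL_PES   (hypothetical helper, not part of PHITS)
# Prints a maxcas/maxbch pair in which maxbch is a multiple of (TOTAL_PES - 1).
adjust_batches() {
    maxcas=$1
    maxbch=$2
    total_pes=$3
    workers=$((total_pes - 1))      # one PE is reserved for control
    if [ "$workers" -lt 1 ]; then
        echo "need at least 2 PEs" >&2
        return 1
    fi
    # Assumed rule: round maxbch up to the next multiple of the worker count.
    new_bch=$(( (maxbch + workers - 1) / workers * workers ))
    # Rescale maxcas (rounded to nearest) to keep the total roughly unchanged.
    total=$((maxcas * maxbch))
    new_cas=$(( (total + new_bch / 2) / new_bch ))
    echo "maxcas=$new_cas maxbch=$new_bch"
}

# maxcas=1000, maxbch=10, launched with 5 PEs (4 computing PEs)
adjust_batches 1000 10 5
```

With these inputs the helper proposes 12 batches of 833 histories, that is 9996 histories in place of the requested 10000.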

11.1.5. Handling of Abnormal Termination

If a PE terminates abnormally, that PE is removed, and the calculation continues with the remaining PEs.

The final result is obtained by summing the results from the remaining PEs. The status of each PE is reported in the batch information and in the calculation summary.

11.1.6. Output File Names for Dump, dumpall, and [t-userdefined]

When using distributed-memory parallelization, output files for Dump, dumpall, and [t-userdefined] are split, and a number corresponding to the PE (e.g., .005) is appended to each file name.

If the number of parallel processes is large (four or five digits), the number of digits in the suffix increases accordingly.
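
When collecting the split files after a run, the per-PE file names can be rebuilt in a loop. In the sketch below, the base name dump.out, the three-digit zero padding (inferred from the .005 example), the numbering starting at 1, and the helper name pe_file are all assumptions; check the names actually produced by your run.

```shell
#!/bin/sh
# pe_file BASE PE   (hypothetical helper)
# Prints BASE followed by a zero-padded PE number, e.g. dump.out.005.
# Numbers with more than three digits are printed in full.
pe_file() {
    printf '%s.%03d\n' "$1" "$2"
}

# Report which per-PE files exist for an assumed run with PEs 1 to 4.
pe=1
while [ "$pe" -le 4 ]; do
    f=$(pe_file dump.out "$pe")
    if [ -f "$f" ]; then
        echo "found   $f"
    else
        echo "missing $f"
    fi
    pe=$((pe + 1))
done
```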

11.1.7. Specification of Input Files in PHITS

A typical example of files read by PHITS is a source file generated by Decay-Turtle.

A file of about 2.6 MB can be accessed simultaneously by all PEs without imposing a significant load on the network. However, larger files (on the order of 100 MB) may cause problems when accessed repeatedly by multiple PEs.

In such cases, it is recommended to copy the data file to each PE’s working directory (e.g., /wk/j9999/turtle/sours.dat) in advance, and specify it in the PHITS input as:

file = /wk/j9999/turtle/sours.dat
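
The copy step can be sketched as a small helper that is run on each node before PHITS starts, for example at the top of the job script. The helper name stage_file and the source path shown in the usage comment are hypothetical; only the destination directory comes from the example above.

```shell
#!/bin/sh
# stage_file SRC DEST_DIR   (hypothetical helper)
# Copies SRC into DEST_DIR unless an identical copy is already there,
# then prints the path of the local copy.
stage_file() {
    src=$1
    dest_dir=$2
    mkdir -p "$dest_dir" || return 1
    dest="$dest_dir/$(basename "$src")"
    # Skip the copy when the local file already matches the source.
    if [ ! -f "$dest" ] || ! cmp -s "$src" "$dest"; then
        cp "$src" "$dest" || return 1
    fi
    echo "$dest"
}

# Usage (the source path is an assumption for illustration):
#   stage_file /home/j9999/turtle/sours.dat /wk/j9999/turtle
```

The printed path is the one to give in the file = line of the PHITS input.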

11.2. Shared-Memory Parallelization

11.2.1. Execution Using Batch Files or Shell Scripts

The shared-memory parallel version of PHITS can be executed by adding $OMP=N (where N is the number of CPU cores to be used) before the first section of the PHITS input file, and then running PHITS in the usual manner.

When N=1, parallel computation is not used. When N=0, all available CPU cores are used.

These settings are effective when PHITS is executed using phits.bat (on Windows) or phits.sh (on macOS and Linux).

11.2.2. Execution from the Command Line

When executing directly from the command line by specifying the executable, the procedure is the same as for the single version of PHITS:

phits_LinIfort_OMP < phits.inp

Here, phits_LinIfort_OMP is the executable file name.

Unlike distributed-memory parallelization, input redirection can be used, so the input file name (phits.inp) is flexible.

However, to specify the number of parallel threads, the environment variable OMP_NUM_THREADS must be set according to the number of CPU cores.

The default number of threads usually corresponds to the number of cores. However, because modern CPUs can process multiple threads per core, the default may be larger than the number of physical cores.

In PHITS, setting OMP_NUM_THREADS to a value larger than the number of cores may lead to longer computation time due to contention in file write operations. In such cases, adjust the environment variable manually as follows:

export OMP_NUM_THREADS=8
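
Instead of a fixed number, the value can be derived from the count of physical cores. The sketch below assumes a Linux system on which the lscpu utility is available and falls back to the logical CPU count otherwise; the helper name physical_cores is hypothetical.

```shell
#!/bin/sh
# physical_cores   (hypothetical helper)
# Counts unique (core, socket) pairs reported by lscpu, so that several
# hardware threads on the same core are counted only once.
physical_cores() {
    n=0
    if command -v lscpu >/dev/null 2>&1; then
        n=$(lscpu -p=CORE,SOCKET 2>/dev/null | grep -v '^#' | sort -u | wc -l)
    fi
    # Fall back to the number of online logical CPUs, or 1 as a last resort.
    if [ "${n:-0}" -lt 1 ]; then
        n=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)
    fi
    echo "$n"
}

OMP_NUM_THREADS=$(physical_cores)
export OMP_NUM_THREADS
echo "OMP_NUM_THREADS=$OMP_NUM_THREADS"
```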

In hybrid parallel execution, the environment variable must be set individually on each compute node.

Since version 2.73, the Windows version of PHITS installs the 64-bit executable for shared-memory parallelization. In the 32-bit version, increasing the number of cores could cause errors due to insufficient heap memory. Using the 64-bit version may avoid this issue.

On Linux, the library libiomp5.so may be required for shared-memory parallel execution. In such cases, follow the steps below.

First, install the libomp-dev package according to your distribution. For example, on Ubuntu:

sudo apt-get install libomp-dev

This installs the library /usr/lib/x86_64-linux-gnu/libomp.so.5.

Next, move to this directory:

cd /usr/lib/x86_64-linux-gnu/

Then create a symbolic link to libiomp5.so:

sudo ln -s libomp.so.5 libiomp5.so

After this, the libiomp5.so library will be recognized, and shared-memory parallel execution with PHITS will be available.

11.2.3. Notes on Shared-Memory Parallel Computation

When using only one CPU core, the computation time of the shared-memory parallel version of PHITS is approximately twice that of the single version. Therefore, for computers with two or fewer cores, there is little benefit in using shared-memory parallelization.

If a segmentation fault occurs during execution on Linux, it may be resolved by increasing the stack size before execution:

export OMP_STACKSIZE=1G

In this example, the stack size is set to 1 GB.

The results of shared-memory parallel computation are designed to be consistent with those of single execution. If discrepancies are observed, they may indicate a bug, and you are encouraged to contact the PHITS office.