.. _sec-parallel:

**************************************************
Specification for Parallel Execution
**************************************************

Parallel computation includes distributed-memory parallelization and shared-memory parallelization, both of which are supported in PHITS.
A hybrid mode combining these two approaches is also available.

For distributed-memory parallelization, MPI must be installed on your computer.
For shared-memory parallelization, no additional protocols or software need to be installed.
However, when using the same number of CPU cores, distributed-memory parallelization often results in shorter computation times.

Switching between single and parallel execution is controlled by compiler options, and separate executable files must be created for each mode.
For details, see :numref:`sec-compile`.

In distributed-memory parallelization, jobs are assigned to each CPU core in units of batches, and the main core collects the results after all cores have completed their batch calculations.
Since every core independently reads the geometry and tally data, the required memory is approximately that of single execution multiplied by the number of cores.
Therefore, this approach is not suitable for calculations requiring large amounts of memory, such as voxel phantom simulations.
In addition, because result collection waits for all cores to finish, a large imbalance in processing time among cores can make the computation unnecessarily long. This can happen, for example, when the number of histories per batch (**maxcas**) is small.

In shared-memory parallelization, jobs are assigned to each core in units of histories, and all cores share geometry and tally data during the computation.
Therefore, memory usage is comparable to that of single execution.
However, since write access to memory may cause contention, computation time may become unnecessarily long for calculations that frequently write results, such as those using **[t-sed]**.


Distributed-Memory Parallelization
==================================================

.. _sec-MPIpara_setup:

Setup
--------------------------------------------------

To perform distributed-memory (MPI) parallel computation, an MPI implementation must be installed.

On Windows, when running parallel computations on a single PC, use the MPI included in the Intel oneAPI package.
For installation, refer to ``phits/document/Install-IntelFortran-OneAPI-en.pdf`` and :numref:`sec-compile-Win`.

When running parallel computations across multiple Windows PCs, refer to ``Windows-MPI-setup-jp.docx`` in the ``phits/document/mpi`` folder.

For macOS and Linux, install an MPI implementation by following the installation instructions published for the implementation you choose.

Execution Using Batch Files or Shell Scripts
--------------------------------------------------

To run the MPI parallel version of PHITS on a single PC, add **$MPI=M** (**M** is the number of parallel processes) before the first section of the PHITS input file.

For example, to efficiently use a PC with four CPU cores:

.. code-block:: text

  $MPI = 4

Note that the actual number of processing elements (PEs) used in parallel execution is **M+1**, since one PE is used for control.

After saving the input file in this form, PHITS can be executed in MPI parallel mode using the standard execution procedure.

On Windows, when running for the first time, you will be prompted to enter a user name and password.

With this execution method, hybrid parallelization combining MPI and OpenMP is currently not supported. If both **$OMP** (for OpenMP) and **$MPI** (for MPI) are specified in the input file, the one written later takes precedence.
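
As a minimal sketch, the ``$MPI`` line can also be prepended to an existing input file from a shell script. The helper name ``add_mpi_header`` and the file names are hypothetical, and the sketch assumes that the input file does not already contain a ``$MPI`` or ``$OMP`` line.

.. code-block:: shell

  # Hypothetical helper: write a copy of a PHITS input file with "$MPI = M"
  # placed before the first section.
  # Usage: add_mpi_header M input_file output_file
  add_mpi_header() {
    m=$1; src=$2; dst=$3
    # Single quotes keep the shell from expanding $MPI.
    { printf '$MPI = %s\n' "$m"; cat "$src"; } > "$dst"
  }

For example, ``add_mpi_header 4 phits.inp phits_mpi.inp`` leaves the original file untouched and writes the MPI variant next to it.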

.. _sec-paracomp_exe:

Execution from the Command Line
--------------------------------------------------

When running from the command-line interface provided on each OS, the execution command is, for example:

.. code-block:: console

  mpirun -np 5 phits_LinIfort_MPI

Here, ``mpirun`` is the executable of the installed MPI implementation, the number after **-np** specifies the number of processing elements (PEs), and ``phits_LinIfort_MPI`` is the PHITS executable.

On systems with a job scheduler, submit this command using the system-specific method, such as ``qsub``.

In distributed-memory parallel mode, PHITS automatically reads the input file name from the file ``phits.in``.
The file name ``phits.in`` is fixed.

Write the input file name in the first line of ``phits.in`` as follows:

.. code-block:: text

  file = input_file_name

Because the input file name is always taken from ``phits.in``, input redirection from the shell is not supported.
This restriction applies only to distributed-memory parallel mode.

Alternatively, if you write **file=phits.in** in the first line, you can place the entire input data below it and execute PHITS.
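
The following shell sketch shows one way to prepare ``phits.in`` before launching. The helper name ``write_phits_in`` and the input file name ``my_input.inp`` are hypothetical. The ``mpirun`` line repeats the example command from this section and is shown only as a comment.

.. code-block:: shell

  # Hypothetical helper: create the fixed-name control file "phits.in"
  # whose first line points PHITS at the real input file.
  # Usage: write_phits_in input_file_name [directory]
  write_phits_in() {
    input=$1; dir=${2:-.}
    printf 'file = %s\n' "$input" > "$dir/phits.in"
  }

  # Example (not executed here): 5 PEs in total, one of which is used for control.
  #   write_phits_in my_input.inp
  #   mpirun -np 5 phits_LinIfort_MPI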

Specification of **maxcas** and **maxbch**
--------------------------------------------------

In distributed-memory parallel computation, PHITS parallelizes the calculation in units of batches.

Therefore, the number of batches (**maxbch**) should be specified as a multiple of the number of processing elements used for computation (total PEs minus one, since one PE is used for control).

If this condition is not satisfied, the program automatically adjusts the values so that **maxbch** becomes a multiple and the total number of histories remains approximately unchanged.
In such cases, a comment is printed at the end of the input echo in the output.

Batch-wise information is output every (number of batches × (PE − 1)) in distributed-memory parallel mode.
Intermediate termination can also be performed at this unit.

For restart calculations (**istdev < 0**), **maxcas** is automatically adjusted to match previous results, so the total number of histories is not modified.
Only **maxbch** is adjusted to be a multiple of (PE − 1).
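
To avoid the automatic adjustment, suitable values can be chosen in advance. The sketch below is one possible rule, not the rule PHITS applies internally: it rounds **maxbch** up to the next multiple of the number of computing PEs and rescales **maxcas** so that the total number of histories stays close to the requested value. The helper name ``adjust_batches`` is hypothetical.

.. code-block:: shell

  # Hypothetical helper: pick a batch count that is a multiple of the number
  # of computing PEs (total PEs minus the control PE) and rescale maxcas so
  # that maxcas * maxbch stays close to the requested total.
  # Assumption: maxbch is rounded UP; PHITS's own adjustment may differ.
  # Usage: adjust_batches maxcas maxbch total_pes  -> prints "maxcas maxbch"
  adjust_batches() {
    maxcas=$1; maxbch=$2; workers=$(( $3 - 1 ))
    if [ "$workers" -lt 1 ]; then
      echo "need at least 2 PEs" >&2
      return 1
    fi
    # Integer arithmetic: smallest multiple of workers that is >= maxbch.
    new_bch=$(( (maxbch + workers - 1) / workers * workers ))
    # Integer division, so the new total can be slightly below the request.
    new_cas=$(( maxcas * maxbch / new_bch ))
    echo "$new_cas $new_bch"
  }

With ``adjust_batches 1000 10 5`` (four computing PEs), the sketch prints ``833 12``, that is 9996 histories instead of the requested 10000.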

Handling of Abnormal Termination
--------------------------------------------------

If a PE terminates abnormally, that PE is removed, and the calculation continues with the remaining PEs.

The final result is obtained by summing the results from the remaining PEs.
The status of each PE is reported in the batch information and in the calculation summary.

Output File Names for **Dump**, **dumpall**, and **[t-userdefined]**
--------------------------------------------------------------------

When using distributed-memory parallelization, output files for **Dump**, **dumpall**, and **[t-userdefined]** are split, and a number corresponding to the PE (e.g., ``.005``) is appended to each file name.

If the number of parallel processes is large (four or five digits), the number of digits in the suffix increases accordingly.
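
After a run, the split files can be collected by matching the numeric suffix. The sketch below assumes only that each piece is named ``<base>.<digits>`` as described above. The helper name ``list_pe_files`` and the base name ``dump.out`` are hypothetical.

.. code-block:: shell

  # Hypothetical helper: list the per-PE pieces of a split output file,
  # i.e. files named "<base>.<digits>" such as dump.out.005.
  # Usage: list_pe_files base_name
  list_pe_files() {
    for f in "$1".[0-9]*; do
      # Skip the unexpanded pattern and names whose suffix is not all digits.
      case ${f#"$1".} in
        ''|*[!0-9]*) continue ;;
      esac
      if [ -e "$f" ]; then
        echo "$f"
      fi
    done
  }

Because the suffix is matched as "one or more digits", the same helper also covers runs in which the suffix has four or five digits.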

Specification of Input Files in PHITS
--------------------------------------------------

A typical example of a file read by PHITS is a source file generated by Decay-Turtle.
This file is about 2.6 MB in size, so simultaneous access by all PEs is unlikely to impose a significant load on the network.
However, larger files (on the order of 100 MB) may cause issues when accessed repeatedly by multiple PEs.

In such cases, it is recommended to copy the data file to each PE’s working directory (e.g., ``/wk/j9999/turtle/sours.dat``) in advance, and specify it in the PHITS input as:

.. code-block:: text

  file = /wk/j9999/turtle/sours.dat
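
The copy step can be scripted so that it runs once per node before PHITS starts. The following is a minimal sketch. The helper name ``stage_file`` is hypothetical, and how the script is run on each node depends on your cluster.

.. code-block:: shell

  # Hypothetical helper: copy a shared data file into a node-local working
  # directory unless an identical copy is already there, then print the
  # local path to use in the "file = ..." line.
  # Usage: stage_file shared_path local_dir
  stage_file() {
    src=$1; dir=$2
    dst=$dir/$(basename "$src")
    mkdir -p "$dir" || return 1
    # cmp -s is silent and fails if the copy is missing or differs.
    if ! cmp -s "$src" "$dst"; then
      cp "$src" "$dst" || return 1
    fi
    echo "$dst"
  }

Comparing before copying avoids transferring a large file again when the job is restarted on the same node.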


Shared-Memory Parallelization
==================================================


Execution Using Batch Files or Shell Scripts
--------------------------------------------------

The shared-memory parallel version of PHITS can be executed by adding **$OMP=N** (where **N** is the number of CPU cores to be used) before the first section of the PHITS input file, and then running PHITS in the usual manner.

When **N=1**, parallel computation is not used.
When **N=0**, all available CPU cores are used.

These settings are effective when PHITS is executed using ``phits.bat`` (on Windows) or ``phits.sh`` (on macOS and Linux).


Execution from the Command Line
--------------------------------------------------

When executing directly from the command line by specifying the executable, the procedure is the same as for the single version of PHITS:

.. code-block:: console

  phits_LinIfort_OMP < phits.inp

Here, ``phits_LinIfort_OMP`` is the executable file name.

Unlike distributed-memory parallelization, input redirection can be used, so the input file can have any name (``phits.inp`` in this example).

However, to specify the number of parallel threads, the environment variable ``OMP_NUM_THREADS`` must be set according to the number of CPU cores.

Although the number of threads usually defaults to the number of cores, modern CPUs can process multiple threads per core, so the default may be larger than the number of physical cores.

In PHITS, setting ``OMP_NUM_THREADS`` to a value larger than the number of cores may lead to longer computation time due to contention in file write operations.
In such cases, adjust the environment variable manually as follows:

.. code-block:: console

  export OMP_NUM_THREADS=8
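
Instead of hard-coding the value, the number of physical cores can be queried. The sketch below assumes a Linux system with ``lscpu`` from util-linux. If ``lscpu`` is unavailable, it falls back to the logical CPU count from ``getconf``, which may include hardware threads. The helper name ``physical_cores`` is hypothetical.

.. code-block:: shell

  # Hypothetical helper: print the number of physical cores.
  # Assumption: Linux with lscpu; otherwise fall back to the logical CPU
  # count reported by getconf, and finally to 1.
  physical_cores() {
    n=""
    if command -v lscpu >/dev/null 2>&1; then
      # Each unique "core,socket" pair is one physical core.
      n=$(lscpu -p=Core,Socket 2>/dev/null | grep -v '^#' | sort -u | wc -l | tr -d ' ')
    fi
    case $n in
      ''|0|*[!0-9]*) n=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1) ;;
    esac
    echo "${n:-1}"
  }

  OMP_NUM_THREADS="$(physical_cores)"
  export OMP_NUM_THREADS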

In hybrid parallel execution, the environment variable must be set individually on each compute node.

Since version 2.73, the Windows version of PHITS installs the 64-bit executable for shared-memory parallelization.
In the 32-bit version, increasing the number of cores could cause errors due to insufficient heap memory.
Using the 64-bit version may avoid this issue.

On Linux, the library ``libiomp5.so`` may be required for shared-memory parallel execution.
In such cases, follow the steps below.

First, install the ``libomp-dev`` package according to your distribution.
For example, on Ubuntu:

.. code-block:: console

  sudo apt-get install libomp-dev

This installs the library ``/usr/lib/x86_64-linux-gnu/libomp.so.5``.

Next, move to this directory:

.. code-block:: console

  cd /usr/lib/x86_64-linux-gnu/

Then create a symbolic link to ``libiomp5.so``:

.. code-block:: console

  sudo ln -s libomp.so.5 libiomp5.so

After this, the ``libiomp5.so`` library will be recognized, and shared-memory parallel execution with PHITS will be available.
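
Before creating the link, it can help to confirm that ``libiomp5.so`` is in fact the missing library. The sketch below filters the output of ``ldd`` for entries that the dynamic loader reports as not found. The helper name ``missing_libs`` is hypothetical.

.. code-block:: shell

  # Hypothetical helper: read "ldd" output on standard input and print the
  # names of shared libraries that the dynamic loader could not find.
  missing_libs() {
    awk '/=> not found/ { print $1 }'
  }

  # Example (not executed here), using the executable name from this section:
  #   ldd phits_LinIfort_OMP | missing_libs

If the command prints nothing after the link has been created, all shared libraries required by the executable were resolved.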


Notes on Shared-Memory Parallel Computation
--------------------------------------------------

When using only one CPU core, the computation time of the shared-memory parallel version of PHITS is approximately twice that of the single version.
Therefore, for computers with two or fewer cores, there is little benefit in using shared-memory parallelization.

If a segmentation fault occurs during execution on Linux, it may be resolved by increasing the stack size before execution:

.. code-block:: console

  export OMP_STACKSIZE=1G

In this example, the stack size is set to 1 GB.

The results of shared-memory parallel computation are designed to be consistent with those of single execution.
If discrepancies are observed, they may indicate a bug, and you are encouraged to contact the PHITS office.