.. _guide:

User Guide
==========

.. _guide-pspace:

Constructing parameter spaces
-----------------------------

Parameter spaces are constructed with the `Param` class. You pass in
sequences of parameter values as keyword arguments. As long as you pass in at
least one sequence, other parameter values are allowed to be scalars and will be
replicated to match the sequence length.

>>> from psyrun import Param
>>> pspace = Param(a=[1, 2, 3], b=[4, 5, 6], c=7)

The `Param` object only stores the information to construct all
parameter assignments. Call the :meth:`~.ParameterSpace.build` method to construct
a dictionary with these parameter assignments. The dictionary will have the
parameter names as keys and lists of the assigned values in corresponding order
as values.

>>> from pprint import pprint
>>> pprint(pspace.build())
{'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 7, 7]}

So far it would have been easier to just enter the resulting dictionary in the
first place. But the `Param` class makes it easy to construct more
complicated parameter spaces. Multiplying two `Param` instances
yields the Cartesian product of their parameter values.

>>> pspace = Param(a=[1, 2, 3], b=[4, 5, 6]) * Param(c=[7, 8])
>>> pprint(pspace.build())
{'a': [1, 1, 2, 2, 3, 3], 'b': [4, 4, 5, 5, 6, 6], 'c': [7, 8, 7, 8, 7, 8]}
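
The ordering of the product can be reproduced in plain Python. The following
sketch is not Psyrun's implementation; it only illustrates, using the standard
library, how each row of the first space is paired with each row of the second.
The helper name ``product_of_spaces`` is made up for this example.

.. code-block:: python

    from itertools import product


    def product_of_spaces(left, right):
        """Cartesian product of two dicts of equally long value lists."""
        # Turn each dict of columns into a list of row tuples.
        left_rows = list(zip(*left.values()))
        right_rows = list(zip(*right.values()))
        keys = list(left) + list(right)
        result = {key: [] for key in keys}
        # Rows of the left space vary slowest, rows of the right space fastest.
        for l_row, r_row in product(left_rows, right_rows):
            for key, value in zip(keys, l_row + r_row):
                result[key].append(value)
        return result


    combined = product_of_spaces({'a': [1, 2, 3], 'b': [4, 5, 6]}, {'c': [7, 8]})

Here ``combined`` holds the same lists as the ``pspace.build()`` output above.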

It is also possible to concatenate parameter spaces with the summation operator.

>>> pspace = Param(a=[1, 2]) + Param(a=[2, 3], b=[4, 4])
>>> pprint(pspace.build())
{'a': [1, 2, 2, 3], 'b': [nan, nan, 4, 4]}

As you can see, missing values will be filled with ``nan``.
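
The padding behavior can be pictured with a small plain-Python sketch. It is
an illustration only, not Psyrun's code, and the helper name
``concatenate_spaces`` is hypothetical.

.. code-block:: python

    def concatenate_spaces(first, second):
        """Concatenate two dicts of value lists, padding missing keys with nan."""
        nan = float('nan')
        n_first = len(next(iter(first.values())))
        n_second = len(next(iter(second.values())))
        keys = sorted(set(first) | set(second))
        # A key that is absent from one space contributes nan for each of
        # that space's rows.
        return {
            key: first.get(key, [nan] * n_first)
            + second.get(key, [nan] * n_second)
            for key in keys
        }


    combined = concatenate_spaces({'a': [1, 2]}, {'a': [2, 3], 'b': [4, 4]})

Keep in mind that ``nan`` never compares equal to itself, so use
``math.isnan`` rather than ``==`` when checking for filled-in values.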

There are two ways to exclude elements from a parameter space. The subtraction
operator removes everything with matching parameters.

>>> pspace = Param(a=[1, 2, 3], b=[1, 2, 3]) - Param(a=[2])
>>> pprint(pspace.build())
{'a': [1, 3], 'b': [1, 3]}

However, this raises an exception if the subtrahend contains parameters that
are not present in the minuend.

>>> pspace = Param(a=[1, 2, 3]) - Param(a=[2], c=[4])  # doctest: +IGNORE_EXCEPTION_DETAIL
Traceback (most recent call last):
    ...
psyrun.pspace.AmbiguousOperationError: Key `c` not existent in minuend.

In this case, the `missing` function can be used to determine all parameter
assignments missing in the second parameter space.

>>> from psyrun.pspace import missing
>>> pspace = missing(Param(a=[1, 2, 3]), Param(a=[2], c=[4]))
>>> pprint(pspace.build())
{'a': [1, 3]}

With these basic operations it is possible to construct complicated
parameter spaces. For very large spaces you might not want to convert the whole
space into a dictionary at once. For this case, the
:meth:`.Param.iterate` method iterates over the parameter assignments without
allocating more memory than necessary.

>>> pspace = Param(a=[1, 2]) * Param(b=[3, 4])
>>> for p in pspace.iterate():
...     pprint(p)
{'a': 1, 'b': 3}
{'a': 1, 'b': 4}
{'a': 2, 'b': 3}
{'a': 2, 'b': 4}
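
The memory benefit of lazy iteration can be sketched with a generator. This is
a plain-Python illustration for a full grid, not Psyrun's implementation; the
function name ``iterate_product`` is hypothetical.

.. code-block:: python

    from itertools import islice, product


    def iterate_product(**params):
        """Yield one assignment dict at a time for a full grid."""
        keys = list(params)
        for values in product(*(params[key] for key in keys)):
            yield dict(zip(keys, values))


    # The grid has one million assignments, but only three dicts are created.
    grid = iterate_product(a=range(1000), b=range(1000))
    first_three = list(islice(grid, 3))

Because each assignment is produced on demand, a consumer can stop early or
process the assignments one by one without ever holding the whole grid.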


.. _guide-mapping:

Evaluating functions on parameter spaces
----------------------------------------

Once the parameter space is constructed, one probably wants to evaluate
a function on it. For this, the function needs to accept a set of parameters as
keyword arguments and it has to return its results as a dictionary. Here is
a simple example function:

>>> def basic_math(a, b):
...     return {'sum': a + b, 'product': a * b}

The :func:`.map_pspace` function maps such a function onto a
parameter space.

>>> from pprint import pprint
>>> from psyrun import map_pspace, Param
>>> pspace = Param(a=[1, 2]) * Param(b=[3, 4])
>>> pprint(map_pspace(basic_math, pspace))
{'a': [1, 1, 2, 2],
 'b': [3, 4, 3, 4],
 'product': [3, 4, 6, 8],
 'sum': [4, 5, 5, 6]}
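
Conceptually, the mapping calls the function once per parameter assignment and
collects the returned values column by column. The following is a minimal
sketch of that idea in plain Python; it is not how :func:`.map_pspace` is
implemented, and ``map_rows`` is a made-up name.

.. code-block:: python

    def map_rows(fn, space):
        """Call fn once per row of a dict of value lists and collect results."""
        n_rows = len(next(iter(space.values())))
        # Start with a copy of the input columns, then add output columns.
        result = {key: list(values) for key, values in space.items()}
        for i in range(n_rows):
            kwargs = {key: values[i] for key, values in space.items()}
            for name, value in fn(**kwargs).items():
                result.setdefault(name, []).append(value)
        return result


    def basic_math(a, b):
        return {'sum': a + b, 'product': a * b}


    out = map_rows(basic_math, {'a': [1, 1, 2, 2], 'b': [3, 4, 3, 4]})

The sketch shows why the function has to accept keyword arguments and return a
dictionary: parameter names become argument names, and result keys become
columns.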

:func:`.map_pspace` evaluates each set of parameters serially. If the evaluated function
itself is not parallelized, it is probably more efficient to do the evaluation
for different sets of parameter values in parallel. If you have
`joblib <https://pythonhosted.org/joblib/>`_ installed and your function can be
pickled (e.g., it can be imported from a Python module), you can use
:func:`.map_pspace_parallel` to parallelize the evaluation of parameter sets.

>>> from psyrun import map_pspace_parallel
>>> from psyrun.utils.example import square
>>> pprint(map_pspace_parallel(square, Param(x=[1, 2, 3])))
{'x': [1, 2, 3], 'y': [1, 4, 9]}


Psyrun command line interface
-----------------------------

All Psyrun commands are invoked with ``psy <subcommand>``. The available
subcommands are described in the following. The ``psy`` command looks for task
definitions in the *psy-tasks* directory relative to its working directory, but
a different location can be provided with the ``--taskdir``
argument. To get help about the ``psy`` command or any subcommand use ``psy
--help`` and ``psy <subcommand> --help``.

run
^^^

``psy run [-h] [--taskdir TASKDIR] [task [task ...]]``

Without further arguments this executes all tasks that are not up-to-date. Each
subtask will be printed out prefixed with either ``.`` (if the task is
executed) or ``-`` (if the task is skipped). This corresponds to the conventions
used by `doit <http://pydoit.org/>`_. It is possible to execute only a subset
of tasks by explicitly naming them as arguments to the ``run`` subcommand.

Furthermore, the ``-c`` or ``--continue`` argument is accepted to preserve and
add to existing results.

clean
^^^^^

``psy clean [-h] [--taskdir TASKDIR] [task [task ...]]``

Clean one or more tasks passed as arguments to the command. This means
all files generated for the task will be deleted.

kill
^^^^

``psy kill [-h] [--taskdir TASKDIR] [task [task ...]]``

Kill all running and queued jobs of the tasks passed as arguments to the
command.

list
^^^^

``psy list [-h] [--taskdir TASKDIR]``

List the names of all tasks.

merge
^^^^^

``psy merge [-h] directory merged``

Merges all output files in *directory* into a single file *merged*. The filename
extension of *merged* is used to determine the input and output format.

.. _guide-cmd-new-task:

new-task
^^^^^^^^

``psy new-task [-h] [--taskdir TASKDIR] [--scheduler SCHEDULER] name``

Creates a new template task with the given name. It will use template
parameters for the given scheduler.

status
^^^^^^

``psy status [-h] [--taskdir TASKDIR] [-v] [task [task ...]]``

Prints the status of all tasks or the tasks given as arguments. Use the ``-v``
flag for a more verbose output including parameter assignments that have not
been evaluated yet.


.. _cmd-test:

test
^^^^

``psy test [-h] [--taskdir TASKDIR] [task [task ...]]``

Runs a single parameter assignment for each task, or for each task given as an
argument, to test that it executes successfully. It does not use the scheduler
defined in the task file to submit jobs, but runs them directly.


Diagnosing problems
-------------------

When some of the ``psy run`` jobs fail, it is important to find out what caused
the failure. Each job creates a log file that might contain useful information
(for example, the exception that halted the job). These log files are written
to the task directory in the workdir, which by default is ``psy-work/<task
name>``. Each log file is named using the scheme
``<task name>:<job name>.log``, where the job name is one of ``split``,
``process:<number>``, or ``merge`` for the `DistributeBackend`.
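
To locate a log file quickly, the naming scheme can be turned into a small
helper. This is a sketch based only on the scheme described above: ``log_path``
is a hypothetical helper (not part of Psyrun), ``mytask`` is an example task
name, and the job number ``0`` is an assumption about how processing jobs are
numbered.

.. code-block:: python

    import os.path


    def log_path(task, job, workdir='psy-work'):
        """Build the expected log file path for one job of a task."""
        return os.path.join(workdir, task, '{}:{}.log'.format(task, job))


    # Candidate log files to inspect for a task named 'mytask'.
    paths = [log_path('mytask', job) for job in ('split', 'process:0', 'merge')]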


.. _guide-task-files:

Writing task-files
------------------

Each task is defined in a Python file with the name ``task_<name>.py``. That
means any valid Python code can be used in the definition of the task. You can
create template task files with the :ref:`new-task command
<guide-cmd-new-task>`. There are certain module level variables that have
a special meaning. The two most important are ``pspace``, defining the parameter
space to explore, and ``execute``, defining the function to evaluate a single
parameter assignment.

Also consider setting ``store`` to either `H5Store` or `NpzStore`. This
requires additional dependencies to be installed and imposes some limitations
on the data, but can improve performance. See :ref:`guide-stores` for more
details.

It is likely that you also want to adjust *max_jobs* (maximum number of
processing jobs to submit to process the task) and *min_items* (minimum
number of items to process with each processing job). If each parameter
assignment is evaluated quickly, it can be beneficial to increase *min_items*
to avoid the overhead of starting a lot of jobs. By default *max_jobs* is set
to 100, as on high performance clusters there might be a penalty or limit on the
number of jobs one can submit at a time.
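
The interplay of the two settings can be estimated with a few lines of
arithmetic. The sketch below assumes that items are split into equally sized
batches, that every batch has at least *min_items* items, and that at most
*max_jobs* batches are created; the exact rounding Psyrun uses may differ. The
function name ``plan_batches`` and the default of 1 for ``min_items`` are
assumptions made for this example.

.. code-block:: python

    def ceil_div(a, b):
        """Integer division rounding up."""
        return -(-a // b)


    def plan_batches(n_items, max_jobs=100, min_items=1):
        """Return (number of jobs, items per job) for equally sized batches."""
        # Each job gets at least min_items, and there are at most max_jobs jobs.
        items_per_job = max(min_items, ceil_div(n_items, max_jobs))
        n_jobs = ceil_div(n_items, items_per_job)
        return n_jobs, items_per_job


    print(plan_batches(5000, max_jobs=100, min_items=10))  # (100, 50)
    print(plan_batches(500, max_jobs=100, min_items=10))   # (50, 10)

In the first call *max_jobs* is the limiting factor, in the second *min_items*
is: only 50 jobs are started because each one has to process at least 10 items.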

If you want to run a task on a high performance cluster, it will be necessary
to set *scheduler* to the appropriate scheduler. Otherwise, jobs will be run
serially and immediately. There is also a *scheduler_args* variable, which
defines a dictionary of additional arguments required by the
scheduler. These depend on the scheduler used; see :ref:`guide-schedulers`
for more details.

High performance clusters might offer different file systems with different
access speeds. In that case you might want to set *workdir*, the directory
where intermediary files are written to, and *resultfile*, the file results
are written to, to appropriate locations.

By default Psyrun will split the parameter space into equally sized batches. If
parameter assignments vary in their execution time, it might be beneficial
to use a load balancing approach by setting *backend* to
`LoadBalancingBackend`. See :ref:`guide-backends` for more details.

All special variables are documented as part of the `psyrun.tasks.Config`
documentation.

This is what a task file to run on `Sharcnet <https://www.sharcnet.ca>`_
might look like::

    import numpy as np
    from psyrun import Param, Sqsub
    from psyrun.store.npz import NpzStore


    pspace = Param(radius=np.linspace(0., 1., 100)) * Param(trial=np.arange(50))
    min_items = 10
    store = NpzStore()
    workdir = '/work/user/mc_circle_area'
    scheduler = Sqsub(workdir)
    scheduler_args = {
        'timelimit': '15m',
        'memory': '1G',
    }


    def execute(radius, trial):
        n = 100
        x = np.random.random((n, 2)) * 2. - 1.
        return {'a_frac': np.mean(np.linalg.norm(x, axis=1) < radius), 'x': x}


.. _guide-stores:

Data stores
-----------

Psyrun can use different “data stores” to persist data to the hard drive. It
provides three stores with different advantages and disadvantages described in
the following. It is possible to use `AutodetectStore` to select the appropriate
store based on the filename extension.

Note that Psyrun almost always needs to merge multiple data files and thus the
performance of appending to an existing data file can be quite relevant.
At the moment, the only store that supports efficient appending is the `H5Store`.
If you can use it, it should probably be your first choice.
The `NpzStore` should be the second choice. The default `PickleStore` is the
least efficient choice, but provides support for the widest range of data types
and has no additional dependencies.

To use other data formats than the three provided ones, implement the `Store`
interface and provide it as an
`entry point <https://setuptools.readthedocs.io/en/latest/setuptools.html#dynamic-discovery-of-services-and-plugins>`_
in the group ``psyrun.stores``. For example, add the following to the ``setup``
call in your store's ``setup.py`` for a store providing the ``.ext`` format::

    entry_points={
        'psyrun.stores': ['.ext = pkg.name:ClassName'],
    }
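
What such a store class has to do can be sketched without Psyrun. The class
below is duck-typed and does not derive from `Store`; the attribute ``ext`` and
the method names ``save``, ``load``, and ``append`` are assumptions about what
the interface needs, so check the `Store` documentation for the actual
signatures before implementing a real store. JSON is used only because it ships
with Python.

.. code-block:: python

    import json
    import os
    import tempfile


    class JsonStore(object):
        """Duck-typed sketch of a store; the method names are assumptions."""

        ext = '.json'

        def save(self, filename, data):
            with open(filename, 'w') as f:
                json.dump(data, f)

        def load(self, filename):
            with open(filename, 'r') as f:
                return json.load(f)

        def append(self, filename, data):
            # Inefficient append: the whole file is read and rewritten.
            merged = self.load(filename)
            for key, values in data.items():
                merged.setdefault(key, []).extend(values)
            self.save(filename, merged)


    store = JsonStore()
    path = os.path.join(tempfile.mkdtemp(), 'result' + store.ext)
    store.save(path, {'x': [1, 2]})
    store.append(path, {'x': [3]})
    loaded = store.load(path)

The ``append`` method shows why appending can dominate the run time when many
result files have to be merged: without format support for in-place appends,
every append costs a full read and a full write.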

pickle
^^^^^^

The `PickleStore` is the default because it has no additional dependencies and
supports all data types that can be pickled. It can be slow with large data
files and appending requires the complete file to be loaded and rewritten.

NumPy NPZ
^^^^^^^^^

The `NpzStore` requires `NumPy <http://www.numpy.org/>`_ and is more efficient
than the `PickleStore`. However, appending data still requires reading and
rewriting the complete data file.

HDF5
^^^^

The `H5Store` requires `PyTables <http://www.pytables.org/>`_ and provides
efficient appends to existing data files. However, it only supports numeric
data types.


.. _guide-backends:

Backends
--------

Backends determine how work is distributed to a number of jobs. By default
Psyrun will use the `DistributeBackend`, which uses one job to split the
parameter space into equally sized batches and processes them with up to
*max_jobs* processing jobs (each batch will have at least *min_items* items
to process). After all processing jobs are finished, all the results will be
merged into a single file by another job. This is similar to `map-reduce
processing <https://de.wikipedia.org/wiki/MapReduce>`_.

If evaluating different parameter sets can take different amounts of time,
some jobs might finish very early while others take a long
time, so the computational resources are not used optimally. In that case it
can be beneficial to use load balancing with the `LoadBalancingBackend`. This
backend will start *max_jobs* jobs, and each will fetch single items to process
until all items have been processed. Thus, if a job finishes one item early,
it just fetches the next and continues. This gives a better use of the
computational resources, but also has some disadvantages: it requires loading
specific single rows from an input file, which is only supported efficiently by
the `H5Store`. Also, the order in which the results are written becomes
non-deterministic, which makes it computationally more expensive to determine
which parameter assignments have to be rerun if some of them failed to execute.
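
The difference between the two strategies can be made concrete with a small
simulation. This is a simplified model, not Psyrun code: it ignores the
overhead of splitting, fetching, and merging, and the item durations are
invented for the example.

.. code-block:: python

    import heapq


    def makespan_static(durations, n_jobs):
        """Finish time when items are split into contiguous equal batches."""
        size = -(-len(durations) // n_jobs)  # ceiling division
        batches = [
            durations[i:i + size] for i in range(0, len(durations), size)]
        return max(sum(batch) for batch in batches)


    def makespan_balanced(durations, n_jobs):
        """Finish time when each idle job fetches the next single item."""
        finish_times = [0] * n_jobs
        heapq.heapify(finish_times)
        for duration in durations:
            # The job that becomes idle first takes the next item.
            earliest = heapq.heappop(finish_times)
            heapq.heappush(finish_times, earliest + duration)
        return max(finish_times)


    durations = [8, 8, 1, 1, 1, 1, 1, 1]  # two slow items, six fast ones
    static = makespan_static(durations, n_jobs=4)      # 16
    balanced = makespan_balanced(durations, n_jobs=4)  # 8

With static batches, both slow items end up in the same job, which then takes
16 time units while the other jobs idle after 2. With load balancing, the slow
items are picked up by different jobs and everything finishes after 8 units.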


.. _guide-schedulers:

Schedulers
----------

Schedulers define how Psyrun submits individual jobs. The default is
`ImmediateRun`, which is not really a scheduler because it just immediately runs
any job on submission. Psyrun comes with support for the
`Slurm Workload Manager <https://slurm.schedmd.com/>`_ (used by `Compute Canada
<http://computecanada.ca/>`_'s new clusters) and Sharcnet's ``sqsub`` scheduler.
For other schedulers it is necessary to write some custom code.


Slurm scheduler (e.g., Compute Canada)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The `Slurm` scheduler uses ``sbatch`` to submit jobs. It accepts the following
*scheduler_args* (corresponding ``sbatch`` command line options are given in
parentheses):

* *timelimit* (``-t``): String stating the execution time limit for
  each individual job.
* *memory* (``--mem``): String stating the memory limit per node.
* *memory_per_cpu* (``--mem-per-cpu``): String stating the minimum memory
  required per CPU.
* *n_cpus* (``-c``): Number of CPU cores to allocate for each task.
* *n_nodes* (``-N``): Number of nodes to allocate for each individual
  job.
* *cores-per-socket* (``--cores-per-socket``): Minimum number of cores per
  socket.
* *sockets-per-node* (``--sockets-per-node``): Minimum number of sockets per
  node.

For more details see `the sbatch help <https://slurm.schedmd.com/sbatch.html>`_.
Not all options that can be passed to ``sbatch`` are currently supported.
Please `open a new issue <https://github.com/jgosmann/psyrun/issues/new>`_ if
you require support for further options.

Instead of a fixed value, you can also assign a function accepting the job
name as single argument to `Slurm` scheduler arguments. The function will be
called with the job name to determine the value of the argument.
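
As an illustration, a task file could give processing jobs a longer time limit
than the split and merge jobs. The sketch below assumes that the job name
passed to the function contains ``process`` for processing jobs, as in the log
file naming scheme; the function name ``timelimit_for`` and the concrete limits
are made up for this example.

.. code-block:: python

    def timelimit_for(name):
        """Return a time limit string depending on the job name."""
        # Assumption: processing jobs have 'process' in their name; split
        # and merge jobs are treated as short.
        return '2:00:00' if 'process' in name else '0:10:00'


    scheduler_args = {
        'timelimit': timelimit_for,  # evaluated per job
        'memory': '1G',              # fixed value for all jobs
    }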


Sqsub scheduler (Sharcnet)
^^^^^^^^^^^^^^^^^^^^^^^^^^

The `Sqsub` scheduler uses ``sqsub`` to submit jobs. It accepts the following
*scheduler_args* (corresponding ``sqsub`` command line options are given in
parentheses):

* *timelimit* (required, ``-r``): String stating the execution time limit for
  each individual job.
* *n_cpus* (optional, default 1, ``-n``): Number of CPU cores to allocate for
  each individual job.
* *n_nodes* (optional, ``-N``): Number of nodes to allocate for each individual
  job.
* *memory* (required, ``--mpp``): String stating the memory limit for each
  individual job.

For more details see the ``sqsub`` help.

Instead of a fixed value, you can also assign a function accepting the job
name as single argument to `Sqsub` scheduler arguments. The function will be
called with the job name to determine the value of the argument.


Interfacing other schedulers
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

To support other schedulers, it is necessary to implement the `Scheduler`
interface. The central function is `Scheduler.submit`, which will be invoked to
submit a job. Furthermore, functions to obtain the status
(`Scheduler.get_status`), return running and queued jobs
(`Scheduler.get_jobs`), and kill jobs (`Scheduler.kill`) are required. It can be
instructive to read the `Sqsub` source code before
implementing a scheduler.


Recipes
-------

This section collects code examples for common tasks.


Convert results to a Pandas data frame
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Note that this recipe requires all single parameter values and outputs to be
scalars as Pandas does not support multi-dimensional data.

.. code-block:: python

    import pandas as pd
    import psyrun

    store = psyrun.store.PickleStore()  # insert appropriate store here
    df = pd.DataFrame(store.load('path/to/datafile.pkl'))


Random search
^^^^^^^^^^^^^

`Random search can be more efficient than a grid search.
<http://jmlr.csail.mit.edu/papers/volume13/bergstra12a/bergstra12a.pdf>`_
A random search can be easily done with Psyrun by sampling the parameter values
from random distributions. It is best to use a seed, so that results are
reproducible. Change the seed to obtain another set of evaluations for different
parameters.

.. code-block:: python

    import numpy as np
    from psyrun import Param

    seed = 42
    rng = np.random.RandomState(seed)
    n_evaluations = 100

    pspace = Param(
        p1=rng.uniform(-1., 1., n_evaluations),
        p2=rng.normal(0., 1., n_evaluations))


Utilize multiple GPUs with load balancing
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The following shows a task file that runs multiple instances of PyOpenCL code
in a load balancing fashion on multiple GPUs.

.. code-block:: python

   from psyrun.backend import LoadBalancingBackend
   import pyopencl

   # Define your parameter space
   # pspace = ?

   backend = LoadBalancingBackend
   pool_size = 4  # Adjust to the number of GPUs you have
   max_jobs = 1  # Single job that will start multiple parallel processes.

   exclude_from_result = ['cl_context']  # The context cannot be saved to disk.

   def setup(proc_id):
       # You might need to adjust the 0 on your system to whatever index
       # your GPU device group has.
       return {
           'cl_context': pyopencl.create_some_context(answers=[0, proc_id]),
       }


   def execute(cl_context, **kwargs):
       result = {}
       # Do your GPU calculations using cl_context here
       return result