Special Features
================

Communication protocols
-----------------------

Two internal communication protocols are used in |Supvisors|.

XML-RPC publication
~~~~~~~~~~~~~~~~~~~

The main protocol implemented in |Supvisors| is based on the XML-RPC protocol provided by |Supervisor|. It is used to
share the local events with the other |Supvisors| instances.

The XML-RPC protocol was originally discarded because it easily led to deadlocks when involving requests to multiple
|Supervisor| instances. So a first implementation was based on a PyZmq PUB-SUB pattern. It was then replaced
by a custom implementation to limit the mandatory dependencies and to have better control over the underlying threads
and sockets. In both cases, the events were sent over a TCP socket and posted sequentially to the local |Supervisor|
using a ``supervisor.sendRemoteCommEvent`` XML-RPC.

Finally, with a proper understanding of the limitations brought by the XML-RPC implementation and its non-thread-safe
nature, the |Supvisors| design has been simplified so that the local events and requests are processed in threads
dedicated to each |Supervisor| proxy.

UDP Multicast
~~~~~~~~~~~~~

The second protocol implemented in |Supvisors| is based on **UDP Multicast**. It relies on the following options
in the ``[supvisors]`` section in the |Supervisor| configuration file:

    * ``multicast_group``;
    * ``multicast_interface``;
    * ``multicast_ttl``.

With this protocol, the |Supvisors| instances may be unknown at start-up and are discovered on-the-fly.
The UDP Multicast group is used to exchange ticks. Upon reception of a tick coming from an unknown |Supvisors| instance,
the local |Supvisors| instance adds the remote |Supvisors| instance into its internal model, and starts to exchange
events with it.
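
The discovery step can be pictured with a minimal sketch. The ``InstanceModel`` class and its ``on_tick`` method
are hypothetical names used for illustration only; they do not reflect the actual |Supvisors| implementation,
and the network layer is deliberately left out.

.. code-block:: python

    from dataclasses import dataclass, field


    @dataclass
    class InstanceModel:
        """Hypothetical stand-in for the internal model of known instances."""

        # identifier -> number of ticks received so far
        ticks: dict = field(default_factory=dict)

        def on_tick(self, identifier: str) -> bool:
            """Record a tick and return True when the emitter was unknown."""
            discovered = identifier not in self.ticks
            self.ticks[identifier] = self.ticks.get(identifier, 0) + 1
            return discovered


    model = InstanceModel()
    first = model.on_tick("console-1")   # unknown emitter: added to the model
    second = model.on_tick("console-1")  # known emitter: only the counter moves
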

.. note::

    Although it has been considered at some point, the idea of having |Supvisors| working only in UDP Multicast,
    without the TCP Publish / Subscribe, has been discarded. |Supvisors| cannot afford to lose events or to receive
    them in an inappropriate sequence.

.. _synchronizing:

Synchronizing |Supvisors| instances
-----------------------------------

The overall design of |Supvisors| is to add a |Supvisors| plugin into every |Supervisor| instance, and to make them
share the events generated by |Supervisor| with each other.

To that end, a communication protocol needs to be put in place between all |Supvisors| instances.
Given the objectives of |Supvisors|, a polling mechanism doesn't fit. All |Supervisor| events have to be known
and processed, so an event-driven protocol is naturally considered.

When the |Supvisors| extension is started within |Supervisor|, the |Supvisors| instance establishes a comprehensive
picture of the network, as perceived by the local host.

Then the |Supvisors| instance publishes the events received from |Supervisor|, especially the ``TICK`` events,
that are triggered every 5 seconds. The publication is performed towards all |Supvisors| instances discovered or
declared in the ``supvisors_list`` option of the ``[supvisors]`` section in the |Supervisor| configuration file,
with the exception of ``ISOLATED`` instances.

The reception of the first ``TICK`` event opens a *Handshake* protocol between the receiver and the emitter.

.. _handshake:

Handshake protocol
~~~~~~~~~~~~~~~~~~

When the local |Supvisors| instance is started, all |Supvisors| instances are internally declared in a ``STOPPED``
state. When the first ``TICK`` event is received from a remote |Supvisors| instance, a handshake is performed
between the local |Supvisors| instance and the remote |Supvisors| instance.

First, the local |Supvisors| instance sets the remote |Supvisors| instance state to ``CHECKING``.

Then the local |Supvisors| instance retrieves the following information from the remote |Supvisors| instance using XML-RPC:

    * the Network picture using ``supvisors.get_network_info()``,
    * the perception of the local |Supvisors| instance using ``supvisors.get_instance_info(local_identifier)``,
    * the configured strategies using ``supvisors.get_strategies()``,
    * the perception of |Supvisors| state and modes using ``supvisors.get_instance_state_modes()``.

If any inconsistency is detected when comparing the respective configured strategies, the handshake will fail
and the remote |Supvisors| instance will be marked as ``ISOLATED`` by the local |Supvisors| instance.

At this stage, there are two possibilities:

    * the local |Supvisors| instance is seen as ``ISOLATED`` by the remote instance:

        + the remote |Supvisors| instance status is then reciprocally set to ``ISOLATED``;

    * the local |Supvisors| instance is NOT seen as ``ISOLATED`` by the remote instance:

        + a ``supvisors.get_all_local_process_info()`` XML-RPC is requested to the remote instance,
        + the processes information is loaded into the internal data structure,
        + the remote |Supvisors| instance status is set to ``CHECKED``.

As soon as the |Supvisors| instance is ``CHECKED``, all other events are shared.
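
The handshake outcome can be summarized by the following sketch. It only assumes that ``proxy`` exposes the
XML-RPCs listed above under a ``supvisors`` namespace. The payload shapes (a dictionary with a ``statename``
entry, a dictionary of strategies) and the ``fake_proxy`` helper are assumptions made for illustration, not
the actual |Supvisors| data structures.

.. code-block:: python

    from types import SimpleNamespace


    def handshake(proxy, local_identifier: str, local_strategies: dict) -> str:
        """Return the state given to the remote instance after the handshake."""
        rpc = proxy.supvisors
        rpc.get_network_info()  # network picture, not used further in this sketch
        remote_view = rpc.get_instance_info(local_identifier)
        remote_strategies = rpc.get_strategies()
        rpc.get_instance_state_modes()
        if remote_strategies != local_strategies:
            return "ISOLATED"  # inconsistent configuration: the handshake fails
        # assumed payload shape: a dict with a 'statename' entry
        if remote_view["statename"] == "ISOLATED":
            return "ISOLATED"  # reciprocal isolation
        rpc.get_all_local_process_info()  # would be loaded into the local model
        return "CHECKED"


    def fake_proxy(seen_state: str, strategies: dict):
        """Build an in-memory stand-in for an XML-RPC proxy (illustration only)."""
        return SimpleNamespace(supvisors=SimpleNamespace(
            get_network_info=lambda: {},
            get_instance_info=lambda identifier: {"statename": seen_state},
            get_strategies=lambda: strategies,
            get_instance_state_modes=lambda: {},
            get_all_local_process_info=lambda: [],
        ))


    strategies = {"starting_strategy": "CONFIG"}
    checked = handshake(fake_proxy("RUNNING", strategies), "local", strategies)
    isolated = handshake(fake_proxy("ISOLATED", strategies), "local", strategies)
    mismatch = handshake(fake_proxy("RUNNING", {}), "local", strategies)
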

Principles of Synchronization
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The ``SYNCHRONIZATION`` state is reached as soon as the local |Supvisors| instance has performed a handshake against
itself. Although it may sound weird, it is a common way to load a |Supvisors| instance context into the internal
model.

The ``SYNCHRONIZATION`` state of |Supvisors| is used as a synchronization phase so that all |Supvisors| instances
are mutually aware of each other.

The following options defined in the :ref:`supvisors_section` of the |Supervisor| configuration file are particularly
used for synchronizing multiple instances of |Supervisor|:

    * ``supvisors_list``;
    * ``synchro_options``;
    * ``synchro_timeout``;
    * ``core_identifiers``;
    * ``auto_fence``.

What happens next will depend on the conditions selected in the ``synchro_options`` option.

``STRICT`` option
*****************

When the ``STRICT`` option is selected, the synchronization is complete when all |Supvisors| instances declared
in the ``supvisors_list`` option are marked as ``RUNNING``.
This excludes any |Supvisors| instance that has been added to |Supvisors| in discovery mode.

This option is automatically disabled if the ``supvisors_list`` option is not set or empty.

This option prevails over the ``LIST`` and ``USER`` options if combined with them.

``LIST`` option
***************

When the ``LIST`` option is selected, the synchronization is complete when all known |Supvisors| instances are marked
as ``RUNNING``.
This includes the |Supvisors| instances declared in the ``supvisors_list`` option **AND** the |Supvisors| instances
that have been added to |Supvisors| in discovery mode.

This option prevails over the ``USER`` option if combined with it.

.. attention::

    When used together with the discovery mode, the synchronization may be completed very quickly in the event where
    only the local |Supvisors| instance has been discovered.

``TIMEOUT`` option
******************

It may happen that some declared |Supvisors| instances do not publish (very late starting, no starting at all,
system down, network down, etc).

When the ``TIMEOUT`` option is selected, each |Supvisors| instance waits for ``synchro_timeout`` seconds
to give a chance to all other instances to publish. When this delay is exceeded, all the |Supvisors| instances
that are **not** identified as ``RUNNING`` or ``ISOLATED`` are set to:

    * ``STOPPED`` if `Auto-Fencing`_ is **not** activated;
    * ``ISOLATED`` if `Auto-Fencing`_ is activated.

This option prevails over all other ``synchro_options`` options if combined with them.

``CORE`` option
***************

Another possibility is when it is predictable that some |Supvisors| instances may be started later.
For example, the pool of nodes may include servers that will always be started from the very beginning and consoles
that may be started only on demand.

In this case, it would be a pity to always wait for ``synchro_timeout`` seconds.
That's why the ``core_identifiers`` option has been introduced so that the synchronization phase is considered
completed when a subset of the |Supvisors| instances declared in ``supvisors_list`` are ``RUNNING``.

This option is automatically disabled if the ``core_identifiers`` option is not set or empty.

This option prevails over ``LIST`` and ``USER`` options if combined with them.

``USER`` option
***************

This option is useful in a context where |Supvisors| is running in a system made up of many nodes that may be started
on a random basis and where core |Supvisors| instances cannot be easily identified.

When the ``USER`` option is selected, it allows the user to put an end to the synchronization phase when the set
of running |Supvisors| instances is suitable to the user.

This action can be performed through the |Supvisors| ``end_sync`` XML-RPC (via code, ``supervisorctl`` or
the |Supvisors| Web UI).
This XML-RPC has an optional parameter that allows the user to force the selection of the |Supvisors| *Master* instance.
If not set, the default election mechanism applies.
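
The way the ``synchro_options`` conditions combine can be sketched as follows. This is an interpretation, not the
actual implementation: "prevails" is read here as "is evaluated first", the relative order of ``STRICT`` and
``CORE`` is an arbitrary choice of the sketch, and the ``synchro_end`` function with all its parameters is
hypothetical.

.. code-block:: python

    def synchro_end(options, declared, core, known, running, elapsed, timeout,
                    user_ended=False):
        """Return the option that ends the synchronization, or None."""
        running = set(running)
        if "TIMEOUT" in options and elapsed > timeout:
            return "TIMEOUT"
        # STRICT is disabled when supvisors_list is not set or empty
        if "STRICT" in options and declared and set(declared) <= running:
            return "STRICT"
        # CORE is disabled when core_identifiers is not set or empty
        if "CORE" in options and core and set(core) <= running:
            return "CORE"
        # LIST covers the declared AND the discovered instances
        if "LIST" in options and set(known) <= running:
            return "LIST"
        if "USER" in options and user_ended:
            return "USER"
        return None


    pending = synchro_end({"STRICT", "TIMEOUT"}, ["a", "b"], [], ["a", "b"],
                          ["a"], elapsed=3, timeout=10)
    expired = synchro_end({"STRICT", "TIMEOUT"}, ["a", "b"], [], ["a", "b"],
                          ["a"], elapsed=11, timeout=10)
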


Principles of Master election
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Whatever the number of available |Supvisors| instances when entering the ``ELECTION`` state, |Supvisors| will elect
a *Master* instance among the active |Supvisors| instances.

The |Supvisors| *Master* instance is the only |Supvisors| instance that is allowed to trigger automatic behaviour,
such as:

    * driving the |Supvisors| global state and modes,
    * initial distribution of applications,
    * conciliation in case of process conflicts,
    * re-distribution of applications in case of node loss or process failure.

A few considerations about the |Supvisors| *Master* instance:

    * the *Master* must exist,
    * the *Master* must be unique,
    * the *Master* must be seen as ``RUNNING`` by the local |Supvisors| instance,
    * the *Master* must be seen as ``RUNNING`` by all the remote |Supvisors| instances seen as ``RUNNING``
      by the local |Supvisors| instance.

Any violation of these rules will bring back |Supvisors| into the ``ELECTION`` state.

Selecting a *Master* requires some stability within |Supvisors| in order to avoid an inconsistent situation
(*split-brain*), that will be solved at some point at |Supvisors| level, but that may require a lot of unnecessary
conciliation before the *Master* selection converges to a single candidate.

A typical example is when dozens of |Supvisors| instances are started at the same time and configured with a short
``TIMEOUT``. Every |Supvisors| instance may then exit the ``SYNCHRONIZATION`` state with a subset of active |Supvisors|
instances that can differ a lot from one instance to the other. A consequence may be multiple |Supvisors| *Master*
instances being elected, each one trying to distribute the same set of applications until the split-brain is detected
and solved over and over until convergence.

That's why the *Master* election does not take place until all |Supvisors| instances have the same view of each other,
i.e.:

    * all |Supvisors| instances must be seen as either ``STOPPED``, ``ISOLATED`` or ``RUNNING``,
    * in other words, any |Supvisors| instance in a transitional state (``CHECKING``, ``CHECKED`` or ``FAILED``)
      must be waited for,
    * the subset of ``RUNNING`` |Supvisors| instances must be identical in all |Supvisors| instances.

This logic is based on the local state and modes shared between all |Supvisors| instances.
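
The stability condition can be expressed as a small predicate. In this sketch, ``views`` is a hypothetical
structure mapping each reporting |Supvisors| instance to its own perception of all instances; the real
|Supvisors| state and modes payload is richer than that.

.. code-block:: python

    STABLE_STATES = {"STOPPED", "ISOLATED", "RUNNING"}


    def election_ready(views: dict) -> bool:
        """views maps a reporting instance to its view {identifier: state}."""
        running_sets = set()
        for view in views.values():
            if any(state not in STABLE_STATES for state in view.values()):
                return False  # a transitional state must be waited for
            running_sets.add(frozenset(
                identifier for identifier, state in view.items()
                if state == "RUNNING"))
        # every reporting instance must see the same RUNNING subset
        return len(running_sets) <= 1


    stable = {"a": "RUNNING", "b": "RUNNING", "c": "STOPPED"}
    ready = election_ready({"a": stable, "b": stable})
    waiting = election_ready({"a": stable, "b": {**stable, "c": "CHECKING"}})
    diverging = election_ready({"a": stable, "b": {**stable, "a": "ISOLATED"}})
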

By default, the |Supvisors| *Master* instance is the |Supvisors| instance having the smallest nick name among all
the active |Supvisors| instances, unless the option ``core_identifiers`` is used. In the latter case, candidates
are taken from this list in priority.
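
A minimal sketch of this default rule is given below. The ``elect_master`` function is hypothetical, and picking
the smallest name among the running core candidates is an assumption: the text only states that core candidates
are considered first.

.. code-block:: python

    def elect_master(running, core_identifiers=()):
        """Return the elected Master among the running instances, or None."""
        core_candidates = set(running) & set(core_identifiers)
        # core candidates have priority; otherwise any running instance is eligible
        candidates = core_candidates or set(running)
        return min(candidates) if candidates else None


    default_master = elect_master({"node-b", "node-a", "node-c"})
    core_master = elect_master({"node-b", "node-a", "node-c"}, ["node-c", "node-d"])
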

Once done, |Supvisors| transitions to the ``DISTRIBUTION`` state to automatically start the applications.

.. important:: *About late Supvisors instances*

    Whenever a |Supvisors| instance is started while the others are already in ``OPERATION``, all instances will go
    back temporarily to the ``ELECTION`` state, because the new instance has no *Master*, which breaks the first rule.

    During the hand-shake with the other |Supvisors| instances, the new |Supvisors| instance gets their state and modes,
    including the *Master* identification. In the ``ELECTION`` state, the new |Supvisors| instance adopts the existing
    *Master* too. When the state and modes of all |Supvisors| instances are consistent, the ``ELECTION`` state is exited.


.. _auto_fencing:

Auto-Fencing
------------

Auto-fencing is applied when the ``auto_fence`` option of the :ref:`supvisors_section` is set.
It takes place when one of the |Supvisors| instances is seen as inactive (crash, system power down, network failure)
from the other |Supvisors| instances.

In this case, the running |Supvisors| instances disconnect the corresponding URL from their subscription socket.
The |Supvisors| instance is marked as ``ISOLATED`` and, in accordance with the program rules defined,
|Supvisors| may restart elsewhere the processes that were possibly running in that |Supvisors| instance.

If the incriminated |Supvisors| instance is restarted, the isolation doesn't prevent the new |Supvisors| instance
from receiving events from the other instances that have isolated it.
Indeed, it has not been considered so far to filter the subscribers from the *Publish* side.

That's why the hand-shake is performed in :ref:`synchronizing`.
Each newly arrived |Supvisors| instance asks the others whether it has been previously isolated before taking
into account the incoming events.

In the case of a network failure, the same mechanism is of course applied on the other side.
This is the premise of a *split-brain syndrome*, as it leads to two separate and identical sets of applications.

If the network failure is fixed, both sets of |Supvisors| are still running but do not communicate with each other.

.. attention::

    |Supvisors| does NOT isolate the nodes at the Operating System level, so that when the incriminated nodes
    become active again, it is still possible to perform network requests between all nodes, even though the
    |Supvisors| instances no longer communicate.

    Similarly, it is outside the scope of |Supvisors| to isolate the communication at application level.
    It is the user's responsibility to isolate their applications.


.. _extra_arguments:

Extra Arguments
----------------

|Supervisor| users have requested the possibility to add extra arguments to the command line of a program without
having to update and reload the program configuration in |Supervisor|.

    `#1023 - Pass arguments to program when starting a job? <https://github.com/Supervisor/supervisor/issues/1023>`_

Indeed, the applicative context is evolving at runtime and it may be quite useful to give some information
to the new process (options, path, URL of a server, URL of a display, etc), especially when dealing with
distributed applications.

|Supvisors| introduces new XML-RPCs that are capable of taking into account extra arguments that are passed
to the command line before the process is started:

   * ``supvisors.start_args``: start a process in the local |Supvisors| instance;
   * ``supvisors.start_process``: start a process using a starting strategy.
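
As an illustration, the sketch below wraps a call to ``supvisors.start_args``. The parameter order (namespec
first, then the extra arguments as a single string) is an assumption to be checked against the XML-RPC API
documentation, and the namespec, the arguments and the stand-in proxy are made up so that the example runs
without any server.

.. code-block:: python

    from types import SimpleNamespace


    def start_with_extra_args(proxy, namespec: str, extra_args: str):
        """Start a process in the local instance with extra arguments.

        Assumption: start_args takes the namespec, then the extra arguments.
        """
        return proxy.supvisors.start_args(namespec, extra_args)


    # In-memory stand-in so that the sketch runs without a server. With a real
    # instance, proxy would rather be an xmlrpc.client.ServerProxy object.
    calls = []


    def fake_start_args(namespec, extra_args):
        calls.append((namespec, extra_args))
        return True


    stub = SimpleNamespace(supvisors=SimpleNamespace(start_args=fake_start_args))
    result = start_with_extra_args(stub, "my_app:my_program", "--display :0")
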

.. note::

    The extra arguments of the program are shared by all |Supvisors| instances.
    Once used, they are published through a |Supvisors| internal event and are stored directly into the |Supervisor|
    internal configuration of the programs.

    In other words, consider 2 |Supvisors| instances A and B, and a process that is started in |Supvisors| instance A
    with extra arguments and configured to restart on node crash (refer to `Running Failure strategy`_).
    If the |Supvisors| instance A crashes (or simply becomes unreachable), the process will be restarted in the
    |Supvisors| instance B with the same extra arguments.

.. attention::

    A limitation however: the extra arguments are reset each time a new |Supvisors| instance connects to the other ones,
    either because it has started later or because it has been disconnected for a while due to a network issue.


.. _starting_strategy:

Starting strategy
-----------------

|Supvisors| provides a means to start a process without telling explicitly where it has to be started,
and in accordance with the rules defined for this program.


Choosing a |Supvisors| instance
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The following rules are applicable whatever the chosen strategy:

    * the process must not be already in a *running* state in a broad sense, i.e. ``RUNNING``, ``STARTING``
      or ``BACKOFF``;
    * the process must be known to the |Supervisor| of the targeted |Supvisors| instance;
    * the related program must be enabled in the targeted |Supvisors| instance;
    * the targeted |Supvisors| instance must be ``RUNNING``;
    * the targeted |Supvisors| instance must be allowed in the ``identifiers`` rule of the process;
    * the *load* of the targeted node where multiple |Supvisors| instances may be running must not exceed 100%
      when adding the ``expected_loading`` of the program to be started.

The *load* of a |Supvisors| instance is defined as the sum of the ``expected_loading`` of each process running in this
|Supvisors| instance.

The *load* of a node is defined as the sum of the loads of the |Supvisors| instances that are running on this node.
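
These two definitions, together with the 100% limit, translate into a few lines of code. The data structures
below (``running`` mapping an instance to the ``expected_loading`` values of its running processes, ``node_of``
mapping an instance to its node) are hypothetical and only serve the illustration.

.. code-block:: python

    def instance_loads(running: dict) -> dict:
        """Sum the expected_loading of the processes running in each instance."""
        return {identifier: sum(loadings)
                for identifier, loadings in running.items()}


    def node_loads(loads: dict, node_of: dict) -> dict:
        """Sum the instance loads per node."""
        totals = {}
        for identifier, load in loads.items():
            node = node_of[identifier]
            totals[node] = totals.get(node, 0) + load
        return totals


    def can_host(node_total: int, expected_loading: int) -> bool:
        """The node load must not exceed 100% once the program is added."""
        return node_total + expected_loading <= 100


    running = {"inst1": [10, 25], "inst2": [40], "inst3": [5]}
    node_of = {"inst1": "nodeA", "inst2": "nodeA", "inst3": "nodeB"}
    loads = instance_loads(running)
    nodes = node_loads(loads, node_of)
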

When applying the ``CONFIG`` strategy, |Supvisors| chooses the first |Supvisors| instance available in the
``supvisors_list``.

.. attention::

    Using the ``CONFIG`` strategy when the discovery mode is activated will give non-deterministic results
    because the list of |Supvisors| instances is built on-the-fly, whereas the list of items in the ``supvisors_list``
    option is fixed.

When applying the ``LESS_LOADED`` strategy, |Supvisors| chooses the |Supvisors| instance in the ``supvisors_list``
having the lowest *load*.
The aim is to distribute the process load among the available |Supvisors| instances.

When applying the ``MOST_LOADED`` strategy, |Supvisors| chooses the |Supvisors| instance in the ``supvisors_list``
having the greatest *load*.
The aim is to maximize the loading of a |Supvisors| instance before starting to load another |Supvisors| instance.
This strategy is more interesting when the resources are limited.

When applying the ``LESS_LOADED_NODE`` strategy, |Supvisors| chooses the |Supvisors| instance in the ``supvisors_list``
having the lowest *load* on the node having the lowest *load*.

When applying the ``MOST_LOADED_NODE`` strategy, |Supvisors| chooses the |Supvisors| instance in the ``supvisors_list``
having the greatest *load* on the node having the greatest *load*.
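
The strategies described above can be compared with the following sketch. It assumes that ``candidates`` is
already filtered by the applicable rules and ordered as in ``supvisors_list``, that ``loads`` holds the load of
every instance, and that ``node_of`` maps each instance to its node. All these names are hypothetical, and the
tie-breaking (first match wins) is a choice of the sketch.

.. code-block:: python

    def choose_instance(strategy, candidates, loads, node_of):
        """Pick a Supvisors instance among the candidates, or None."""
        if not candidates:
            return None
        if strategy == "CONFIG":
            return candidates[0]
        if strategy in ("LESS_LOADED", "MOST_LOADED"):
            pick = min if strategy == "LESS_LOADED" else max
            return pick(candidates, key=loads.get)
        if strategy in ("LESS_LOADED_NODE", "MOST_LOADED_NODE"):
            pick = min if strategy == "LESS_LOADED_NODE" else max
            # the load of a node is the sum of the loads of its instances
            totals = {}
            for identifier, load in loads.items():
                node = node_of[identifier]
                totals[node] = totals.get(node, 0) + load
            # select the node first, then the instance on that node
            nodes = {node_of[identifier] for identifier in candidates}
            node = pick(sorted(nodes), key=totals.get)
            on_node = [c for c in candidates if node_of[c] == node]
            return pick(on_node, key=loads.get)
        raise ValueError(f"unsupported strategy: {strategy}")


    candidates = ["a1", "a2", "b1"]
    loads = {"a1": 50, "a2": 10, "b1": 30}
    node_of = {"a1": "A", "a2": "A", "b1": "B"}
    less_loaded = choose_instance("LESS_LOADED", candidates, loads, node_of)
    less_loaded_node = choose_instance("LESS_LOADED_NODE", candidates, loads, node_of)

Note how ``LESS_LOADED`` selects ``a2`` whereas ``LESS_LOADED_NODE`` selects ``b1``: node ``A`` carries a total
load of 60 against 30 for node ``B``.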

.. note::

    When a single |Supvisors| instance is running on each node, ``LESS_LOADED_NODE`` and ``MOST_LOADED_NODE`` are
    strictly equivalent to ``LESS_LOADED`` and ``MOST_LOADED``.

.. attention::

    The use of ``LESS_LOADED_NODE`` and ``MOST_LOADED_NODE`` relies on |Supvisors| being able to detect
    that |Supvisors| instances are running on the same node. To that end, it uses the hardware address returned
    by the ``uuid.getnode()`` function. |br|
    In the event where the |Supvisors| instance is running in a Docker container, particular care must be taken
    to ensure that the Docker container is configured so that the ``uuid.getnode()`` function returns the right value.
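
To check what ``uuid.getnode()`` returns in a given environment, a sketch such as the following may help. The
``group_by_node`` helper and the ``reported`` mapping are hypothetical; they only illustrate how instances
reporting the same hardware address would be grouped on one node.

.. code-block:: python

    import uuid
    from collections import defaultdict


    def group_by_node(node_ids: dict) -> dict:
        """Group instance identifiers by the hardware address they report."""
        groups = defaultdict(list)
        for identifier, node_id in node_ids.items():
            groups[node_id].append(identifier)
        return dict(groups)


    local_node_id = uuid.getnode()  # hardware address as a 48-bit integer
    reported = {
        "inst1": local_node_id,
        "inst2": local_node_id,
        "inst3": local_node_id ^ 1,  # made-up address standing for another node
    }
    groups = group_by_node(reported)
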

When applying the ``LOCAL`` strategy, |Supvisors| chooses the local |Supvisors| instance.
A typical use case is to start an HCI application on a given console, while other applications / services may be
distributed over other nodes.

.. attention::

    A consequence of choosing the ``LOCAL`` strategy as the default ``starting_strategy``
    in the :ref:`supvisors_section` is that all programs will be started on the |Supvisors| *Master* instance.

Starting a process
~~~~~~~~~~~~~~~~~~

The internal *Starter* of |Supvisors| applies the following logic to start a process:

| if the process is stopped:
|     choose a |Supvisors| instance for the process in accordance with the rules defined in the previous section
|     perform a ``supvisors.start_args(namespec)`` XML-RPC to the chosen |Supvisors| instance
|

This single job is considered completed when:

    * a ``RUNNING`` event is received and the ``wait_exit`` rule is **not** set for this process;
    * an ``EXITED`` event is received with an expected exit code and the ``wait_exit`` rule is set for this process;
    * an error is encountered (``FATAL`` event, ``EXITED`` event with an unexpected exit code);
    * no ``STARTING`` event has been received 2 ticks after the XML-RPC;
    * no ``RUNNING`` event has been received X+2 ticks after the XML-RPC, X corresponding to the number of ticks needed
      to cover the ``startsecs`` seconds of the program definition in the |Supvisors| instance where the process
      has been requested to start.
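
The completion rules above can be sketched as a single function. The names are hypothetical, ``events`` is the
list of event names received for the process since the request, and X is computed by rounding ``startsecs`` up
to a whole number of 5-second ticks, which is an assumption of this sketch.

.. code-block:: python

    import math

    TICK_PERIOD = 5  # seconds between two TICK events


    def start_job_status(events, wait_exit, exit_expected, ticks_elapsed, startsecs):
        """Return why the job is over, or None while it is still pending."""
        if "FATAL" in events:
            return "error"
        if "EXITED" in events:
            if not exit_expected:
                return "error"  # unexpected exit code
            if wait_exit:
                return "exited"
        if "RUNNING" in events and not wait_exit:
            return "running"
        if "STARTING" not in events and ticks_elapsed >= 2:
            return "timeout"
        # X: number of ticks needed to cover startsecs (rounded up, an assumption)
        x_ticks = math.ceil(startsecs / TICK_PERIOD)
        if "RUNNING" not in events and ticks_elapsed >= x_ticks + 2:
            return "timeout"
        return None


    done = start_job_status(["STARTING", "RUNNING"], False, True, 1, 10)
    blocked = start_job_status(["STARTING", "RUNNING"], True, True, 9, 10)

The ``blocked`` case stays pending because the process is running while an exit is expected.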

This principle is used for starting a single process using a ``supvisors.start_process`` XML-RPC.

.. attention:: *About using the wait_exit rule*

    If the process is expected to exit and does not exit, it will block the *Starter* until |Supvisors| is restarted.


Starting an application
~~~~~~~~~~~~~~~~~~~~~~~

The application start sequence is re-evaluated every time a new |Supvisors| instance becomes active in |Supvisors|.
Indeed, as explained above, the internal data structure is updated with the programs configured in the new |Supervisor|
instance and this may have an impact on the application start sequence.

The start sequence corresponds to a dictionary where:

    * the keys correspond to the list of ``start_sequence`` values defined in the program rules of the application;
    * the value associated to a key contains the list of programs having this key as ``start_sequence``.
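
Such a dictionary could be built as follows. The ``rules`` mapping (program name to ``start_sequence``) is a
hypothetical simplification of the program rules, and the programs having a ``start_sequence`` lower than or
equal to 0 are left out of the start sequence.

.. code-block:: python

    def build_start_sequence(rules: dict) -> dict:
        """Group the program names by strictly positive start_sequence."""
        sequence = {}
        for program, rank in rules.items():
            if rank > 0:  # 0 or less: not meant to be started automatically
                sequence.setdefault(rank, []).append(program)
        return sequence


    rules = {"db": 1, "cache": 1, "web": 2, "tool": 0}
    sequence = build_start_sequence(rules)
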

.. hint::

    The logic applied here is an answer to the following |Supervisor| unresolved issues:

        * `#122 - supervisord Starts All Processes at the Same Time <https://github.com/Supervisor/supervisor/issues/122>`_
        * `#456 - Add the ability to set different "restart policies" on process workers <https://github.com/Supervisor/supervisor/issues/456>`_

.. important::

    Only the *Managed* applications can have a start sequence, i.e. only those that are declared in the |Supvisors|
    :ref:`rules_file`.

    The programs having a ``start_sequence`` lower than or equal to 0 are not considered in the start sequence, as they are
    not meant to be automatically started.

The internal *Starter* of |Supvisors| applies the following principle to start an application:

| while application start sequence is not empty:
|     pop the process list having the lowest (strictly positive) ``start_sequence``
|
|     for each process in process list:
|         apply `Starting a process`_
|
|     wait for the jobs to complete
|

This principle is used for starting a single application using a ``supvisors.start_application`` XML-RPC.
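
The loop above can be sketched in Python as follows. The ``start_application`` function and its ``start_process``
callback are hypothetical, and the waiting step is reduced to a comment because it depends on the events
received at runtime.

.. code-block:: python

    def start_application(sequence: dict, start_process) -> list:
        """Start the processes batch by batch, lowest start_sequence first."""
        pending = dict(sequence)  # work on a copy of the start sequence
        batches = []
        while pending:
            rank = min(pending)
            batch = pending.pop(rank)
            for process in batch:
                start_process(process)
            # a real Starter would wait here until the jobs are completed
            batches.append((rank, batch))
        return batches


    started = []
    batches = start_application({2: ["web"], 1: ["db", "cache"]}, started.append)
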


Starting all applications
~~~~~~~~~~~~~~~~~~~~~~~~~

When entering the ``DISTRIBUTION`` state, all |Supvisors| instances evaluate the global start sequence using
the ``start_sequence`` rule configured for the applications and processes.

The global start sequence corresponds to a dictionary where:

    * the keys correspond to the list of ``start_sequence`` values defined in the application rules;
    * the value associated to a key is the list of application start sequences whose applications have this key
      as ``start_sequence``.

The |Supvisors| *Master* instance starts the applications using the global start sequence.
The following pseudo-code explains the logic used:

| while global start sequence is not empty:
|     pop the application list having the lowest (strictly positive) ``start_sequence``
|
|     for each application in application list:
|         apply `Starting an application`_
|
|     wait for the jobs to complete
|

.. note::

    The applications having a ``start_sequence`` lower than or equal to 0 are not considered, as they are not meant to be
    automatically started.

.. important::

    When leaving the ``DISTRIBUTION`` state, it may happen that some applications are not started properly
    due to missing relevant |Supvisors| instances.

    When a |Supvisors| instance is started later and is authorized in the |Supvisors| ensemble, |Supvisors| transitions
    back to the ``DISTRIBUTION`` state and tries to **repair** such applications.
    The applications are **not** restarted. Only the stopped processes are considered.

    Should the new |Supvisors| instance arrive during a ``DISTRIBUTION`` or ``CONCILIATION`` phase, the transition to the
    ``DISTRIBUTION`` state is deferred until the current distribution or conciliation jobs are completed.
    It has been chosen NOT to transition back to the ``INITIALIZATION`` state to avoid a new synchronization phase.


.. _starting_failure_strategy:

Starting Failure strategy
-------------------------

When an application is starting, it may happen that any of its programs cannot be started due to various reasons:

    * the program command line is wrong;
    * third parties are missing;
    * none of the |Supvisors| instances defined in the ``identifiers`` of the program rules are started;
    * the applicable |Supvisors| instances are already too heavily loaded;
    * etc.

|Supvisors| uses the ``starting_failure_strategy`` option of the rules file to determine the behavior to apply
when a ``required`` process cannot be started. Programs having the ``required`` set to False are not considered as
their absence is minor by definition.

Possible values are:

    * ``ABORT``: Abort the application starting;
    * ``STOP``: Stop the application;
    * ``CONTINUE``: Skip the failure and continue the application starting.


.. _running_failure_strategy:

Running Failure strategy
------------------------

The ``autorestart`` option of |Supervisor| may be used to restart automatically a process that has crashed
or has exited unexpectedly (or not). However, when the node itself crashes or becomes unreachable,
the other |Supervisor| instances cannot do anything about that.

|Supvisors| uses the ``running_failure_strategy`` option of the rules file to warm restart a process that was
running on a node that has crashed, in accordance with the default ``starting_strategy`` set in the
:ref:`supvisors_section` and with the ``identifiers`` program rules set in the :ref:`rules_file`.

This option can be also used to stop or restart the whole application after a process crash. Indeed, it may happen
that some applications cannot survive if one of their processes is just restarted.

Possible values are:

    * ``CONTINUE``: Skip the failure and the application keeps running;
    * ``RESTART_PROCESS``: Restart the lost process on another |Supvisors| instance;
    * ``STOP_APPLICATION``: Stop the application;
    * ``RESTART_APPLICATION``: Restart the application;
    * ``SHUTDOWN``: Shutdown |Supvisors| (i.e. all |Supvisors| instances);
    * ``RESTART``: Restart |Supvisors| (i.e. all |Supvisors| instances).

.. important::

    The ``RESTART_PROCESS`` is NOT intended to replace the |Supervisor| ``autorestart`` for the local |Supvisors|
    instance.
    Provided a program definition where ``autorestart`` is set to ``false`` in the |Supervisor| configuration
    and where the ``running_failure_strategy`` option is set to ``RESTART_PROCESS`` in the |Supvisors| rules file,
    if the process crashes, |Supvisors| will NOT restart the process.

.. note::

    Given that this option is set on the program rules, program strategies within an application may be incompatible
    in the event of multiple failures. That's why priorities have been set on this strategy.
    ``STOP_APPLICATION`` supersedes ``RESTART_APPLICATION``, which itself supersedes ``RESTART_PROCESS`` and finally
    ``CONTINUE``. So if a program with the ``RESTART_APPLICATION`` option fails at the same time as a program
    of the same application with the ``STOP_APPLICATION`` option, only the ``STOP_APPLICATION`` will be applied.

    When the ``RESTART_PROCESS`` strategy is evaluated, if the application is fully stopped (supposedly because of the
    failure), |Supvisors| will promote the ``RESTART_PROCESS`` into ``RESTART_APPLICATION``. The idea is to benefit
    from a full start sequence at application level rather than uncorrelated program restarts in the event of multiple
    failures within the same application.
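
The priorities and the promotion rule can be sketched as follows. The ``resolve`` function is hypothetical, and
the ``SHUTDOWN`` and ``RESTART`` strategies are left out because no priority is stated for them here.

.. code-block:: python

    # from the lowest to the highest priority
    PRIORITY = ["CONTINUE", "RESTART_PROCESS", "RESTART_APPLICATION", "STOP_APPLICATION"]


    def resolve(strategies, application_stopped=False):
        """Return the single strategy applied to simultaneous failures."""
        effective = []
        for strategy in strategies:
            if strategy == "RESTART_PROCESS" and application_stopped:
                # fully stopped application: prefer a full start sequence
                strategy = "RESTART_APPLICATION"
            effective.append(strategy)
        return max(effective, key=PRIORITY.index)


    applied = resolve(["RESTART_APPLICATION", "STOP_APPLICATION"])
    promoted = resolve(["RESTART_PROCESS"], application_stopped=True)
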

.. hint::

   The ``STOP_APPLICATION`` strategy provides an answer to the following |Supervisor| request:

      * `#874 - Bring down one process when other process gets killed in a group <https://github.com/Supervisor/supervisor/issues/874>`_

.. hint::

   The ``SHUTDOWN`` strategy provides an answer to the following |Supervisor| request:

      * `#712 - shutdown supervisord once one of the programs is killed <https://github.com/Supervisor/supervisor/issues/712>`_


.. _stopping_strategy:

Stopping strategy
-----------------

|Supvisors| provides a means to stop a process without telling explicitly where it is running.


Stopping a process
~~~~~~~~~~~~~~~~~~

The internal *Stopper* of |Supvisors| applies the following logic to stop a process:

| if the process is running:
|     perform a ``supervisor.stopProcess(namespec)`` XML-RPC to the |Supervisor| instances where the process is running
|

This single job is considered completed when:

    * a ``STOPPED`` event is received for this process;
    * an error is encountered (``FATAL`` event, ``EXITED`` event whatever the exit code);
    * no ``STOPPING`` event has been received 2 ticks after the XML-RPC;
    * no ``STOPPED`` event has been received X+2 ticks after the XML-RPC, X corresponding to the number of ticks needed
      to cover the ``stopwaitsecs`` seconds of the program definition in the |Supvisors| instance where the process
      has been requested to stop.

This principle is used for stopping a single process using a ``supvisors.stop_process`` XML-RPC.


Stopping an application
~~~~~~~~~~~~~~~~~~~~~~~

The application stop sequence is defined at the same time as the application start sequence.
It corresponds to a dictionary where:

    * the keys correspond to the list of ``stop_sequence`` values defined in the program rules of the application;
    * the value associated to a key is the list of programs having this key as ``stop_sequence``.

.. note::

    The *Unmanaged* applications do have a stop sequence. All their programs have the default ``stop_sequence``
    set to ``0``.

.. hint::

    The logic applied here is an answer to the following |Supervisor| unresolved issue:

        * `#520 - allow a program to wait for another to stop before being stopped? <https://github.com/Supervisor/supervisor/issues/520>`_

.. hint::

    All the programs sharing the same ``stop_sequence`` are stopped simultaneously, which solves some of the requests
    described in the following |Supervisor| unresolved issue:

        * `#723 - Restart waits for all processes to stop before starting any <https://github.com/Supervisor/supervisor/issues/723>`_

The internal *Stopper* of |Supvisors| applies the following algorithm to stop an application:

| while application stop sequence is not empty:
|     pop the process list having the greatest ``stop_sequence``
|
|     for each process in process list:
|         apply `Stopping a process`_
|
|     wait for the jobs to complete
|

This principle is used for stopping a single application using a ``supvisors.stop_application`` XML-RPC.
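
The stop order that results from this algorithm can be sketched as follows. The ``stop_order`` function and the
``rules`` mapping (program name to ``stop_sequence``) are hypothetical; unlike the start sequence, the value 0
is kept here.

.. code-block:: python

    def stop_order(rules: dict) -> list:
        """Return the batches of programs, greatest stop_sequence first."""
        sequence = {}
        for program, rank in rules.items():
            sequence.setdefault(rank, []).append(program)
        # programs sharing a stop_sequence end up in the same batch
        return [sequence[rank] for rank in sorted(sequence, reverse=True)]


    order = stop_order({"web": 2, "api": 2, "db": 1, "tool": 0})
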


Stopping all applications
~~~~~~~~~~~~~~~~~~~~~~~~~

The applications are stopped when |Supvisors| is requested to restart or shut down.

When entering the ``DISTRIBUTION`` state, each |Supvisors| instance also evaluates the global stop sequence
using the ``stop_sequence`` rule configured for the applications and processes.

The global stop sequence corresponds to a dictionary where:

    * the keys correspond to the list of ``stop_sequence`` values defined in the application rules;
    * the value associated to a key is the list of application stop sequences whose applications have this key
      as ``stop_sequence``.

Upon reception of the ``supvisors.restart`` or ``supvisors.shutdown`` XML-RPC, the |Supvisors| instance uses
the global stop sequence to stop all the running applications in the defined order.
The following pseudo-code explains the logic used:

| while global stop sequence is not empty:
|     pop the application list having the greatest ``stop_sequence``
|
|     for each application in application list:
|         apply `Stopping an application`_
|
|     wait for the jobs to complete
|

.. _conciliation:

Conciliation
------------

|Supvisors| is designed so that there should be only one instance of the same process running on a set of nodes,
although all of them may have the capability to start it.

Nevertheless, it is still likely to happen in a few cases:

    * using a request to |Supervisor| itself (through Web UI, :program:`supervisorctl`, XML-RPC);
    * upon a network failure.

.. attention::

    In the event of a network failure (let's say a network cable is unplugged), if the ``auto_fence`` option is not
    set, a |Supvisors| instance running on the isolated node will be set to ``STOPPED`` instead of ``ISOLATED`` and its
    URL will not be disconnected from the subscriber socket.

    Depending on the rules set, this situation may lead |Supvisors| to warm restart the processes that were running in
    the lost |Supvisors| instance onto other |Supvisors| instances.

    When the network failure is fixed, |Supvisors| will likely have to deal with a bunch of duplicated applications
    and processes.

When such a conflict is detected, |Supvisors| enters the ``CONCILIATION`` state.
Depending on the ``conciliation_strategy`` option set in the :ref:`supvisors_section`, it applies a strategy to get rid
of all duplicates:

``SENICIDE``

    When applying the ``SENICIDE`` strategy, |Supvisors| keeps the youngest process, i.e. the process that has been
    started the most recently, and stops all the others.

``INFANTICIDE``

    When applying the ``INFANTICIDE`` strategy, |Supvisors| keeps the oldest process and stops all the others.

``USER``

    That's the easy one. When applying the ``USER`` strategy, |Supvisors| just waits for a third party
    to solve the conflicts using Web UI, :program:`supervisorctl`, XML-RPC, process signals, or any other solution.

``STOP``

    When applying the ``STOP`` strategy, |Supvisors| stops all conflicting processes, which may lead
    the corresponding applications to a degraded state.

``RESTART``

    When applying the ``RESTART`` strategy, |Supvisors| stops all conflicting processes and starts a new one.

``RUNNING_FAILURE``

    When applying the ``RUNNING_FAILURE`` strategy, |Supvisors| stops all conflicting processes and deals
    with the conflict as it would deal with a running failure, depending on the strategy defined for the process.
    So, after the conflicting processes are all stopped, |Supvisors| may restart the process, stop the application,
    restart the application or do nothing at all.
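
The first step of each strategy, i.e. deciding which conflicting processes are kept and which are stopped, can
be sketched as follows. The ``conciliate`` function is hypothetical, ``conflicts`` maps each |Supvisors|
instance running the process to the process start time, and the follow-up actions of ``RESTART`` and
``RUNNING_FAILURE`` are not modelled.

.. code-block:: python

    def conciliate(strategy: str, conflicts: dict):
        """Return the (kept, stopped) instance identifiers for one process."""
        everyone = sorted(conflicts)
        if strategy == "USER":
            return everyone, []  # left to a third party
        if strategy == "SENICIDE":
            kept = max(conflicts, key=conflicts.get)  # most recent start
        elif strategy == "INFANTICIDE":
            kept = min(conflicts, key=conflicts.get)  # oldest start
        elif strategy in ("STOP", "RESTART", "RUNNING_FAILURE"):
            return [], everyone  # follow-up action not modelled here
        else:
            raise ValueError(f"unknown strategy: {strategy}")
        return [kept], [identifier for identifier in everyone if identifier != kept]


    conflicts = {"inst1": 100.0, "inst2": 250.0}
    senicide = conciliate("SENICIDE", conflicts)
    infanticide = conciliate("INFANTICIDE", conflicts)
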

|Supvisors| leaves the ``CONCILIATION`` state when all conflicts are conciliated.

.. include:: common.rst