From 2256ce452780b305216ff98793437559bff929d1 Mon Sep 17 00:00:00 2001
From: Aurélien Bompard
Date: Mar 11 2020 16:56:52 +0000
Subject: Add notes to fix network partitions on RabbitMQ


Signed-off-by: Aurélien Bompard

---

diff --git a/docs/sysadmin-guide/sops/rabbitmq.rst b/docs/sysadmin-guide/sops/rabbitmq.rst
index c8ec977..656b675 100644
--- a/docs/sysadmin-guide/sops/rabbitmq.rst
+++ b/docs/sysadmin-guide/sops/rabbitmq.rst
@@ -93,6 +93,55 @@ RabbitMQ offers a CLI, rabbitmqctl, which you can use on any node in the
 cluster. It also offers a web interface for management and monitoring, but that
 is not currently configured.
 
+Network Partition
+-----------------
+
+In case of network partitions, the RabbitMQ cluster should handle it and
+recover on its own. If it does not recover once the network is fixed, the
+partition can be diagnosed with ``rabbitmqctl cluster_status``. The output
+should include the line ``{partitions,[]},`` (empty array).
+
+If the array is not empty, the first nodes in the array can be restarted one
+by one, but make sure you give them plenty of time to sync messages after
+restart (this can be watched in the ``/var/log/rabbitmq/rabbit.log`` file).
+
+Federation Status
+-----------------
+
+Federation is the process of copying messages from the internal ``/pubsub``
+vhost to the external ``/public_pubsub`` vhost. During network partitions, it
+has been seen that the Federation relaying process does not come back up. The
+federation status can be checked with the command ``rabbitmqctl eval
+'rabbit_federation_status:status().'`` on ``rabbitmq01``. It should not return
+the empty array (``[]``) but something like::
+
+    [[{exchange,<<"amq.topic">>},
+      {upstream_exchange,<<"amq.topic">>},
+      {type,exchange},
+      {vhost,<<"/public_pubsub">>},
+      {upstream,<<"pubsub-to-public_pubsub">>},
+      {id,<<"b40208be0a999cc93a78eb9e41531618f96d4cb2">>},
+      {status,running},
+      {local_connection,<<"">>},
+      {uri,<<"amqps://rabbitmq01.phx2.fedoraproject.org/%2Fpubsub">>},
+      {timestamp,{{2020,3,11},{16,45,18}}}],
+     [{exchange,<<"zmq.topic">>},
+      {upstream_exchange,<<"zmq.topic">>},
+      {type,exchange},
+      {vhost,<<"/public_pubsub">>},
+      {upstream,<<"pubsub-to-public_pubsub">>},
+      {id,<<"c1e7747425938349520c60dda5671b2758e210b8">>},
+      {status,running},
+      {local_connection,<<"">>},
+      {uri,<<"amqps://rabbitmq01.phx2.fedoraproject.org/%2Fpubsub">>},
+      {timestamp,{{2020,3,11},{16,45,17}}}]]
+
+If the empty array is returned, the following commands will restart the federation (again on ``rabbitmq01``)::
+
+    rabbitmqctl clear_policy -p /public_pubsub pubsub-to-public_pubsub
+    rabbitmqctl set_policy -p /public_pubsub --apply-to exchanges pubsub-to-public_pubsub "^(amq|zmq)\.topic$" '{"federation-upstream":"pubsub-to-public_pubsub"}'
+
+Afterwards, the Federation link status can be checked with the same command as before.
 
 .. RabbitMQ:: https://www.rabbitmq.com/
 .. clustering:: https://www.rabbitmq.com/clustering.html
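
Note on the two checks added by this patch: both amount to looking for an empty Erlang list in command output, so they could be scripted. The following is a minimal sketch, not part of the patch. The function names (``partitions_empty``, ``federation_links_present``, ``check_node``) are hypothetical, and it assumes ``rabbitmqctl cluster_status`` prints the Erlang-term format quoted above; other RabbitMQ versions may print a different format, in which case the first check would need adjusting.

```python
import re
import subprocess


def partitions_empty(cluster_status: str) -> bool:
    """Return True if the output contains an empty ``{partitions,[]}`` term."""
    # Tolerate whitespace variations such as "{partitions, []}".
    pattern = r"\{\s*partitions\s*,\s*\[\s*\]\s*\}"
    return re.search(pattern, cluster_status) is not None


def federation_links_present(federation_status: str) -> bool:
    """Return True unless the status output is blank or the empty list ``[]``."""
    compact = "".join(federation_status.split())
    return compact not in ("", "[]")


def run(*args: str) -> str:
    """Run a command and return its standard output as text."""
    result = subprocess.run(args, check=True, capture_output=True, text=True)
    return result.stdout


def check_node() -> dict:
    """Run both checks on the local node.

    Assumes rabbitmqctl is on PATH and the caller has sufficient privileges.
    """
    cluster = run("rabbitmqctl", "cluster_status")
    federation = run("rabbitmqctl", "eval", "rabbit_federation_status:status().")
    return {
        "partitions_empty": partitions_empty(cluster),
        "federation_links_present": federation_links_present(federation),
    }
```

Calling ``check_node()`` on ``rabbitmq01`` would return a dictionary with both results. A ``False`` value for either key points to the matching recovery steps in the patch; the sketch only reports and does not restart nodes or reset the policy.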