humaton / fedora-infra / ansible

Forked from fedora-infra/ansible 4 years ago
7e18f04 rabbitmq: adjust things to avoid messy partitions

Authored and Committed by kevin 5 years ago
    rabbitmq: adjust things to avoid messy partitions
    
    We have been having the cluster fall over for reasons that are still unknown,
    but this patch should at least help prevent that:
    
    First, we increase the net_ticktime parameter from its default of 60 seconds
    to 120. rabbitmq sends a 'tick' to other cluster members 4 times over this
    period, and if nothing is heard from a member for roughly that long (give or
    take 25%) it assumes that cluster member is down. All these VMs are
    on the same net and in the same datacenter, but perhaps heavy load
    from other VMs causes them to sometimes not get a tick in time?
    http://www.rabbitmq.com/nettick.html
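
To make the effect of the change concrete, here is a minimal sketch of the timing arithmetic, assuming the behaviour described in the RabbitMQ net tick documentation: ticks go out every quarter of net_ticktime, and a silent node is declared down somewhere between 75% and 125% of net_ticktime. The function name `tick_window` is made up for this illustration and is not part of RabbitMQ or this repository.

```python
def tick_window(net_ticktime: float) -> dict:
    """Return the tick interval and node-down detection window, in seconds.

    Assumption: ticks are sent every net_ticktime / 4 seconds, and a node
    that stays silent is considered down after net_ticktime +/- 25%.
    """
    return {
        "tick_interval": net_ticktime / 4,
        "detect_min": net_ticktime * 0.75,
        "detect_max": net_ticktime * 1.25,
    }


if __name__ == "__main__":
    # Compare the default value with the value set by this patch.
    for value in (60, 120):
        w = tick_window(value)
        print(
            f"net_ticktime={value}: tick every {w['tick_interval']:.0f}s, "
            f"node down after {w['detect_min']:.0f}-{w['detect_max']:.0f}s of silence"
        )
```

Under these assumptions, doubling net_ticktime doubles how long a node may stay silent before it is dropped, at the cost of slower detection of nodes that really are down.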
    
    Also, set our partitioning strategy to autoheal. Currently, if some cluster
    member gets booted out, it gets paused and stops processing entirely.
    With autoheal it will try to figure out a 'winning' partition and restart
    all the nodes that are not in that partition.
    https://www.rabbitmq.com/partitions.html
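
The autoheal choice can be sketched roughly as follows. This is an illustration only, not RabbitMQ's implementation: it assumes the rule given in the RabbitMQ partitions documentation, where the winner is the partition with the most connected clients, then the most nodes, and any remaining tie is broken in an unspecified way. The `Partition` class, the function names, and the node names are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Partition:
    nodes: Tuple[str, ...]      # hypothetical node names in this partition
    client_connections: int     # clients connected to nodes in this partition


def pick_winner(partitions: List[Partition]) -> Partition:
    """Pick the winning partition: most clients first, then most nodes.

    Remaining ties are unspecified in RabbitMQ; this sketch simply keeps
    the first candidate, which is how Python's max() behaves.
    """
    return max(partitions, key=lambda p: (p.client_connections, len(p.nodes)))


def nodes_to_restart(partitions: List[Partition]) -> List[str]:
    """Return every node outside the winning partition, sorted by name."""
    winner = pick_winner(partitions)
    return sorted(n for p in partitions if p is not winner for n in p.nodes)


if __name__ == "__main__":
    split = [
        Partition(nodes=("rabbit@a", "rabbit@b"), client_connections=10),
        Partition(nodes=("rabbit@c",), client_connections=3),
    ]
    print("winner:", pick_winner(split).nodes)
    print("restart:", nodes_to_restart(split))
```

The point of the sketch is that the losing side is restarted and rejoins, rather than sitting paused until someone intervenes.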
    
    Hopefully the first thing will make partitions less likely and the second
    will make them repair without causing massive pain to the cluster.
    
    Signed-off-by: Kevin Fenzi <kevin@scrye.com>