#3520 Scaling up fedmsg gateway
Closed: Fixed. Opened 11 years ago by ralph.

A concern I have is that once we announce fedmsg (and fedmsg-notify) to the world, that we'll get swamped with connections and be unable to serve the data.

To figure out what we could handle, I wrote [a script](https://github.com/ralphbean/fedmsg/tree/develop/extras/stress) to open a bunch of simultaneous zeromq subscriptions to hub.fedoraproject.org:9940.
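For the curious, here's a minimal sketch of what such a stress client looks like (this is not the actual script; it assumes pyzmq and crams all the "clients" into one process, which still gives each SUB socket its own TCP connection):

```python
import zmq

ENDPOINT = "tcp://hub.fedoraproject.org:9940"
N_CLIENTS = 550  # ramp this up to find the breaking point

ctx = zmq.Context()
poller = zmq.Poller()
sockets = []

# Open many SUB sockets, each acting as an independent subscriber.
for _ in range(N_CLIENTS):
    sock = ctx.socket(zmq.SUB)
    sock.setsockopt(zmq.SUBSCRIBE, b"")  # subscribe to every topic
    sock.connect(ENDPOINT)
    poller.register(sock, zmq.POLLIN)
    sockets.append(sock)

# Count how many of the clients actually receive each published message.
while True:
    ready = dict(poller.poll(timeout=1000))
    received = [sock for sock in sockets if sock in ready]
    for sock in received:
        sock.recv_multipart()
    if received:
        print("%d/%d clients got a message" % (len(received), N_CLIENTS))
```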

I first found that our setup maxed out at around 550 connections. Beyond that point some clients stopped receiving some messages (they were dropped), and if I pushed the connection count much higher, the fedmsg-gateway daemon on busgateway01 would crash complaining about "too many open files".

With lmacken's help, I edited /etc/security/limits.conf on busgateway01 and bumped the hard and soft limits on open file descriptors for the fedmsg user from the default of 1024 up to 160000 (near, but not yet equal to, fs.file-max). I also had to bump the zmq.HWM (high water mark) setting to a very large value so that the gateway could queue at least one message for each of its many subscribing connections.
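Roughly, the limits.conf change looked like this (values as described above):

```
# /etc/security/limits.conf on busgateway01
fedmsg    soft    nofile    160000
fedmsg    hard    nofile    160000
```

The high water mark bump is a one-liner on the publishing socket. This is just an illustrative pyzmq sketch, not the actual fedmsg-gateway code (and it uses the zeromq 2.x-era zmq.HWM option; newer releases split it into SNDHWM/RCVHWM):

```python
import zmq

ctx = zmq.Context()
pub = ctx.socket(zmq.PUB)
pub.setsockopt(zmq.HWM, 160000)  # room to queue one message per subscriber
pub.bind("tcp://*:9940")
```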

After that, my script was able to get messages consistently for up to around 1700 clients.

At this point I think I was hitting a limit in haproxy, which is configured for at most 1024 connections per proxy host. In theory we have 8(?) proxy hosts, each with a maximum of 1024 simultaneous connections, which should mean we can handle 8192 simultaneous clients.

However, I think round-robin DNS wasn't distributing the clients equitably, and some were getting rejected by the handful of overloaded proxies. The haproxy status page showed 2 or 3 of our 8(?) proxies at 1024 simultaneous connections (their max) while the others reported much lower numbers.

DNS aside, my bet is that we can bump the maxconn setting in haproxy and get better results. This will require 1) bumping the haproxy limit in puppet (haproxy.cfg) and 2) bumping the open-files limit in /etc/security/limits.conf for the user haproxy runs as.

That is a little scary, though. We could initially try just doubling this to 2048 and see what kind of increase in performance we get. Thoughts?
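For reference, the doubled settings would look roughly like this (a sketch only; the real stanzas live in our puppet-managed haproxy.cfg, and the nofile value here is illustrative):

```
# haproxy.cfg (puppet-managed): double the global connection ceiling
global
    maxconn 2048

# /etc/security/limits.conf on the proxies: make sure the haproxy user
# can actually open that many sockets
haproxy    soft    nofile    4096
haproxy    hard    nofile    4096
```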


Yeah, the reason only a few proxies were hit is that we separate them out by geoip region... so you probably hit the ones that we have set for NA (North America). The other regions have their own, etc.

I'm ok with trying this... can we try in stg first?

Do we have any way to notify the end user that there is an overload going on? Or do they just never see the messages?

At the very least we should have something in place to let us know when the service is oversubscribed...

I was able to get this up to 13k concurrent clients in staging. :)

Toshio reports that there are 7.4k active FAS accounts in cla_done and 1k in packager. I don't think we'll have much of a problem serving that user base if we move these changes into production.


There's no good way to notify the end user of an overload at this point. We can probably write a nagios check to ask haproxy about over-subscription. I'll look into this.
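Something along these lines might do it: scrape haproxy's stats CSV and warn when current sessions get close to the configured limit. The stats URL and threshold below are placeholders, not what's actually deployed:

```python
#!/usr/bin/env python3
# Hypothetical nagios check: warn when any haproxy frontend or backend is
# close to its configured session limit.
import csv
import sys
import urllib.request

STATS_URL = "http://localhost/haproxy-stats;csv"  # placeholder stats endpoint
WARN_RATIO = 0.8  # warn at 80% of maxconn

text = urllib.request.urlopen(STATS_URL).read().decode("utf-8")
# The first line of haproxy's CSV export starts with "# pxname,svname,...";
# strip the leading "# " so DictReader picks up the field names.
rows = csv.DictReader(line.lstrip("# ") for line in text.splitlines())

worst = 0.0
for row in rows:
    scur, slim = row.get("scur"), row.get("slim")
    if scur and slim and int(slim) > 0:
        worst = max(worst, int(scur) / int(slim))

status = "OK" if worst < WARN_RATIO else "WARNING"
print("%s: haproxy sessions at %.0f%% of maxconn" % (status, worst * 100))
sys.exit(0 if status == "OK" else 1)
```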

Nagios check is in place now.

Still need to migrate the maxconn bump to production.
