#28 investigate buildbot multimaster
Closed: Fixed · Opened 9 years ago by tflink.

Before we start running a production service, we need to investigate resiliency and redundancy in our execution platform. This means a multimaster deployment in buildbot terms.

Investigate the ways we could have a multimaster deployment, the advantages of each method, and the infrastructure requirements for that deployment.


I have a multimaster system running locally and while it does seem to be functioning correctly, it adds some complications and brings up some questions.

The biggest complication comes from having multiple systems instead of one. Upstream's suggested approach is to use one machine as the scheduler and the other machines to interface with the buildslaves. With only 2 machines (1 scheduler, 1 slave interface) this isn't a huge issue, but since jobs are run through the interface machines, you end up with URLs to jobs and logs that change depending on which interface got the job. The slaves only communicate with one master, so it's possible to configure things such that proper links are displayed in resultsdb, but it becomes more difficult to browse results, and the current UUID of (builder, build number) turns into (master, builder, build number), which could cause confusion.
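For reference, in buildbot 0.8.x terms a multimaster deployment is mostly a shared state database plus a flag, with schedulers and builders split across the masters' config files. A minimal sketch of the relevant master.cfg lines (the hostname and credentials are hypothetical):

```python
# master.cfg fragment present on every master (the scheduler machine and
# the slave-interface machines alike). Hostnames/credentials are made up.
c = BuildmasterConfig = {}

# Tell this master it is one of several sharing state, so it doesn't
# assume exclusive ownership of schedulers and builders.
c['multiMaster'] = True

# All masters point at the same database; this shared state is what
# actually ties the deployment together.
c['db_url'] = 'postgresql://buildbot:secret@db.example.com/buildbot'

# Then, per upstream's suggested split:
#  - the scheduler master defines c['schedulers'] but no slaves/builders
#  - each slave-interface master defines c['slaves'] and c['builders']
#    but no schedulers
```

The URL problem described above falls out of that split: each slave-interface master serves its own web status, so the job link depends on which master's builders picked up the work.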

Do we really need multiple masters right now? I'm not sure how to answer this quantitatively without some stress testing and/or benchmarks, but I don't think we really need multiple slave-interfacing masters at this time. That being said, I do see some wisdom in splitting off the scheduler so that if we do run into load issues, we can add another slave-interface master easily. This would also let us defer the question of how to track jobs better than with a complicated two- or three-item UUID.

I would like to do some stress testing before we deploy to production - something like doing a mass rebuild while a load tester hits the frontend. While not required, it would hopefully give us an idea of where the bottlenecks are and how far we can scale without adding multiple slave interfaces.

Another variable is that I want to try using buildbot's latent slave feature - it's possible to have buildbot spin up a new VM through libvirt for each job. This could increase the load on the slave interface and would be worth exploring before we do much load testing.
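As a sketch of what that would look like, buildbot 0.8.x ships a libvirt latent slave class. The image paths, slave name, password, and libvirt URI below are all hypothetical:

```python
# master.cfg fragment on a slave-interface master: a latent slave that
# buildbot boots on demand through libvirt. Names/paths are made up.
from buildbot.buildslave.libvirt import LibVirtSlave, Connection

connection = Connection("qemu:///system")

c['slaves'] = [
    LibVirtSlave(
        'taskbot-vm-1', 'slavepass', connection,
        hd_image='/var/lib/libvirt/images/taskbot-vm-1.qcow2',
        # Clone a fresh disk from this base image for each boot, so
        # every job runs in a clean VM.
        base_image='/var/lib/libvirt/images/taskbot-base.qcow2',
    ),
]
```

The extra load concern comes from the master doing VM lifecycle work (clone, boot, teardown) on top of its normal job dispatching.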

On a side note, this forced me to find a new way to do triggering. I spent a bit of time and came up with a different method of scheduling jobs that doesn't rely on the force build form, only needs one POST instead of a login/POST/logout sequence, and can be hidden from public access. Whichever way we go, I'd like to move toward that scheduling method.
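The ticket doesn't spell out the new method, but one common shape for "a single POST" against buildbot 0.8.x is its change hook endpoint with the base dialect. A sketch, assuming that hook is enabled on the master (the host, category, and field values are hypothetical):

```python
# Trigger a build with one unauthenticated POST to buildbot's base
# change hook. The master URL and field values here are made up.
import urllib.parse
import urllib.request

MASTER_URL = "http://buildmaster.example.com:8010"

def trigger_payload(project, revision, comments):
    """Form fields understood by the 0.8.x base change-hook dialect."""
    return {
        "project": project,
        "revision": revision,
        "comments": comments,
        "category": "post-koji-build",  # hypothetical category name
    }

def trigger_build(project, revision, comments):
    # One POST, no login/logout round trips; the port itself can be
    # firewalled off from public access instead of relying on the
    # force-build form's authentication.
    data = urllib.parse.urlencode(
        trigger_payload(project, revision, comments)).encode()
    with urllib.request.urlopen(MASTER_URL + "/change_hook/base",
                                data=data) as resp:
        return resp.status
```

A scheduler on the master then watches for changes in that category and queues the actual build.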

I'm closing this for now. Until we have better capabilities for task execution tracking (execdb), this isn't really practical because we don't have a priori knowledge of which buildmaster actually handles the execution, and thus what the link to the job actually is.

Once we have better tracking in place, we can re-open this ticket or create a new one.

Metadata Update from @tflink:
- Issue tagged with: infrastructure

6 years ago
