#30 Load/Stress Test Taskotron Infrastructure
Closed: Fixed. Opened 9 years ago by tflink.

Before we deploy to production, it would be nice to have some load/stress testing of a production-ish environment so that we can be more certain of our setup before pushing the "go" button.

One of the main things I'm concerned about is how well a single master will cope with the load of log reads on top of coordinating slave work. However, after looking at the AutoQA logs, I realized that there are not many people looking at the logs (~ 4-5 hits to any log per day) and I don't think it'll be a large source of load for the near future.

The things I'm interested in are:
* latency issues for scheduling tasks
* completion time for tasks
* cpu, memory and disk load on the master (see the sampling sketch after this list)
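
Something along these lines would do for the master-side sampling; this is a minimal sketch assuming psutil is available on the master, and the output path and interval are arbitrary:

```python
#!/usr/bin/env python
# Minimal sketch for the master-side measurements: append cpu, memory and disk
# usage samples to a CSV so they can be lined up against job submission times.
# Assumes psutil is installed on the master; the output path and 5 s interval
# are arbitrary. Columns: timestamp, cpu %, memory %, root filesystem %.
import csv
import time

import psutil

OUTFILE = 'master-load.csv'     # hypothetical output path
INTERVAL = 5                    # seconds between samples

with open(OUTFILE, 'a') as f:
    writer = csv.writer(f)
    while True:
        writer.writerow([
            int(time.time()),
            psutil.cpu_percent(interval=1),     # cpu usage averaged over 1 s
            psutil.virtual_memory().percent,    # memory in use
            psutil.disk_usage('/').percent,     # root filesystem usage
        ])
        f.flush()
        time.sleep(INTERVAL)
```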

== Components ==
* csv files of triggers to run for the various scenarios listed
* measurement of latency for hitting the waterfall page
* system performance measurements (done for testing purposes)
* measure buildbot metrics over time using the json interface (see the measurement sketch after this list)
* load generation that fetches pseudo-random log pages from buildbot (optional)
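
A rough sketch of the measurement and optional reader-load side; the master URL and builder names are assumptions, and the JSON field names ('pendingBuilds', 'currentBuilds') are what the buildbot 0.8.x web status exposes, so they may need adjusting for other versions:

```python
#!/usr/bin/env python
# Measurement sketch: time the waterfall page, pull queue/build counts from the
# buildbot JSON interface, and optionally fetch pseudo-random build pages to
# simulate log readers. MASTER and BUILDERS are hypothetical; adjust the JSON
# field names if the buildbot version differs from 0.8.x.
import csv
import random
import time

import requests

MASTER = 'http://localhost:8010'      # hypothetical buildbot master URL
BUILDERS = ['rpmlint', 'rpmbuild']    # hypothetical builder names


def waterfall_latency():
    """Seconds taken to fetch the waterfall page."""
    start = time.time()
    requests.get(MASTER + '/waterfall')
    return time.time() - start


def poll_builders():
    """Return (pending, running) totals across the builders of interest."""
    pending = running = 0
    for name in BUILDERS:
        data = requests.get('%s/json/builders/%s' % (MASTER, name)).json()
        pending += data.get('pendingBuilds', 0)
        running += len(data.get('currentBuilds', []))
    return pending, running


def fetch_random_log():
    """Optional reader load: request a pseudo-random recent build page.

    The build page is only a stand-in for real log URLs; point this at actual
    step log URLs if that matters for the scenario.
    """
    name = random.choice(BUILDERS)
    build = random.randint(1, 50)     # arbitrary window of recent builds
    requests.get('%s/builders/%s/builds/%d' % (MASTER, name, build))


if __name__ == '__main__':
    with open('buildbot-metrics.csv', 'a') as f:
        writer = csv.writer(f)
        while True:
            pending, running = poll_builders()
            # columns: timestamp, waterfall latency (s), pending, running
            writer.writerow([int(time.time()), round(waterfall_latency(), 3),
                             pending, running])
            f.flush()
            fetch_random_log()
            time.sleep(10)            # arbitrary sampling interval
```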

The process will be:
* start measurements and load generation
* run trigger on the csv events (see the driver sketch after this list)
* measure the time at which all jobs are completed
* wash, rinse, repeat
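
A rough driver for one run could look like the sketch below; the CSV layout, the submit_event() stub and the builder names are all assumptions that would need to be wired up to the actual trigger and deployment:

```python
#!/usr/bin/env python
# Driver sketch for one load run: replay a CSV of koji build events, then poll
# the master until the queue drains and report the wall-clock completion time.
# submit_event() is a placeholder and MASTER/BUILDERS are hypothetical.
import csv
import time

import requests

MASTER = 'http://localhost:8010'       # hypothetical buildbot master URL
BUILDERS = ['rpmlint', 'rpmbuild']     # hypothetical builder names


def submit_event(row):
    """Placeholder: hand one koji build event to the trigger.

    A real run would go through whatever path the trigger uses to queue jobs;
    printing keeps the driver dry-runnable.
    """
    print('would submit: %s' % row)


def queue_empty():
    """True when no builder reports pending or current builds."""
    for name in BUILDERS:
        data = requests.get('%s/json/builders/%s' % (MASTER, name)).json()
        if data.get('pendingBuilds', 0) or data.get('currentBuilds', []):
            return False
    return True


def run_scenario(csv_path):
    """Submit every event in the csv, then wait for the queue to drain."""
    start = time.time()
    with open(csv_path) as f:
        for row in csv.DictReader(f):
            submit_event(row)
    while not queue_empty():
        time.sleep(30)                  # arbitrary polling interval
    return time.time() - start


if __name__ == '__main__':
    print('completed in %.1f s' % run_scenario('one-day-of-koji-builds.csv'))
```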

== Situations ==

The following are initial situations to run for load/stress testing:

=== Long Running Tasks ===
This situation is when many long-running tasks are submitted at the same time. The experiment here is to see how the master copes with longer queues and long-running tasks.

The recently created buildsrpm task ({this ticket}) will work well for this because it doesn't depend on state.

Start with one day's worth of successful koji builds and go from there.

=== Short Tasks ===
This situation is when a large number of short-running tasks are submitted. The experiment is to see how many small tasks it takes to cause problems on the master, or at least to confirm that we have significant headroom given current rates of task execution.

Rpmlint is a good choice for this because it is generally quick and doesn't depend on infrastructure state.

Start with one day's worth of koji builds and scale up to at least a week's worth.

=== Combination ===
This is a mixture of long-running and short-running tasks. A split of 30% long-running and 70% short-running seems appropriate at first glance, but that ratio may need to change; a small sketch for generating the mix follows.
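
The column handling, the task names (buildsrpm for the long tasks, rpmlint for the short ones) and the output layout here are assumptions:

```python
#!/usr/bin/env python
# Sketch for building the combined scenario: tag each koji build event in a
# CSV with either the long-running or the short task, drawn at a 30/70 ratio.
# Input/output filenames and column layout are assumptions.
import csv
import random

LONG_FRACTION = 0.3   # ratio from the ticket; easy to tweak

with open('koji-builds.csv') as src, open('combined-scenario.csv', 'w') as dst:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames + ['task'])
    writer.writeheader()
    for row in reader:
        row['task'] = 'buildsrpm' if random.random() < LONG_FRACTION else 'rpmlint'
        writer.writerow(row)
```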


I've done some basic load testing of a taskotron instance.

I put some artificial load on the master (pulling pseudo-random log files ~ 1/10s) and submitted 600 rpmbuild requests over ~ 5 minutes. The master was rather slow to respond to requests while all the jobs were being submitted, but it didn't break and continued to work over the entire load period.
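
For reference, the pacing for a run like that could look roughly like this (not the script actually used; submit_request() is a hypothetical stand-in for whatever submits the jobs):

```python
#!/usr/bin/env python
# Pacing sketch only: 600 submissions spread over ~5 minutes works out to
# roughly one every 0.5 s.
import time

TOTAL = 600
DURATION = 300.0     # seconds


def submit_request(n):
    print('would submit rpmbuild request %d' % n)


for n in range(TOTAL):
    submit_request(n)
    time.sleep(DURATION / TOTAL)    # ~0.5 s between submissions
```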

I haven't done much with the results yet and haven't run all the scenarios listed above, but I think it's enough for iteration 4. I can make the raw data and/or scripts available if anyone is interested.

Our production instance has been holding up fine so far, and none of the earlier load testing showed issues with our setup.

Closing for now; we can revisit at a later date if needed.

Metadata Update from @tflink:
- Issue tagged with: infrastructure

6 years ago
