#97 HTTP 504 Gateway Timeout for ResultsDB calls
Closed: Fixed. Opened 9 years ago by kparal.

We're seeing a lot of ResultsDB calls time out today:
https://taskotron.stg.fedoraproject.org/taskmaster/builders/x86_64/builds/100935/steps/runtask/logs/taskotron.log
https://taskotron.stg.fedoraproject.org/taskmaster/builders/x86_64/builds/100961/steps/runtask/logs/taskotron.log
https://taskotron.stg.fedoraproject.org/taskmaster/builders/x86_64/builds/100969/steps/runtask/logs/taskotron.log
https://taskotron.stg.fedoraproject.org/taskmaster/builders/x86_64/builds/100979/steps/runtask/logs/taskotron.log

```
[libtaskotron:resultsdb_directive.py:189] 2015-04-02 08:55:32 INFO    Posting 115 results to ResultsDB...
[resultsdb_api:resultsdb_api.py:55] 2015-04-02 08:57:36 WARNING Received HTTP failure status code 504 for request: http://resultsdb-stg01.qa.fedoraproject.org/resultsdb_api/api/v1.0//jobs/112663
[resultsdb_api:resultsdb_api.py:59] 2015-04-02 08:57:36 DEBUG   Received invalid JSON data: Expecting value: line 1 column 1 (char 0)
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>504 Gateway Timeout</title>
</head><body>
<h1>Gateway Timeout</h1>
<p>The gateway did not receive a timely response
from the upstream server or application.</p>
</body></html>
```

It seems that ResultsDB is very slow to respond, even when tried manually:
https://taskotron.stg.fedoraproject.org/resultsdb_api/api/v1.0/jobs/112690
(times out for me as well, at the moment)

The question is whether this is some issue in ResultsDB or in our infrastructure. @jskladan claims that those queries should be fast, because they are indexed. Maybe this is related to the recent system upgrade to Fedora 21? Can somebody check the system load (CPU and I/O)?


This ticket had Differential requests assigned to it:
D338

I would really need to see some profiling from Postgres to be sure, but as far as I can tell from my local setup, querying this specific job should make 120 selects (job, testcase, results, 117x results_data), and all the selects are on primary or foreign keys - namely job.id, result.job_id, testcase.id, and result_data.result_id. I'm the last one to say that this could not be made more efficient, but the time it takes to finish seems a bit over the top. 120 select-over-index queries should not, IMHO, take this long even in the worst of cases.
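For illustration, here is a minimal, generic sketch of how the per-job query count could be verified by hooking a counter onto the SQLAlchemy engine. The connection string is hypothetical and this is not resultsdb's actual engine/session setup:

```python
# Minimal sketch: count the statements SQLAlchemy issues while one job is
# fetched and serialized, to sanity-check the "~120 selects per job" estimate.
# The connection string is hypothetical; resultsdb's own setup may differ.
from sqlalchemy import create_engine, event

engine = create_engine("postgresql://resultsdb@localhost/resultsdb")
statements = []

@event.listens_for(engine, "before_cursor_execute")
def count_statement(conn, cursor, statement, parameters, context, executemany):
    statements.append(statement)

# ... fetch and serialize one job through this engine here ...

print("queries issued:", len(statements))
```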

I'm watching the load on taskotron-stg01.qa, resultsdb-stg01.qa and our db server.

This is about as un-scientific as it gets, but the loads on taskotron-stg01.qa and resultsdb-stg01.qa are both minimal. db-qa01.qa shoots up from time to time (low 0.04, high 2.6), so I'm thinking that this is an issue with an inefficient query or an untuned database. The load seemed to spike a lot when I loaded resultsdb-fe on prod and stg at the same time, but I'm not putting too much faith in that observation :)

I'm going to search through the system stats that infra keeps to see if I can find something better but at the moment, it looks like either an inefficient query, a db that needs to be tuned or a db server that needs more resources.

Another thing to keep in mind is that the stg db (id 112868) is larger than either prod (67707) or dev (5072).

It turns out that none of the perf data collection systems are working on the db server (or any of the qa hosts, for that matter). I'm not really sure why, but I don't see that being fixed in the middle of beta freeze.

Since I can't poke at the production db host much without disrupting things, I'm going to see if I can reproduce this locally using a stg db snapshot and VMs. I have a bad feeling that this is going to take a while but let's see if it's reproducible outside our production environment.

I tried reproducing this locally with no luck - the queries during create_result were taking about 250ms but that's not bad enough to kill anything.

Looking closer at the logs on resultsdb-stg01.qa, it looks like the wsgi process is taking too long to respond and giving the 504 to resultsdb_api. All of the failures I'm seeing are in stg, upgradepath on f22-updates-pending.

There's almost no logging in the api part of resultsdb, so it's difficult to tell where the create_result() is going wrong. I've hacked some print statements into the code running on resultsdb-stg01.qa for now - hopefully we'll know more about what part of the process is going wrong when it fails again.

After adding the debug messages, it seems that the wsgi timeout and the subsequent 504 happens during this line:
https://bitbucket.org/fedoraqa/resultsdb/src/130792a31a80/resultsdb/controllers/api_v1.py#cl-494

I'm not sure I understand why the result is being added to the session after the session has already been committed. Then again, I don't understand why this is only happening on stg and not anywhere else. @jskladan - this seems to have been there since the first commits of resultsdb, do you have any recollection of why it's being done?

After much effort, I was finally able to get the same error on my local setup. I'm not 100% certain that the cause is the same, but I don't see any obvious reasons why it wouldn't be.

At this point, I'm suspecting it's a combination of a couple of things:
1. our db server could use some more CPU - the load factor was often over 2
2. the virthost that is holding our db server has been doing a RAID rebuild and I suspect that may be slowing down IO

The way I was able to reproduce the issue locally was to pound the DB server until it slowed down. I needed more than a single upgradepath with 190 results to do it, but my local system isn't in the middle of a raid rebuild.

If I'm right, the problem should go away once the virthost's array is done rebuilding (the current estimate is 400 minutes until completion). We could add more CPU to the db server, but that would require a freeze break to change the db-qa01.qa VM and would mean bringing down dev, stg and prod while the db server is rebooted. With jobs only failing in stg right now, and only the really large reporting jobs (f22-updates upgradepath seems to be the only one failing with timeout errors), I think we can wait until beta freeze is over.

That being said, I'd really like to see better logging (especially db error logging) in resultsdb. I suspect that'll be a bit tricky with flask-restful but I also think it'd be worth a shot for the next time we run into an issue like this. We may also want to look into tuning our postgres config but I'm not sure if that'd provide enough benefit to justify someone sitting down for a week or two to figure out good benchmarks, start tweaking the config and make sure it works in a pseudo-production capacity.

! In #452#5404, @tflink wrote:
I'm not sure I understand why the result is being added to the session after the session has already been committed. Then again, I don't understand why this is only happening on stg and not anywhere else. @jskladan - this seems to have been there since the first commits of resultsdb, do you have any recollection of why it's being done?

Not sure if this is still true (or the best way to do it), but the line (and others like it throughout the code) made sure that object properties like id are populated with the right values after the add()/commit() call. I'll try it out again to be sure this is still needed, or we can try to use session.refresh($ITEM$) instead. Note that the inserted_primary_key method does not really do what we need, as the primary key is not the only un-set thing after the add/commit, especially with objects that have relations.

I really don't know, though, why this operation takes so much time. I'll try to dig into the docs to get an answer, but I don't see why it should take that long if the worst-case scenario is running a SELECT for that particular object.
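For reference, a generic SQLAlchemy sketch of the two patterns being discussed - re-reading attributes after commit() versus an explicit session.refresh(). The model and field names are made up; this is not resultsdb's actual code:

```python
# Generic SQLAlchemy sketch (model and field names are made up). After
# commit(), instance attributes are expired; touching one triggers a lazy
# reload, while refresh() reloads the whole object in a single SELECT.
result = Result(job_id=job.id, testcase_name="upgradepath")

db.session.add(result)
db.session.commit()

# Option A: access an attribute -- SQLAlchemy re-SELECTs the expired state.
result_id = result.id

# Option B: reload everything explicitly.
db.session.refresh(result)
```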

! In #452#5405, @tflink wrote:
After much effort, I was finally able to get the same error on my local setup. I'm not 100% certain that the cause is the same, but I don't see any obvious reasons why it wouldn't be.

Thanks!

That being said, I'd really like to see better logging (especially db error logging) in resultsdb. I suspect that'll be a bit tricky with flask-restful but I also think it'd be worth a shot for the next time we run into an issue like this. We may also want to look into tuning our postgres config but I'm not sure if that'd provide enough benefit to justify someone sitting down for a week or two to figure out good benchmarks, start tweaking the config and make sure it works in a pseudo-production capacity.

OK, I'll get onto it. I did not really worry that much about it, since I once again fell into the "it's temporary, and a proof of concept anyway" trap. I'll try to work out some reasonable logging, but I'm not sure at the moment whether there is a reasonable way to log long-running queries (which is probably what matters here) in "userspace", from inside the program. It might well be worth setting up Postgres' internal logging or a perf tool for analysing difficult queries, like https://wiki.postgresql.org/wiki/Logging_Difficult_Queries
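One possible application-side approach is to time every statement through SQLAlchemy's cursor-execute events and log anything over a threshold. A minimal, generic sketch - the 200 ms threshold is arbitrary, and Postgres' own log_min_duration_statement setting covers the same ground on the server side:

```python
# Minimal sketch of "userspace" slow-query logging with SQLAlchemy events.
# The 200 ms threshold is an arbitrary assumption.
import logging
import time

from sqlalchemy import event
from sqlalchemy.engine import Engine

log = logging.getLogger("resultsdb.slow_queries")

@event.listens_for(Engine, "before_cursor_execute")
def _start_timer(conn, cursor, statement, parameters, context, executemany):
    conn.info.setdefault("query_start_time", []).append(time.time())

@event.listens_for(Engine, "after_cursor_execute")
def _log_slow(conn, cursor, statement, parameters, context, executemany):
    elapsed = time.time() - conn.info["query_start_time"].pop()
    if elapsed > 0.2:  # log anything slower than 200 ms
        log.warning("slow query (%.0f ms): %s", elapsed * 1000, statement)
```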

So much for the "it's slow because the virthost's RAID array is rebuilding" theory. There are still stg failures with the same symptoms even after the rebuild completed.

I've submitted a freeze break request to enable long query logging on the db host. If that's approved, we'll have better logging on the db server for long queries. I'm also going to see if we can bump the cpu resources without rebuilding the db host.

While I was waiting for +1s from infra for the freeze break, I was digging more into making sure the bottom threshold for logging slow queries was low enough and I stumbled upon something interesting.

To generate load on my local instance, I've written a script that submits results for a large upgradepath job (198 results). When I started adding timing details to the script, I found that it's not the submission of individual results that's causing the problem - it's updating the job post-execution (specifically, resultsdb_api.update_job(jobid, status='COMPLETED')) that's hitting the timeout most often.

This doesn't explain why the failed jobs in resultsdb-stg don't contain all the results they should but at least it's another piece of data.
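Roughly the kind of load script described above. Only the update_job(jobid, status='COMPLETED') call is quoted from this ticket; the ResultsDBapi constructor and the create_job()/create_result() signatures are assumptions and may not match the real client exactly:

```python
# Rough sketch of a load/timing script for a large upgradepath job.
# Constructor and create_job()/create_result() signatures are assumptions.
import time

from resultsdb_api import ResultsDBapi

api = ResultsDBapi("http://localhost:5001/api/v1.0")  # hypothetical local instance

job = api.create_job(ref_url="http://example.com/job")  # assumed signature
job_id = job["id"]

start = time.time()
for i in range(198):  # one large upgradepath job had 198 results
    api.create_result(job_id, "upgradepath", "PASSED",  # assumed signature
                      item="foo-%d.fc22" % i, type="koji_build")
print("results submitted in %.1f s" % (time.time() - start))

start = time.time()
api.update_job(job_id, status="COMPLETED")  # this is the call that times out
print("update_job took %.1f s" % (time.time() - start))
```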

! In #452#5426, @tflink wrote:
To generate load on my local instance, I've written a script that submits results for a large upgradepath job (198 results). When I started adding timing details to the script, I found that it's not the submission of individual results that's causing the problem - it's updating the job post-execution (specifically, resultsdb_api.update_job(jobid, status='COMPLETED')) that's hitting the timeout most often.

Another piece of data - I added some logging messages to my local resultsdb and the culprit has been the final add and commit in api_v1.py:update_job(job_id) so I'm going to keep digging into that for now.

I'm not sure this is the cause of the problems in stg, but it may be related. It turns out that when committing the job update in update_job(), sqlalchemy does a SELECT for each and every result linked to that job. Some of these larger upgradepath jobs have ~200 results, and from the logging I've been doing locally, these SELECTs are taking quite a bit of time (240+ ms on my local setup).

I don't understand why this started happening with the last revision, though. An index was added for the keys and values in result_data but that should be speeding things up, not slowing things down. I'm going to keep poking at this locally - I have the feeling that I'm still missing something here, so expect more pseudo-stream-of-consciousness updates :)
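For illustration, a generic SQLAlchemy sketch of the N+1 pattern being described and one way around it with eager loading. The Job/Result model and relationship names are assumptions, not resultsdb's actual schema:

```python
# Generic sketch of the N+1 pattern (model and relationship names assumed).
from sqlalchemy.orm import joinedload

# Lazy loading: one SELECT for the job, then one SELECT per result as the
# serializer walks job.results -- roughly 200 extra round trips for a large
# upgradepath job.
job = db.session.query(Job).filter(Job.id == job_id).one()
payload = [r.id for r in job.results]

# Eager loading: pull the results back together with the job in one joined
# query instead of N follow-up SELECTs (subqueryload() is the other option).
job = (db.session.query(Job)
       .options(joinedload(Job.results))
       .filter(Job.id == job_id)
       .one())
```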

I'm still fumbling around with this a bit, but found something interesting. On my local setup (f21 VM all-in-one for resultsdb, el7 VM for postgres with a mostly infra-playbook deployment), I looked at an example of the SELECT query that sqla is running n times for the update_job (the job id is an arbitrary number that exists in both the stg and prod snapshots). Running the same query from the psql CLI against resultsdb (snapshot from last week) and resultsdb_stg (snapshot from last week, many results added during testing), I found the following:

Using the resultsdb database:

```
[2015-04-06 20:31:37.933 GMT-[local]] LOG: duration: 203.034 ms statement: SELECT result_data.id AS result_data_id, result_data.result_id AS result_data_result_id, result_data.key AS result_data_key, result_data.value AS result_data_value FROM result_data WHERE 1436 = result_data.result_id;
```

Using the resultsdb_stg database:

```
[2015-04-06 20:32:52.168 GMT-[local]] LOG: duration: 315.782 ms statement: SELECT result_data.id AS result_data_id, result_data.result_id AS result_data_result_id, result_data.key AS result_data_key, result_data.value AS result_data_value FROM result_data WHERE 1436 = result_data.result_id;
```

The resultsdb database does not have the extra index added with alembic - it should be as close to identical to production as it can be, given that I'm not using the actual production db machine. The stg query time is a bit higher than what I'm seeing in practice (I suspect caching, since it's an old run - repeated queries got the execution time of both down), but I'm concerned that the production query is taking that long. The staging database is larger than the production database, so if the timeouts are a function of many slower queries, it looks like we may start seeing these failures in production before too long.

I'm still not all that confident that I've found the root cause yet and would welcome outside opinions/thoughts. I'm still digging into it, though.

I tried taking a different approach to the problem - trying to figure out how to keep apache from killing the request instead of figuring out why things are taking too long. I think that I know most of what's going on now.

The job I've been working with is upgradepath with 198 results in the single job.

The rough process for submitting results is:
* create a job in resultsdb
* submit results, linked to the parent job
* change the job's status to be 'COMPLETED'

The problem lies not in changing the job's status in the database like I previously thought but in how the api is responding to the update_job request as a whole - serializing the job and all of the results associated with that job. The serialization ends up doing 1 SELECT for each linked result, and if those queries are taking 250ms like they are on my local setup, that makes the response creation process take longer than mod_wsgi thinks is appropriate and httpd kills it, returning a 504 to the resultsdb_api client. When I short-circuit the response to return a static chunk of JSON with an HTTP 200, I can no longer reproduce the issue in my local setup.

For a longer term solution, I think that we need to figure out how to get those query times down - 250ms to get the details of a result seems too long for what should be a simple operation. However, I also think we need a quicker fix in place before we start seeing the same timeouts in production. My first thought is to not include the serialized results with the update_job PUT but I'm open to other ideas.
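A hypothetical sketch of that quicker fix - have the job-update endpoint stop serializing every linked result and return only a small payload. This is plain Flask for brevity rather than resultsdb's actual flask-restful view; names and routes are illustrative only:

```python
# Hypothetical sketch, not resultsdb's actual view code. Job and db are
# assumed to come from the application's models/session.
from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/api/v1.0/jobs/<int:job_id>", methods=["PUT"])
def update_job(job_id):
    job = db.session.query(Job).get(job_id)
    job.status = "COMPLETED"
    db.session.add(job)
    db.session.commit()
    # Return just the job fields instead of job + ~200 serialized results,
    # so the response is built long before mod_wsgi's timeout.
    return jsonify(id=job.id, status=job.status), 200
```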

This issue brings up another question - do we need to start looking into replacing/improving resultsdb? As @jskladan mentioned, resultsdb wasn't designed to be a long-term thing - we always intended to replace it with something better but so far, have been putting that off. I'd like to get disposable clients in place before we take on any other large projects but the question needed to be asked :)

! In #452#5430, @tflink wrote:
The problem lies not in changing the job's status in the database like I previously thought but in how the api is responding to the update_job request as a whole - serializing the job and all of the results associated with that job. <snip />
My first thought is to not include the serialized results with the update_job PUT but I'm open to other ideas.

I agree - from what I see in the code, the serialized data returned from update/store_result are not actually used anyway, so I'd support returning just the status code by default. I've been playing for quite some time with the idea of adding "data on demand" (for lack of a better name) functionality to the API - e.g. the possibility of getting just the job's information when querying for jobs (meaning no result data), etc., together with lazy object population and smarter serialization. So adding an option to the API queries that would allow returning just the status code, instead of the whole serialized object, seems quite reasonable to me.

This issue brings up another question - do we need to start looking into replacing/improving resultsdb? As @jskladan mentioned, resultsdb wasn't designed to be a long-term thing - we always intended to replace it with something better but so far, have been putting that off. I'd like to get disposable clients in place before we take on any other large projects but the question needed to be asked :)

To begin with, I'd like to know why it is so slow in the first place, and whether custom SQL or running stored procedures to populate the SQLAlchemy objects would be beneficial. We discussed some changes in the result organization during your stay in Brno, so it is a given that changes will need to be made in the future. I just want to know where the bottleneck is, as the queries we are running are so simple that the time they take to run just does not make any sense.

! In #452#5432, @jskladan wrote:

! In #452#5430, @tflink wrote:
The problem lies not in changing the job's status in the database like I previously thought but in how the api is responding to the update_job request as a whole - serializing the job and all of the results associated with that job. <snip />
My first thought is to not include the serialized results with the update_job PUT but I'm open to other ideas.

I agree - from what I see in the code, the serialized data returned from update/store_result are not actually used anyway, so I'd support returning just the status code by default. I've been playing for quite some time with the idea of adding "data on demand" (for lack of a better name) functionality to the API - e.g. the possibility of getting just the job's information when querying for jobs (meaning no result data), etc., together with lazy object population and smarter serialization. So adding an option to the API queries that would allow returning just the status code, instead of the whole serialized object, seems quite reasonable to me.

Sounds good to me. I'd like to get this timeout issue figured out before we start allowing users to request sets of data that are going to generate error messages but I suspect that was implied.

For now, we'll just not return the serialized result payload with job actions.

This issue brings up another question - do we need to start looking into replacing/improving resultsdb? As @jskladan mentioned, resultsdb wasn't designed to be a long-term thing - we always intended to replace it with something better but so far, have been putting that off. I'd like to get disposable clients in place before we take on any other large projects but the question needed to be asked :)

To begin with, I'd like to know why it is so slow in the first place, and whether custom SQL or running stored procedures to populate the SQLAlchemy objects would be beneficial. We discussed some changes in the result organization during your stay in Brno, so it is a given that changes will need to be made in the future. I just want to know where the bottleneck is, as the queries we are running are so simple that the time they take to run just does not make any sense.

Agreed - I think that the 'why' here is an important question. I just see the answer for that involving a deep dive into postgres tuning and optimal schema structure and was worried that production would start seeing the timeouts before we got to a real root cause of why a seemingly simple query is taking so long to execute. The next question is whether it's more important than working on disposable clients and I think it is - if we're putting off figuring out why depcheck is behaving like it is, I think this can wait as long as we have a "short" term fix that allows Taskotron to keep on working.

! In #452#5446, @tflink wrote:
Agreed - I think that the 'why' here is an important question. I just see the answer for that involving a deep dive into postgres tuning and optimal schema structure and was worried that production would start seeing the timeouts before we got to a real root cause of why a seemingly simple query is taking so long to execute. The next question is whether it's more important than working on disposable clients and I think it is - if we're putting off figuring out why depcheck is behaving like it is, I think this can wait as long as we have a "short" term fix that allows Taskotron to keep on working.

+1 I'll provide the API change/fix ASAP

I've updated my local instances with new packages containing the fixes from this ticket and can no longer reproduce the issue. I'm going to update dev and stg with the new packages, hopefully the failures will go away with the workaround.

OK, two hours in, and I have managed to find out the core issue causing the problem:
```
EXPLAIN ANALYZE SELECT * FROM result WHERE 115515 = result.job_id;

                                                QUERY PLAN
-----------------------------------------------------------------------------------------------------------
 Seq Scan on result  (cost=0.00..59799.04 rows=60 width=183) (actual time=239.394..239.499 rows=1 loops=1)
   Filter: (job_id = 115515)
   Rows Removed by Filter: 1510399
 Total runtime: 239.532 ms
```

Unlike MySQL, Postgres does not automatically create indexes on Foreign Keys - here, the result.job_id column. After adding the index, everything behaves as one would usually expect:
```
CREATE INDEX result_fk_job_id ON result USING btree (job_id);
EXPLAIN ANALYZE SELECT * FROM result WHERE 115515 = result.job_id;

                                                          QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
 Index Scan using result_fk_job_id on result  (cost=0.43..10.48 rows=60 width=183) (actual time=0.018..0.019 rows=1 loops=1)
   Index Cond: (115515 = job_id)
 Total runtime: 0.048 ms
```

I'll have a look at SqlAlchemy's docs to find out whether it has an option to create indexes on the ForeignKeys automagically (for future projects), and will add the index creation calls to the proper places, preferably in a separate ticket, though :)
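A sketch of what that could look like on the SQLAlchemy side - only the table/column names are taken from the query plans above, the rest of the model is assumed - plus the equivalent Alembic call for existing databases:

```python
# Sketch of declaring the foreign-key index in the model so new deployments
# get it automatically; SQLAlchemy does not index ForeignKey columns on its
# own. Table/column names match the query plans above, the rest is assumed.
from sqlalchemy import Column, ForeignKey, Integer

class Result(db.Model):  # db.Model assumed from flask-sqlalchemy
    __tablename__ = "result"
    id = Column(Integer, primary_key=True)
    job_id = Column(Integer, ForeignKey("job.id"), index=True)  # creates the btree index

# For databases that already exist, the same index via an Alembic migration:
#     op.create_index("result_fk_job_id", "result", ["job_id"])
```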

I think this one can be marked as resolved, what do you think, @tflink?

Great find! I'm glad to see that the change is relatively simple even if figuring out why has not been :)

Pretty much solved, yeah, but let's not count our chickens before they hatch. Given how long stg took to process the alembic script that added an index last time, I suspect that this will also involve a decent downtime while the indexes are created. We can deploy to dev/stg now and to prod once this is all figured out and has been running in stg for a bit.

From what I did in my virt machine, the index is created in a matter of tens of seconds - IMHO mostly because it is an index on an integer key instead of a composite string.
We'll see how it goes :)

It's been a while since we deployed the fix for this and I'm thinking that we can mark it as resolved - thoughts?

Agreed, seems to work fine. Closing.

Metadata Update from @kparal:
- Issue tagged with: infrastructure

6 years ago
