Learn more about these different git repos.
Other Git URLs
There are several possible inconsistencies in the database concerning builds' timestamps.
Firstly, there over 800 builds that have started_on > ended_on. Most of these are failed, however, there are 8 successful and 16 canceled. The successful ones are quite old -- the most recent one (it's the build for fedora-24-x86_64) is from July 2016. The most recent failed and canceled ones are from this year's February.
started_on
ended_on
There are more than 80,000 builds that have started_on null but valid ended_on. All of these are failed. An example from April.
Additionally, there are over 35,000 builds that have submitted_on > started_on (An example from April.), and 3000 builds with submitted_on > ended_on -- another example from April.
submitted_on
It should probably be noted that I found these inconsistencies in a db dump from the end of April, so I had no newer data.
Metadata Update from @msuchy: - Issue tagged with: bug
We could add (at least in PostgreSQL case) the CHECK (stop is null or stop > start) constraint.
CHECK (stop is null or stop > start)
Metadata Update from @praiskup: - Issue assigned to schlupov
I made selects and updates for these 4 issues with data in the database.
--started_on > ended_on --select os_release, os_version, build_id, ended_on, started_on from mock_chroot join build_chroot on id=mock_chroot_id where started_on>ended_on order by ended_on desc; update build_chroot set started_on=ended_on-1 where started_on>ended_on; --started_on null but valid ended_on --select * from build_chroot where started_on is null and ended_on is not null order by ended_on desc; update build_chroot set started_on=ended_on-1 where started_on is null and ended_on is not null; --submitted_on > started_on --select id,submitted_on, started_on from build join build_chroot on build_id=id where submitted_on>started_on order by started_on desc; update build set submitted_on=build_chroot.started_on-1 from build_chroot where build.id=build_chroot.build_id and build.submitted_on>build_chroot.started_on; --submitted_on > ended_on --select id,submitted_on, ended_on from build join build_chroot on build_id=id where submitted_on>ended_on order by ended_on desc; update build set submitted_on=build_chroot.ended_on-1 from build_chroot where build.id=build_chroot.build_id and build.submitted_on>build_chroot.ended_on;
Firstly, there over 800 builds that have started_on > ended_on.
I found only 240 builds today.
Most of these are failed, however, there are 8 successful and 16 canceled.
These numbers are smaller as well.
The successful ones are quite old -- the most recent one (it's the build for fedora-24-x86_64) is from July 2016.
Agreed, 374953 | 1 | fedora | 24 | x86_64 | 2016-07-10 01:42:23+00 | 2016-07-10 00:44:13+00
374953 | 1 | fedora | 24 | x86_64 | 2016-07-10 01:42:23+00 | 2016-07-10 00:44:13+00
The most recent failed and canceled ones are from this year's February.
716320 | 2 | fedora | 27 | x86_64 | 2018-02-16 10:53:25+00 | 2018-02-16 10:53:23+00 716283 | 2 | fedora | rawhide | i386 | 2018-02-16 09:49:19+00 | 2018-02-16 09:49:12+00 716283 | 2 | fedora | rawhide | x86_64 | 2018-02-16 09:49:20+00 | 2018-02-16 09:49:10+00
<img alt="started-after-end.txt.xz" src="/copr/copr/issue/raw/files/dc0525c7dd05ff0b3296dc35082743ed4219deead018d6def7faeea6d48bcb7a-started-after-end.txt.xz" />
Executing update build_chroot set started_on=ended_on-1 where started_on>ended_on; on them.
update build_chroot set started_on=ended_on-1 where started_on>ended_on;
select count(1) from build_chroot where started_on is null and ended_on is not null ; 140375
Hmm, this one is not resolved. Last build is: https://copr.fedorainfracloud.org/coprs/lkundrak/playground/build/1032002/
select build_id, status, os_release, os_release, arch, to_timestamp(submitted_on) as submitted, to_timestamp(started_on) as started, to_timestamp(ended_on) as ended from build_chroot join build on build.id = build_chroot.build_id join mock_chroot on mock_chroot.id = build_chroot.mock_chroot_id where started_on < submitted_on order by build_id desc; ... 80667
This one increased as well, last one:
1022827 | 8 | epel | epel | ppc64le | 2019-04-10 00:48:53+00 | 2019-04-10 00:47:43+00 | 2019-04-10 00:50:18+00
And this is the last with non-forked status: https://copr.fedorainfracloud.org/coprs/arturh/intel-opencl-experimental/build/881709/
submitted_after_started.txt.xz
Commit 2fcd6f3 fixes this issue
We still need to adjust the database.
Metadata Update from @praiskup: - Issue status updated to: Open (was: Closed)
Metadata Update from @praiskup: - Issue assigned to praiskup (was: schlupov)
Reassigned to me, thank you @schlupov for now!
I'll give the problem a chance for re-appear for a month or so, and then I'll clean the database if this is resolved.
Meh, still appears. 3 months ago:
coprdb=# select os_release, os_version, build_id, ended_on, started_on from mock_chroot join build_chroot on id=mock_chroot_id where started_on>ended_on order by ended_on desc; One appearance: https://copr.fedorainfracloud.org/coprs/bnied/kernel-lt-aufs/build/1159974/ --started_on null but valid ended_on --select * from build_chroot where started_on is null and ended_on is not null order by ended_on desc;
coprdb=# select os_release, os_version, build_id, ended_on, started_on from mock_chroot join build_chroot on id=mock_chroot_id where started_on>ended_on order by ended_on desc; One appearance: https://copr.fedorainfracloud.org/coprs/bnied/kernel-lt-aufs/build/1159974/
--started_on null but valid ended_on --select * from build_chroot where started_on is null and ended_on is not null order by ended_on desc;
Many (i'm lazy to estimate how many are new ATM), but e.g. https://copr.fedorainfracloud.org/coprs/g/copr/copr-dev/build/1220198/
--submitted_on > started_on --select id,submitted_on, started_on from build join build_chroot on build_id=id where submitted_on>started_on order by started_on desc;
Many, e.g. 1330655. https://copr.fedorainfracloud.org/coprs/praiskup/copr-test-long-action/build/1330655/ (that seems to be fork).
--submitted_on > ended_on --select id,submitted_on, ended_on from build join build_chroot on build_id=id where submitted_on>ended_on order by ended_on desc; update build set submitted_on=build_chroot.ended_on-1 from build_chroot where build.id=build_chroot.build_id and build.submitted_on>build_chroot.ended_on;
This one seems to be fixed. Though I tried to run the sql statement locally and it did not go well:
coprdb=# update build set submitted_on=build_chroot.ended_on-1 from build_chroot where build.id=build_chroot.build_id and build.submitted_on>build_chroot.ended_on; UPDATE 1713 coprdb=# select id,submitted_on, ended_on from build join build_chroot on build_id=id where submitted_on>ended_on order by ended_on desc; +--------+--------------+------------+ | id | submitted_on | ended_on | +--------+--------------+------------+ | 881172 | 1554846740 | 1554846739 | | 851076 | 1548601474 | 1548601463 | | 744782 | 1524554169 | 1524554158 | | 743628 | 1524262347 | 1524262345 | | 733070 | 1522148802 | 1522148801 | | 733070 | 1522148802 | 1522148801 | | 733070 | 1522148802 | 1522148800 | | 733070 | 1522148802 | 1522148800 | | 733070 | 1522148802 | 1522148799 | | 720276 | 1519411862 | 1519411861 | | 720276 | 1519411862 | 1519411860 | +--------+--------------+------------+
I'm turning this back to @schlupov (no hurry, we can discuss what to do about this). But I can not close this issue since it is not yet resolved.
Metadata Update from @praiskup: - Assignee reset
Meh, still appears. 3 months ago: coprdb=# select os_release, os_version, build_id, ended_on, started_on from mock_chroot join build_chroot on id=mock_chroot_id where started_on>ended_on order by ended_on desc; One appearance: https://copr.fedorainfracloud.org/coprs/bnied/kernel-lt-aufs/build/1159974/
there is still only this one build
there is still 139014 builds with started_on set to null. Last one is from 2020-02-17. Update run as: update build_chroot set started_on=ended_on-1 where started_on is null and ended_on is not null;
update build_chroot set started_on=ended_on-1 where started_on is null and ended_on is not null;
Many (i'm lazy to estimate how many are new ATM), but e.g. https://copr.fedorainfracloud.org/coprs/g/copr/copr-dev/build/1220198/ --submitted_on > started_on --select id,submitted_on, started_on from build join build_chroot on build_id=id where submitted_on>started_on order by started_on desc;
I found 74054 builds with submitted_on>started_on. Last one is from 2019-04-10. Update run as: update build set submitted_on=build_chroot.started_on-1 from build_chroot where build.id=build_chroot.build_id and build.submitted_on>build_chroot.started_on;
update build set submitted_on=build_chroot.started_on-1 from build_chroot where build.id=build_chroot.build_id and build.submitted_on>build_chroot.started_on;
about 16000
--submitted_on > ended_on --select id,submitted_on, ended_on from build join build_chroot on build_id=id where submitted_on>ended_on order by ended_on desc;
I found 5399 builds with submitted_on>ended_on
update build set submitted_on=build_chroot.ended_on-1 from build_chroot where build.id=build_chroot.build_id and build.submitted_on>build_chroot.ended_on; This one seems to be fixed. Though I tried to run the sql statement locally and it did not go well: coprdb=# update build set submitted_on=build_chroot.ended_on-1 from build_chroot where build.id=build_chroot.build_id and build.submitted_on>build_chroot.ended_on; UPDATE 1713 coprdb=# select id,submitted_on, ended_on from build join build_chroot on build_id=id where submitted_on>ended_on order by ended_on desc; +--------+--------------+------------+ | id | submitted_on | ended_on | +--------+--------------+------------+ | 881172 | 1554846740 | 1554846739 | | 851076 | 1548601474 | 1548601463 | | 744782 | 1524554169 | 1524554158 | | 743628 | 1524262347 | 1524262345 | | 733070 | 1522148802 | 1522148801 | | 733070 | 1522148802 | 1522148801 | | 733070 | 1522148802 | 1522148800 | | 733070 | 1522148802 | 1522148800 | | 733070 | 1522148802 | 1522148799 | | 720276 | 1519411862 | 1519411861 | | 720276 | 1519411862 | 1519411860 | +--------+--------------+------------+
update build set submitted_on=build_chroot.ended_on-1 from build_chroot where build.id=build_chroot.build_id and build.submitted_on>build_chroot.ended_on;
This one seems to be fixed. Though I tried to run the sql statement locally and it did not go well: coprdb=# update build set submitted_on=build_chroot.ended_on-1 from build_chroot where build.id=build_chroot.build_id and build.submitted_on>build_chroot.ended_on; UPDATE 1713 coprdb=# select id,submitted_on, ended_on from build join build_chroot on build_id=id where submitted_on>ended_on order by ended_on desc; +--------+--------------+------------+ | id | submitted_on | ended_on | +--------+--------------+------------+ | 881172 | 1554846740 | 1554846739 | | 851076 | 1548601474 | 1548601463 | | 744782 | 1524554169 | 1524554158 | | 743628 | 1524262347 | 1524262345 | | 733070 | 1522148802 | 1522148801 | | 733070 | 1522148802 | 1522148801 | | 733070 | 1522148802 | 1522148800 | | 733070 | 1522148802 | 1522148800 | | 733070 | 1522148802 | 1522148799 | | 720276 | 1519411862 | 1519411861 | | 720276 | 1519411862 | 1519411860 | +--------+--------------+------------+
I tried to run the command three times and it worked, only the update is somehow gradual
I might have figured out the issue causing inconsistent timestamps. Maybe, maybe not, however, I was able to reproduce it.
The started_on and ended_on timestamps are sent from backend to frontend so my guess is that this happens if backend server gets out of sync with time and then gets fixed during a build.
coprdb=# ... where started_on>ended ... (1 row)
Now I submitted a build and wait for it to get into a running state (not SRPM, but for a build chroot).
$ docker exec -it copr_backend-build_1 bash [root@backend /] # date Thu Aug 6 20:02:31 UTC 2020
And I've readjusted the system time ... this may be caused by something on our production servers - maybe NTP? Maybe running playbooks? Maybe reboot? I am not sure yet.
[root@backend /] # date +%T -s "19:02:31" 19:02:31
Wait once the build finishes and bingo, it has inconsistent timestamps.
coprdb=# ... where started_on>ended ... (2 rows)
My point being - can we not use timestamps from backend but rather let PostgreSQL generate them? I am not sure if it would solve this particular issue but makes more sense to me.
My alternative theory is that we do this
if upd_dict.get("status") in BuildsLogic.terminal_states: build_chroot.ended_on = upd_dict.get("ended_on") or time.time() if upd_dict.get("status") == StatusEnum("starting"): build_chroot.started_on = upd_dict.get("started_on") or time.time()
So if the value is not sent from backend, we use time.time() from frontend - which may have a slightly different time than backend and cause the issue. I am gonna investigate further if I can find out a case when backend sends None here.
time.time()
None
I am gonna investigate further if I can find out a case when backend sends None here.
Now, with BuildBackgroundWorker, the code looks fairly simple and I cannot see any way how the started_on or ended_on wouldn't get sent. But even before the rework, with a slightly more complicated logic, I think it was just fine. So IMHO my original theory with readjusting system time during a build on the backend might be valid.
BuildBackgroundWorker
We may get the times desynchronized, though:
this may be caused by something on our production servers - maybe NTP?
This shouldn't happen. NTP should fix the time gaps by speeding up/slowing down the system time.
Maybe running playbooks?
Maybe, but that would be bug I guess. We can fix large time offset right after the first VM start, but never again.
Maybe reboot? I am not sure yet.
We reboot rarely.
.. anyway, I already thought that we only set the times only on frontend... One might be afraid that the time will be imprecise ... but it shouldn't. When backend sets the time, it actually immediately sends the data to frontend by http request. So, let's set the times on frontend only (I don't think it needs to be PostgreSQL database).
All we want to achieve in this issue is to assure that
1. When ended_on is set => started_on is set => submitted_on is set 2. ended_on (if set) >= started_on >= submitted_on
It's valid to have those two scenarios:
=> started_on is null and ended_on is null => started_on not null and ended_on is null
So with a bit of more fresh head now, here is my response for problems above, and summary:
started_on > ended_on
This has still one occurence:
select build_chroot.id, os_release, os_version, build_id, ended_on, started_on from mock_chroot join build_chroot on mock_chroot.id = mock_chroot_id where started_on > ended_on order by build_id desc;
I fixed it manually by:
update build_chroot set ended_on = 1579452889 where id = 2186781;
Problem seems to be SOLVED. Completely
started_on == null && ended_on != null
Nothing extra new here, it's good! There's only one new build which means that this can nowadays only happen when we don't set the started_on field when we are canceling the build:
select * from build_chroot where started_on is null and ended_on is not null order by build_id desc;
Note that the build 1650511 is the newest build id with this problem, and the second newest one is 1242142 (pretty old one).
Proposed solution here: We shouldn't set ended_on for canceled build chroots. If builder was fast enough to finish the build_chroot in canceled build, OK (it will be valid timestamp). But if not, leave it null.
Otherwise it looks there's no more problems in our codebase here!
Proposed DB fix for this was correct:
update build_chroot set started_on = ended_on-1 where started_on is null and ended_on is not null;
submitted_on > started_on
This indeed has some new instances, but as can be seen here from the following query, these are all "forks" which just inherited the original (and very old) messed build_chroot entries:
select build.id, status, submitted_on, started_on from build join build_chroot on build_id = build.id where submitted_on>started_on order by build.id desc;
Number of builds:
select count(1) from ( select build.id from build join build_chroot on build_id = build.id where submitted_on > started_on group by build.id) as foo; ┌───────┐ │ count │ ├───────┤ │ 23210 │ └───────┘ (1 row)
The proposed sql query to fix this inconsistency is wrong:
Per PostgreSQL documentation:
When a FROM clause is present, what essentially happens is that the target table is joined to the tables mentioned in the from_item list, and each output row of the join represents an update operation for the target table. When using FROM you should ensure that the join produces at most one output row for each row to be modified. In other words, a target row shouldn't join to more than one row from the other table(s). If it does, then only one of the join rows will be used to update the target row, but which one will be used is not readily predictable.
See https://www.postgresql.org/docs/12/sql-update.html
There are generally multiple instances of build_chroot per build. If we have no guarantee that the lowest value will be used, one submitted_on will be replaced by one started_on, but there might be other build_chroot.started_on values with even lower value.
I propose this query instead:
update build set submitted_on = subquery.min - 1 from ( select build.id as id, min(started_on) as min from build join build_chroot on build_id = build.id where submitted_on > started_on group by build.id ) as subquery where build.id = subquery.id;
Please review.
submitted_on > ended_on
This is a weird combination of problem 2. and 3. That's because of the premise
ended_on is set => started_on is set => submitted_on is set
So here ended_on is set, submitted_on is set, but started_on isn't:
select build.id,submitted_on, started_on, ended_on from build join build_chroot on build_id=build.id where submitted_on>ended_on order by build.id desc;
Since Problem 1 is fixed even now, when we have 2 fixed (fill the missing started_on fields) and then we fix 3 (we'll fid the submitted_on) we'll have fix for problem 4 for free.
So I propose to:
@schlupov, can you take a look at code-fix for Problem 2, and review the awfully long text above?
Problem 1: This has been solved. I can't find any new records. Problem 2: See PR https://pagure.io/copr/copr/pull-request/1519. Problem 3: Code
is ok. I ran the update on my database dump and all 23210 records were updated. Problem 4: Will be solved by previous problems.
Commit 65b3435 relates to this ticket
We are waiting for a new release here, then we'll run the SQL commands and close this.
Ok, I've run the SQL queries both on production and devel. Done.
Metadata Update from @praiskup: - Issue close_status updated to: Fixed - Issue status updated to: Closed (was: Open)
Login to comment on this ticket.