#342 builds timestamps are inconsistent
Closed: Fixed 3 years ago by praiskup. Opened 5 years ago by dturecek.

There are several possible inconsistencies in the database concerning builds' timestamps.

Firstly, there over 800 builds that have started_on > ended_on. Most of these are failed, however, there are 8 successful and 16 canceled. The successful ones are quite old -- the most recent one (it's the build for fedora-24-x86_64) is from July 2016.
The most recent failed and canceled ones are from this year's February.

There are more than 80,000 builds that have started_on null but valid ended_on. All of these are failed. An example from April.

Additionally, there are over 35,000 builds that have submitted_on > started_on (An example from April.), and 3000 builds with submitted_on > ended_on -- another example from April.

It should probably be noted that I found these inconsistencies in a db dump from the end of April, so I had no newer data.


Metadata Update from @msuchy:
- Issue tagged with: bug

5 years ago

We could add (at least in PostgreSQL case) the CHECK (stop is null or stop > start) constraint.

Metadata Update from @praiskup:
- Issue assigned to schlupov

4 years ago

I made selects and updates for these 4 issues with data in the database.

--started_on > ended_on
--select os_release, os_version, build_id, ended_on, started_on from mock_chroot join build_chroot on id=mock_chroot_id where started_on>ended_on order by ended_on desc;
update build_chroot set started_on=ended_on-1 where started_on>ended_on;

--started_on null but valid ended_on
--select * from build_chroot where started_on is null and ended_on is not null order by ended_on desc;
update build_chroot set started_on=ended_on-1 where started_on is null and ended_on is not null;

--submitted_on > started_on
--select id,submitted_on, started_on from build join build_chroot on build_id=id where submitted_on>started_on order by started_on desc;
update build set submitted_on=build_chroot.started_on-1 from build_chroot where build.id=build_chroot.build_id and build.submitted_on>build_chroot.started_on;

--submitted_on > ended_on
--select id,submitted_on, ended_on from build join build_chroot on build_id=id where submitted_on>ended_on order by ended_on desc;
update build set submitted_on=build_chroot.ended_on-1 from build_chroot where build.id=build_chroot.build_id and build.submitted_on>build_chroot.ended_on;

Firstly, there over 800 builds that have started_on > ended_on.

I found only 240 builds today.

Most of these are failed, however, there are 8 successful and 16
canceled.

These numbers are smaller as well.

The successful ones are quite old -- the most recent one (it's the build
for fedora-24-x86_64) is from July 2016.

Agreed, 374953 | 1 | fedora | 24 | x86_64 | 2016-07-10 01:42:23+00 | 2016-07-10 00:44:13+00

The most recent failed and canceled ones are from this year's February.

   716320 |      2 | fedora     | 27         | x86_64  | 2018-02-16 10:53:25+00 | 2018-02-16 10:53:23+00
   716283 |      2 | fedora     | rawhide    | i386    | 2018-02-16 09:49:19+00 | 2018-02-16 09:49:12+00
   716283 |      2 | fedora     | rawhide    | x86_64  | 2018-02-16 09:49:20+00 | 2018-02-16 09:49:10+00

There are more than 80,000 builds that have started_on null but valid ended_on. All of these are failed. An example from April.

select count(1) from build_chroot where started_on is null and ended_on is not null ;
140375

Hmm, this one is not resolved. Last build is:
https://copr.fedorainfracloud.org/coprs/lkundrak/playground/build/1032002/

Additionally, there are over 35,000 builds that have submitted_on > started_on
(An example from April.), and 3000 builds with submitted_on > ended_on --
another example from April.

select build_id, status, os_release, os_release, arch,
to_timestamp(submitted_on) as submitted, to_timestamp(started_on) as started,
to_timestamp(ended_on) as ended from build_chroot join build on build.id =
build_chroot.build_id join mock_chroot on mock_chroot.id =
build_chroot.mock_chroot_id where started_on < submitted_on order by build_id
desc;
...
80667

This one increased as well, last one:

  1022827 |      8 | epel          | epel          | ppc64le | 2019-04-10 00:48:53+00 | 2019-04-10 00:47:43+00 | 2019-04-10 00:50:18+00

We still need to adjust the database.

Metadata Update from @praiskup:
- Issue status updated to: Open (was: Closed)

4 years ago

Metadata Update from @praiskup:
- Issue assigned to praiskup (was: schlupov)

4 years ago

Reassigned to me, thank you @schlupov for now!

I'll give the problem a chance for re-appear for a month or so, and then I'll clean the database if this is resolved.

Meh, still appears. 3 months ago:

coprdb=# select os_release, os_version, build_id, ended_on, started_on from mock_chroot join build_chroot on id=mock_chroot_id where started_on>ended_on order by ended_on desc;
One appearance:
https://copr.fedorainfracloud.org/coprs/bnied/kernel-lt-aufs/build/1159974/

--started_on null but valid ended_on
--select * from build_chroot where started_on is null and ended_on is not null order by ended_on desc;

Many (i'm lazy to estimate how many are new ATM), but e.g. https://copr.fedorainfracloud.org/coprs/g/copr/copr-dev/build/1220198/

--submitted_on > started_on
--select id,submitted_on, started_on from build join build_chroot on build_id=id where submitted_on>started_on order by started_on desc;

Many, e.g. 1330655. https://copr.fedorainfracloud.org/coprs/praiskup/copr-test-long-action/build/1330655/ (that seems to be fork).

--submitted_on > ended_on
--select id,submitted_on, ended_on from build join build_chroot on build_id=id where submitted_on>ended_on order by ended_on desc;
update build set submitted_on=build_chroot.ended_on-1 from build_chroot where build.id=build_chroot.build_id and build.submitted_on>build_chroot.ended_on;

This one seems to be fixed. Though I tried to run the sql statement locally and it
did not go well:

coprdb=# update build set submitted_on=build_chroot.ended_on-1 from build_chroot where build.id=build_chroot.build_id and build.submitted_on>build_chroot.ended_on;
UPDATE 1713
coprdb=# select id,submitted_on, ended_on from build join build_chroot on build_id=id where submitted_on>ended_on order by ended_on desc;
+--------+--------------+------------+
|   id   | submitted_on |  ended_on  |
+--------+--------------+------------+
| 881172 |   1554846740 | 1554846739 |
| 851076 |   1548601474 | 1548601463 |
| 744782 |   1524554169 | 1524554158 |
| 743628 |   1524262347 | 1524262345 |
| 733070 |   1522148802 | 1522148801 |
| 733070 |   1522148802 | 1522148801 |
| 733070 |   1522148802 | 1522148800 |
| 733070 |   1522148802 | 1522148800 |
| 733070 |   1522148802 | 1522148799 |
| 720276 |   1519411862 | 1519411861 |
| 720276 |   1519411862 | 1519411860 |
+--------+--------------+------------+

I'm turning this back to @schlupov (no hurry, we can discuss what to do about this).
But I can not close this issue since it is not yet resolved.

Metadata Update from @praiskup:
- Assignee reset

4 years ago

Metadata Update from @praiskup:
- Issue assigned to schlupov

4 years ago

Meh, still appears. 3 months ago:

coprdb=# select os_release, os_version, build_id, ended_on, started_on from mock_chroot join build_chroot on id=mock_chroot_id where started_on>ended_on order by ended_on desc;
One appearance:
https://copr.fedorainfracloud.org/coprs/bnied/kernel-lt-aufs/build/1159974/

there is still only this one build

--started_on null but valid ended_on
--select * from build_chroot where started_on is null and ended_on is not null order by ended_on desc;

there is still 139014 builds with started_on set to null. Last one is from 2020-02-17. Update run as:
update build_chroot set started_on=ended_on-1 where started_on is null and ended_on is not null;

Many (i'm lazy to estimate how many are new ATM), but e.g. https://copr.fedorainfracloud.org/coprs/g/copr/copr-dev/build/1220198/

--submitted_on > started_on
--select id,submitted_on, started_on from build join build_chroot on build_id=id where submitted_on>started_on order by started_on desc;

I found 74054 builds with submitted_on>started_on. Last one is from 2019-04-10. Update run as:
update build set submitted_on=build_chroot.started_on-1 from build_chroot where build.id=build_chroot.build_id and build.submitted_on>build_chroot.started_on;

Many, e.g. 1330655. https://copr.fedorainfracloud.org/coprs/praiskup/copr-test-long-action/build/1330655/ (that seems to be fork).

about 16000

--submitted_on > ended_on
--select id,submitted_on, ended_on from build join build_chroot on build_id=id where submitted_on>ended_on order by ended_on desc;

I found 5399 builds with submitted_on>ended_on

update build set submitted_on=build_chroot.ended_on-1 from build_chroot where build.id=build_chroot.build_id and build.submitted_on>build_chroot.ended_on;

This one seems to be fixed. Though I tried to run the sql statement locally and it
did not go well:
coprdb=# update build set submitted_on=build_chroot.ended_on-1 from build_chroot where build.id=build_chroot.build_id and build.submitted_on>build_chroot.ended_on;
UPDATE 1713
coprdb=# select id,submitted_on, ended_on from build join build_chroot on build_id=id where submitted_on>ended_on order by ended_on desc;
+--------+--------------+------------+
| id | submitted_on | ended_on |
+--------+--------------+------------+
| 881172 | 1554846740 | 1554846739 |
| 851076 | 1548601474 | 1548601463 |
| 744782 | 1524554169 | 1524554158 |
| 743628 | 1524262347 | 1524262345 |
| 733070 | 1522148802 | 1522148801 |
| 733070 | 1522148802 | 1522148801 |
| 733070 | 1522148802 | 1522148800 |
| 733070 | 1522148802 | 1522148800 |
| 733070 | 1522148802 | 1522148799 |
| 720276 | 1519411862 | 1519411861 |
| 720276 | 1519411862 | 1519411860 |
+--------+--------------+------------+

I tried to run the command three times and it worked, only the update is somehow gradual

I might have figured out the issue causing inconsistent timestamps. Maybe, maybe not, however, I was able to reproduce it.

The started_on and ended_on timestamps are sent from backend to frontend so my guess is that this happens if backend server gets out of sync with time and then gets fixed during a build.

coprdb=# ... where started_on>ended
...
(1 row)

Now I submitted a build and wait for it to get into a running state (not SRPM, but for a build chroot).

$ docker exec -it copr_backend-build_1 bash
[root@backend /] # date
Thu Aug  6 20:02:31 UTC 2020

And I've readjusted the system time ... this may be caused by something on our production servers - maybe NTP? Maybe running playbooks? Maybe reboot? I am not sure yet.

[root@backend /] # date +%T -s "19:02:31"                                                                               
19:02:31

Wait once the build finishes and bingo, it has inconsistent timestamps.

coprdb=# ... where started_on>ended
...
(2 rows)

My point being - can we not use timestamps from backend but rather let PostgreSQL generate them? I am not sure if it would solve this particular issue but makes more sense to me.


My alternative theory is that we do this

if upd_dict.get("status") in BuildsLogic.terminal_states:
    build_chroot.ended_on = upd_dict.get("ended_on") or time.time()

if upd_dict.get("status") == StatusEnum("starting"):
    build_chroot.started_on = upd_dict.get("started_on") or time.time()

So if the value is not sent from backend, we use time.time() from frontend - which may have a slightly different time than backend and cause the issue. I am gonna investigate further if I can find out a case when backend sends None here.

I am gonna investigate further if I can find out a case when backend sends None here.

Now, with BuildBackgroundWorker, the code looks fairly simple and I cannot see any way how the started_on or ended_on wouldn't get sent. But even before the rework, with a slightly more complicated logic, I think it was just fine. So IMHO my original theory with readjusting system time during a build on the backend might be valid.

We may get the times desynchronized, though:

this may be caused by something on our production servers - maybe NTP?

This shouldn't happen. NTP should fix the time gaps by speeding up/slowing down the system time.

Maybe running playbooks?

Maybe, but that would be bug I guess. We can fix large time offset right after the first VM start, but never again.

Maybe reboot? I am not sure yet.

We reboot rarely.

.. anyway, I already thought that we only set the times only on frontend... One might be afraid that the time will be imprecise ... but it shouldn't. When backend sets the time, it actually immediately sends the data to frontend by http request. So, let's set the times on frontend
only (I don't think it needs to be PostgreSQL database).

All we want to achieve in this issue is to assure that

1. When ended_on is set => started_on is set => submitted_on is set
2. ended_on (if set) >= started_on >= submitted_on

It's valid to have those two scenarios:

=> started_on is null and ended_on is null
=> started_on not null and ended_on is null

So with a bit of more fresh head now, here is my response for problems above, and summary:


Problem 1 - started_on > ended_on

This has still one occurence:

select build_chroot.id, os_release, os_version, build_id, ended_on, started_on from mock_chroot join build_chroot on mock_chroot.id = mock_chroot_id where started_on > ended_on order by build_id desc;

I fixed it manually by:

update build_chroot set ended_on = 1579452889 where id = 2186781;

Problem seems to be SOLVED. Completely


Problem 2 - started_on == null && ended_on != null

Nothing extra new here, it's good! There's only
one new build
which means that this can nowadays only happen when we don't set the
started_on field when we are canceling the build:

select * from build_chroot where started_on is null and ended_on is not null order by build_id desc;

Note that the build 1650511 is the newest build id with this problem, and the
second newest one is 1242142 (pretty old one).

Proposed solution here: We shouldn't set ended_on for canceled build chroots.
If builder was fast enough to finish the build_chroot in canceled build, OK (it
will be valid timestamp). But if not, leave it null.

Otherwise it looks there's no more problems in our codebase here!

Proposed DB fix for this was correct:

update build_chroot set started_on = ended_on-1 where started_on is null and ended_on is not null;

Problem 3 - submitted_on > started_on

This indeed has some new instances, but as can be seen here from the following
query, these are all "forks" which just inherited the original (and very old)
messed build_chroot entries:

select build.id, status, submitted_on, started_on from build join build_chroot on build_id = build.id where submitted_on>started_on order by build.id desc;

Number of builds:

select count(1) from (
    select build.id
    from build
    join build_chroot on build_id = build.id
    where submitted_on > started_on
    group by build.id) as foo;

┌───────┐
│ count │
├───────┤
│ 23210 │
└───────┘
(1 row)

The proposed sql query to fix this inconsistency is wrong:

update build set submitted_on=build_chroot.started_on-1 from build_chroot where build.id=build_chroot.build_id and build.submitted_on>build_chroot.started_on;

Per PostgreSQL documentation:

When a FROM clause is present, what essentially happens is that the
target table is joined to the tables mentioned in the from_item list,
and each output row of the join represents an update operation for the
target table. When using FROM you should ensure that the join produces
at most one output row for each row to be modified. In other words, a
target row shouldn't join to more than one row from the other table(s).
If it does, then only one of the join rows will be used to update the
target row, but which one will be used is not readily predictable.

See https://www.postgresql.org/docs/12/sql-update.html

There are generally multiple instances of build_chroot per build. If we have no
guarantee that the lowest value will be used, one submitted_on will be replaced
by one started_on, but there might be other build_chroot.started_on values with
even lower value.

I propose this query instead:

update build set
    submitted_on = subquery.min - 1
from (
    select build.id as id, min(started_on) as min
    from build join build_chroot on build_id = build.id
    where submitted_on > started_on group by build.id
) as subquery where build.id = subquery.id;

Please review.


Problem 4 - submitted_on > ended_on

This is a weird combination of problem 2. and 3. That's because of the premise

ended_on is set => started_on is set => submitted_on is set

So here ended_on is set, submitted_on is set, but started_on isn't:

select build.id,submitted_on, started_on, ended_on from build join build_chroot on build_id=build.id where submitted_on>ended_on order by build.id desc;

Since Problem 1 is fixed even now, when we have 2 fixed (fill the missing
started_on fields) and then we fix 3 (we'll fid the submitted_on) we'll have
fix for problem 4 for free.


What next?

So I propose to:

  1. Fix the code for canceling the build.
  2. Apply DB query to fix problem 2
  3. Apply DB query to fix problem 3
  4. Problem 4 solved

@schlupov, can you take a look at code-fix for Problem 2, and review the awfully long text above?

Problem 1: This has been solved. I can't find any new records.
Problem 2: See PR https://pagure.io/copr/copr/pull-request/1519.
Problem 3: Code

update build set
submitted_on = subquery.min - 1
from (
    select build.id as id, min(started_on) as min
    from build join build_chroot on build_id = build.id
    where submitted_on > started_on group by build.id
) as subquery where build.id = subquery.id;

is ok. I ran the update on my database dump and all 23210 records were updated.
Problem 4: Will be solved by previous problems.

Commit 65b3435 relates to this ticket

We are waiting for a new release here, then we'll run the SQL commands and close this.

Ok, I've run the SQL queries both on production and devel. Done.

Metadata Update from @praiskup:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

3 years ago

Login to comment on this ticket.

Metadata
Attachments 1
Attached 4 years ago View Comment
Related Pull Requests
  • #1519 Merged 3 years ago
  • #1031 Merged 4 years ago