#60 Jobs are sometimes executed twice (with the same UUID)
Closed: Duplicate 6 years ago by kparal. Opened 6 years ago by kparal.

There's a race condition that seems to occur once or multiple times per hour that causes exactly the same tasks (including the same UUID) to be executed twice. Both tasks fail, because they overwrite their files in the output directories. Here's an example:

https://taskotron-dev.fedoraproject.org/taskmaster/#/builders/2/builds/5910
https://taskotron-dev.fedoraproject.org/taskmaster/#/builders/2/builds/5911

Compare the Build Properties tab, it's exactly the same job.

I wonder if this can be an issue with fedmsg when we receive some message twice. But in that case, it should receive a different UUID, right? If that's true, this seems to be an issue with trigger instead, or trigger<->buildbot integration.

I haven't seen this before, so this also might be an issue with the latest buildbot 1.7+ (currently deployed on dev).

CC @frantisekz @tflink @jskladan


In jobs.csv, it's mentioned only once for armhfp:

[root@taskotron-dev01 ~][PROD]# grep 'boost-nowide-0-20171029.gitec9672b.fc30' /var/log/taskotron-trigger/jobs.csv
2019-02-11 21:30:06.436575;boost-nowide-0-20171029.gitec9672b.fc30;koji_build;rpmlint;x86_64;noarch
2019-02-11 21:30:06.592033;boost-nowide-0-20171029.gitec9672b.fc30;koji_build;rpmlint;armhfp;noarch
2019-02-11 21:30:06.675796;boost-nowide-0-20171029.gitec9672b.fc30;koji_build;python-versions;noarch;noarch
2019-02-11 21:30:06.760844;boost-nowide-0-20171029.gitec9672b.fc30;koji_build;rpmgrill;noarch;noarch
2019-02-11 21:30:06.858730;boost-nowide-0-20171029.gitec9672b.fc30;koji_build;abicheck;x86_64;noarch
2019-02-11 21:30:06.987246;boost-nowide-0-20171029.gitec9672b.fc30;koji_build;abicheck;armhfp;noarch

But, I can't find a job for the abicheck;x86_64 part, neither in buildbot nor in resultsdb! It almost seems like the 5910 job should have been scheduled for x86_64 (with a unique UUID), but it was scheduled for armhfp instead...

This is not related to just abicheck and specific arches. Here's an example of double-scheduled rpmgrill on noarch:

https://taskotron-dev.fedoraproject.org/taskmaster/#/builders/2/builds/6040
https://taskotron-dev.fedoraproject.org/taskmaster/#/builders/2/builds/6041

And that job should have been scheduled just once:

[root@taskotron-dev01 ~][PROD]# grep libcue-2.1.0-8.fc30 /var/log/taskotron-trigger/jobs.csv
2019-02-11 22:18:01.989463;libcue-2.1.0-8.fc30;koji_build;rpmlint;x86_64;noarch
2019-02-11 22:18:02.107293;libcue-2.1.0-8.fc30;koji_build;rpmlint;armhfp;noarch
2019-02-11 22:18:02.167086;libcue-2.1.0-8.fc30;koji_build;python-versions;noarch;noarch
2019-02-11 22:18:02.244493;libcue-2.1.0-8.fc30;koji_build;rpmgrill;noarch;noarch
2019-02-11 22:18:02.325802;libcue-2.1.0-8.fc30;koji_build;abicheck;x86_64;noarch
2019-02-11 22:18:02.425296;libcue-2.1.0-8.fc30;koji_build;abicheck;armhfp;noarch

Here's trigger log for the first example:

[2019-02-11 21:30:06][fedmsg.consumers    INFO] triggering abicheck for boost-nowide-0-20171029.gitec9672b.fc30 on x86_64 (run on noarch host)
[2019-02-11 21:30:06][fedmsg.consumers    INFO] BuildbotRunner received:
item: boost-nowide-0-20171029.gitec9672b.fc30
item_type: koji_build
taskname: abicheck
arch: x86_64
host_arch: noarch
[2019-02-11 21:30:06][fedmsg.consumers    INFO] post-hook build request status: 200
[2019-02-11 21:30:06][fedmsg.consumers    INFO] triggering abicheck for boost-nowide-0-20171029.gitec9672b.fc30 on armhfp (run on noarch host)
[2019-02-11 21:30:06][fedmsg.consumers    INFO] BuildbotRunner received:
item: boost-nowide-0-20171029.gitec9672b.fc30
item_type: koji_build
taskname: abicheck
arch: armhfp
host_arch: noarch
[2019-02-11 21:30:07][fedmsg.consumers    INFO] post-hook build request status: 200

And for the second one:

[2019-02-11 22:18:02][fedmsg.consumers    INFO] triggering rpmgrill for libcue-2.1.0-8.fc30 on noarch (run on noarch host)
[2019-02-11 22:18:02][fedmsg.consumers    INFO] BuildbotRunner received:
item: libcue-2.1.0-8.fc30
item_type: koji_build
taskname: rpmgrill
arch: noarch
host_arch: noarch
[2019-02-11 22:18:02][fedmsg.consumers    INFO] post-hook build request status: 200

My theory was that the POST request gets repeated on error or something, but it doesn't seem so. I'm starting to think this might be a buildbot issue.

Buildbot's twistd.log is not much helpful:

2019-02-11 21:30:06+0000 [-] added buildset 5910 to database
...
2019-02-11 21:30:08+0000 [-] starting build <Build all number:5910 results:success>.. pinging the worker <WorkerForBuilder builder='all' worker='qa11-15' state=BUILDING>
...
2019-02-11 21:30:08+0000 [-] acquireLocks(build <Build all number:5910 results:success>, locks [(<WorkerLock(worker_builds, 1)[qa11-15] 140601999098040>, <buildbot.locks.LockAccess object at 0x7fe0764ff2e8>)])
...
2019-02-11 21:31:30+0000 [-]  <Build all number:5910 results:failure>: build finished
2019-02-11 21:31:30+0000 [-] releaseLocks(<Build all number:5910 results:failure>): [(<WorkerLock(worker_builds, 1)[qa11-15] 140601999098040>, <buildbot.locks.LockAccess object at 0x7fe0764ff2e8>)]

Doesn't explain why it was run twice.

Metadata Update from @kparal:
- Issue close_status updated to: Duplicate
- Issue status updated to: Closed (was: Open)

6 years ago

Log in to comment on this ticket.

Metadata