#9153 Use an automated script to control checksums of compose images
Opened 4 years ago by lruzicka. Modified 2 years ago

Describe the issue

The following test is one of the QE activities that we should do during the Fedora release stages: https://fedoraproject.org/wiki/QA:Testcase_Mediakit_Checksums. The results of the test are reported here for every nominated compose: https://fedoraproject.org/wiki/Test_Results:Fedora_32_Rawhide_20200101.n.0_Installation?rd=Test_Results:Current_Installation_Test#Image_sanity.

This test case is often not covered, because it is too complicated to download all of the images and run the commands. We already use several automated test cases in OpenQA to run tests on every compose, and this one would be another good candidate: it is very time-consuming to do manually, but very easy to automate.

However, it does not make sense to download all of the images and run a script in OpenQA or on some local machine. It only makes sense to run such tests on the machine where the images are physically stored, and ideally to report the results automatically.

When do you need this? (2020/02/10)

It would be nice if this functionality were ready before Fedora 32 branches.

When is this no longer needed or useful? (never expires)

It is always useful, until the Fedora release criteria are changed to drop this one.

If we cannot complete your request, what is the impact?

The test case will probably not be fully covered. However, if somebody provides some info on how the images are created and stored, I could lend a hand and try to create the script.

I invite @adamwill, @robatino and @mohanboddu to discuss.

Thanks


Metadata Update from @mohanboddu:
- Issue tagged with: meeting

4 years ago

(Thanks @lruzicka for filing this, I was just about to do the same.)

I just want to point out that we're not asking releng to fill out our test matrices 🙂 We want to get rid of those matrices, and trust releng that those compose artifacts have been published correctly. The way to do that is to have automated checksum verification either at the very end of the compose process, or after the artifacts have been published, i.e. at https://kojipkgs.fedoraproject.org/compose/ and/or https://dl.fedoraproject.org/pub/alt/stage/ (the latter is best, because it also guards us against transmission errors).

Releng can do this very efficiently and their scripts can blow up in case of an invalid checksum, therefore immediately making it clear that something bad has happened during the compose. QA can do it very inefficiently and just for select composes. That's why I think releng is the best place to do this. (In the past I sometimes asked @mohanboddu to verify the checksums manually and then I filled out the test results, but that approach really doesn't scale.)

I believe we're happy to help you implement this, if you need any help.

I think this is a great thing to do. :)

Possible suggestion though: I wonder if we could get pungi to add a phase at the end to generically run QA type scripts? Then you maintain the script and we just pull a copy and run it at the end of pungi runs?

We could start with checksums, but could add other things that make sense.

Or would it just be easier to run this separately after the compose has synced?

I didn't realize @lruzicka had filed this, so I filed basically the same thing myself, where I outlined how I'd see this working, at https://pagure.io/releng/issue/9159 . I or @lruzicka could write most of the actual stuff there; all we'd need co-ordination on is actually deploying it and the wiki credentials.

I somewhat disagree here with @adamw. His proposal is quite complex (listeners, timing issues, wiki reporting, etc) when this is an extremely trivial thing of sha256sum -c *CHECKSUM. I really think this should be a part of releng tooling/scripts and QA shouldn't even track that in our matrices. When checksums are not correct, either the compose should blow up (as it does now for different causes, and QA doesn't perform any testing during that time, it simply blows up when something is incorrect) or the transfer should blow up (when the bits weren't transferred properly to the master location).

> Possible suggestion though: I wonder if we could get pungi to add a phase at the end to generically run QA type scripts? Then you maintain the script and we just pull a copy and run it at the end of pungi runs?
> We could start with checksums, but could add other things that make sense.

I don't have any other ideas at the moment apart from verifying the checksums, so I was hoping simply to add a few lines at the end of some script. But if you think we should set up a generic QA process that gets run at the end of the Pungi compose or compose transfer, I don't mind, and we have no problems maintaining that QA script. The only thing I'm a bit afraid of is that when this verification is separated from the main compose/transfer code, we might miss some changes (like when ISOs are added/removed/renamed/moved, directory structure changes, etc) that would've been otherwise visible if the code lived together.

But I have only vague ideas how the compose gets done. I'm sure releng knows best where to place the checksum check. So tell us what you think works best and how we can help, and we'll try to help :-)

> I think this is a great thing to do. :)
> Possible suggestion though: I wonder if we could get pungi to add a phase at the end to generically run QA type scripts? Then you maintain the script and we just pull a copy and run it at the end of pungi runs?
> We could start with checksums, but could add other things that make sense.

I like this idea.

> Or would it just be easier to run this separately after the compose has synced?

For now we can just add these tests to the nightly.sh script which runs the compose.

> For now we can just add these tests to the nightly.sh script which runs the compose.

Seems good to me, thank you very much. Do you need me to help with something?

@kparal well, it's not that I'm proposing reporting to the wiki exactly, it's that as things stand, that's a thing that needs to be done. The requirement is in the release criteria and we have it in the wiki matrices already. So I saw 'get the result to the wiki' as a checklist item that the proposal had to achieve, not a part of the thing that was actually being proposed.

Baking the check into the compose process does mean that if the check breaks somehow we stop getting composes, and means we have to go through the whole X hour compose cycle again to get a new compose when it happens. But I'm not opposed to it necessarily.

> @kparal well, it's not that I'm proposing reporting to the wiki exactly, it's that as things stand, that's a thing that needs to be done. The requirement is in the release criteria and we have it in the wiki matrices already.

Once the verification is part of releng pipeline, we don't need to have it in release criteria. Or we can keep it there, but remove it from the wiki (we have many criteria which are not covered by wiki). I guess we can discuss this separately so that we don't put OT into this releng ticket.

> Baking the check into the compose process does mean that if the check breaks somehow we stop getting composes, and means we have to go through the whole X hour compose cycle again to get a new compose when it happens.

That's why I want to keep it extremely simple, something that must not break. But this is exactly the problem I meant when I said releng knows best where to incorporate it. If the checksum files are created at the end of the whole compose, it makes little sense to create them and immediately verify them. If the checksum files are created individually after each directory/spin/edition is created, it makes some sense to verify them all together at the very end.

It also makes sense to verify checksums after transferring the compose to the master location, and honestly that's probably the most useful use case for us. It should be simple (in the script that rsyncs the files, do ssh & find checksums & exec sha256sum). It makes problems quickly visible to releng and others if written well (if you do rsync to .tmp && verify checksums && move from .tmp to proper directory, then the compose will not be visible at all if checksums fail, making it obvious something went wrong). It protects the most important composes we care about (release candidates). And if an error happens that's our fault and not the tooling's fault, it should be trivial to fix (fix the command, move the compose to the proper location, no need to redo the compose or anything).
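The "rsync to .tmp && verify checksums && move" flow described above could be sketched roughly as follows. This is only an illustration under stated assumptions: `shutil.copytree` stands in for the real rsync transfer, `publish_compose` and its `verify` callback are hypothetical names, and the actual releng sync script may be organized quite differently.

```python
import shutil
from pathlib import Path


def publish_compose(source, destination, verify):
    """Copy a compose into '<destination>.tmp', verify it, then rename it into place.

    `verify` is a hypothetical callback that takes the staging directory and
    returns a list of problem descriptions (an empty list means success).
    """
    destination = Path(destination)
    if destination.exists():
        raise FileExistsError(f"{destination} is already published")

    staging = destination.with_name(destination.name + ".tmp")
    if staging.exists():
        # Remove leftovers from an earlier failed attempt.
        shutil.rmtree(staging)

    # Stand-in for the real transfer (rsync in the ticket's description).
    shutil.copytree(source, staging)

    problems = verify(staging)
    if problems:
        # The staging copy is kept for inspection; the final directory never
        # appears, so a failed compose is simply not visible to consumers.
        raise RuntimeError("checksum verification failed: " + "; ".join(problems))

    # A rename within one filesystem makes the compose appear all at once.
    staging.rename(destination)
    return destination
```

If the failure turns out to be in the check itself rather than in the compose, the staging directory is still there, so it can be moved into place by hand without redoing the compose.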

> I somewhat disagree here with @adamw. His proposal is quite complex (listeners, timing issues, wiki reporting, etc) when this is an extremely trivial thing of sha256sum -c *CHECKSUM. I really think this should be a part of releng tooling/scripts and QA shouldn't even track that in our

Only doing sha256sum -c *CHECKSUM does not verify that all images are actually mentioned in the CHECKSUM file.

> Only doing sha256sum -c *CHECKSUM does not verify that all images are actually mentioned in the CHECKSUM file.

But it will check if all the files have an entry in that file, right? So if the files have been successfully built and put somewhere along with the CHECKSUM file, it will test that their checksums are there and correct.

> Only doing sha256sum -c *CHECKSUM does not verify that all images are actually mentioned in the CHECKSUM file.

That is true and a good point for implementation, but that is also exactly what our QA testcase currently recommends. When I wrote it, I didn't mean the command literally. Yes, checking each image presence in the CHECKSUM file is a good idea and it means the actual implementation should include a few extra commands.
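As a rough illustration of those "few extra commands", here is a minimal sketch that checks both directions: every entry in a `*CHECKSUM` file must match a file on disk, and every image on disk must have an entry. The line formats it accepts (BSD-style `SHA256 (name) = hash` and GNU-style `hash  name`), the list of image suffixes, and all function names are assumptions made for the example, not something this ticket confirms.

```python
import hashlib
import re
from pathlib import Path

# Assumed checksum line formats (not confirmed by the ticket):
#   BSD style:  SHA256 (name.iso) = <64 hex digits>
#   GNU style:  <64 hex digits>  name.iso
BSD_LINE = re.compile(r"^SHA256 \((?P<name>.+)\) = (?P<digest>[0-9a-fA-F]{64})$")
GNU_LINE = re.compile(r"^(?P<digest>[0-9a-fA-F]{64}) [ *](?P<name>.+)$")


def parse_checksum_file(path):
    """Return {filename: sha256 hex digest}; lines in other formats are ignored."""
    entries = {}
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        match = BSD_LINE.match(line) or GNU_LINE.match(line)
        if match:
            entries[match["name"]] = match["digest"].lower()
    return entries


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large images are not loaded into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_directory(directory, image_suffixes=(".iso", ".qcow2", ".raw.xz")):
    """Return a list of problems; an empty list means everything checked out."""
    directory = Path(directory)
    listed = {}
    for checksum_file in sorted(directory.glob("*CHECKSUM")):
        listed.update(parse_checksum_file(checksum_file))

    problems = []
    # Direction 1: what `sha256sum -c` covers (listed files exist and match).
    for name, expected in sorted(listed.items()):
        target = directory / name
        if not target.is_file():
            problems.append(f"{name}: listed but missing")
        elif sha256_of(target) != expected:
            problems.append(f"{name}: checksum mismatch")

    # Direction 2: what `sha256sum -c` does not cover (images with no entry).
    for image in sorted(directory.iterdir()):
        if image.is_file() and image.name.endswith(image_suffixes) and image.name not in listed:
            problems.append(f"{image.name}: not listed in any CHECKSUM file")
    return problems
```

The second loop is the part that addresses the point raised above: an image that was built but never made it into a CHECKSUM file would otherwise pass silently.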

From today's RelEng meeting:

#info RelEng thinks we should update the nightly script to run the "sha256sum -c" and "checkisomd5" tests, verify that the images are all in the checksum file, and post the results to the compose email. If any one of them fails, we should not sync the compose to the mirrors. We will ask for comments in the ticket.

Please let us know what you think.
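To make the proposal a bit more concrete, here is a minimal sketch of such a gate. It covers only the two external commands; the check that every image is listed in a checksum file is left out for brevity. The function names, the report wording, and the idea of passing the command runner in as a parameter are all assumptions for illustration, and the real nightly.sh is a shell script that may be structured differently.

```python
import subprocess
from pathlib import Path


def plan_commands(directory):
    """List the commands to run in `directory` (file names are relative to it)."""
    directory = Path(directory)
    commands = [["sha256sum", "-c", f.name] for f in sorted(directory.glob("*CHECKSUM"))]
    commands += [["checkisomd5", iso.name] for iso in sorted(directory.glob("*.iso"))]
    return commands


def run_gate(directory, runner=subprocess.run):
    """Run every planned command and return (ok, report_text).

    `runner` defaults to subprocess.run; it is a parameter only so the gate
    can be exercised without the real tools installed.
    """
    lines = []
    ok = True
    for command in plan_commands(directory):
        result = runner(command, cwd=directory, capture_output=True, text=True)
        passed = result.returncode == 0
        ok = ok and passed
        lines.append(f"{'PASS' if passed else 'FAIL'}: {' '.join(command)}")
    # Hypothetical summary line for the compose email.
    lines.append("Compose will be synced." if ok else "Compose will NOT be synced.")
    return ok, "\n".join(lines)
```

The caller would put the report text into the compose email and skip the mirror sync whenever `ok` is false.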

> From today's RelEng meeting:
> RelEng thinks we should update the nightly script to run the "sha256sum -c" and "checkisomd5" tests, verify that the images are all in the checksum file, and post the results to the compose email. If any one of them fails, we should not sync the compose to the mirrors. We will ask for comments in the ticket.

Thank you, I think it is great.

> If any one of them fails, we should not sync the compose to the mirrors

Or mark the compose as DOOMED (perhaps only when the failure occurred for release-blocking images, but maybe for any image) or something similar which is very visible. But overall, yes, that would be great, thanks.

Metadata Update from @syeghiay:
- Issue assigned to humaton

4 years ago

Metadata Update from @cverna:
- Assignee reset
- Issue tagged with: groomed

3 years ago

Metadata Update from @mohanboddu:
- Issue untagged with: meeting

3 years ago

Metadata Update from @humaton:
- Issue assigned to humaton

2 years ago

I have a user account on secondary01.fedoraproject.org and have been running the checksum tests manually by logging into it when an RC appears and is synced to https://dl.fedoraproject.org/pub/alt/stage/ . I can continue to do this as long as each RC continues to be synced to stage before the test results are needed. (Of course automating it would be better.)
