#304 Release blocking status of basic functionality tests for apps
Closed: Fixed 2 years ago by aday. Opened 2 years ago by aday.

We've had a bunch of blocker bugs be proposed against some of our preinstalled apps very late in the F36 cycle. They are:

Various questions here:

  • Should basic functionality tests for all our apps be treated as potential release blockers?
  • How and why were these issues found so late in the cycle?
  • What are our quality expectations for preinstalled apps?

I'll take a shot at this.

  • Should basic functionality tests for all our apps be treated as potential release blockers?

Of course. The basic functionality test has dramatically improved the quality of Fedora releases by ensuring we don't ship broken stuff. If we're unable to maintain the default apps properly, then perhaps we should remove them.

How and why were these issues found so late in the cycle?

I'm not sure about totem.

In the case of gnome-calendar and gnome-contacts, I stopped reporting bugs a long time ago because these apps are frankly so buggy that I'm more surprised when they work properly than when they do not.

We do have a late blocker rule that allows us to waive any issue that would otherwise be considered a blocker if it is reported late. The longer the release cycle runs, the more time QA has for testing more things and finding more bugs.

What are our quality expectations for preinstalled apps?

I would say high enough that we shouldn't ship with bugs like these if reasonably possible.

How and why were these issues found so late in the cycle?

I believe there's a common pattern. The worst bugs discovered extremely late often happen to be in applications that, I believe, very few people from our audience use. And by our audience I mean people testing pre-release images. I'm talking about apps like:

  • gnome-contacts
  • gnome-photos
  • gnome-calendar
  • gnome-maps

(In the past, gnome-documents was also a fine example.)

The problem is that super-basic functionality more or less works in these apps (I'm looking mainly at the first four listed). So if QA tests one quickly (i.e. spends a few minutes in it), it most probably works fine. The same applies to test days. But if you start actually using one for a bit longer, you discover lots of issues, very frequently. Unfortunately QA cannot really spend a lot of time in these apps, testing every button and using them for a long time. Also, they only need to work for the Final milestone, so until Beta, QA hardly touches them, because we have more than enough work with Beta-required stuff. On top of that, around Beta the GNOME megaupdate arrives and changes everything.

Apps that get used regularly by a large audience (including QA) don't suffer from these issues, or at least not that frequently. Nautilus, gnome-terminal, gedit, firefox, libreoffice, calculator - those are well tested throughout the whole cycle. And because testers are actually using those every day, bugs are quickly discovered.

The totem bug is a bit different. I think totem is actually used regularly, but that doesn't include its Videos homepage. I think very few people actually use that functionality, and that's why nobody noticed before.

Why are some of those gnome-* apps not used much? (At least by testers, I have no clue about the general audience).

For me, it boils down to:
1. Opening a new tab in a web browser is the same effort as opening a standalone app. Contacts, photos, calendar, maps, weather... all of that can easily be done through the web.
2. The functionality/feature set of standalone apps is often limited compared to their web counterparts. Instead of offering more, they offer less.
3. If I use the web app, I don't need to be concerned about bugs or imperfections, because I'm using the source (e.g. Google Contacts). Using the standalone app has the risk of not functioning properly (and I got burned many times in the past).
4. The quality of those standalone apps is often not as high as I'd be comfortable with. The web apps often work better.

In the case of gnome-calendar and gnome-contacts, I stopped reporting bugs a long time ago because these apps are frankly so buggy that I'm more surprised when they work properly than when they do not.

Yes, that's a nice way to put it. I believe many people suffer from this 'lost all hope' fatigue. I consider many of these apps to be half-baked (not trying to offend anyone). They have been half-baked for so long, and there's no clear improvement happening, that reporting bugs except the most severe crashes might easily feel like a wasted effort.

We discussed this issue at today's workstation working group meeting.

As far as F36 is concerned, I think the consensus was that the contacts and photos issues should not block the release - they came in too late, will be difficult to fix at short notice, and are not the most popular apps, meaning that users are relatively unlikely to encounter them.

The consensus was also that we shouldn't remove these apps from the release, on the basis that these issues will still affect F35 users, and we want to avoid changing the composition of the install from release to release.

Looking forward to future releases, we have a number of options:

  1. Restricting the basic functionality criterion to a more limited set of preinstalled apps [1]
  2. Specifying in more detail what basic functionality means from app to app (it could be limited for some apps and more extensive for others)
  3. Removing some apps from the install media
  4. Taking steps to improve the maintenance of certain apps, and to improve testing [2].

Personally, I think that we should probably review each of the apps where quality is a concern, with a view to seeing what can be done, and whether we want to keep the app. Once we've done that, we'll be in a better position to decide whether to change the release criteria.

[1] If we were to do this, we'd want to be certain that this wouldn't reduce the amount of testing received by lower priority apps, or dilute our aspiration to have high quality across the set of apps
[2] For example, we could track and put out a call for apps that need maintainers, and we can work to improve the test plans for workstation apps.

"[1] If we were to do this, we'd want to be certain that this wouldn't reduce the amount of testing received by lower priority apps"

In practice, this would definitely happen so far as distribution validation testing is concerned. We (the Fedora QA team) are bashing on Photos and Contacts and Calendar and so on right now (during Final validation testing) because of this criterion. If it's changed, we won't do that any more. I think that's kinda inevitable.

Still, I think this leads to a question: is testing as part of distribution validation really the best/most effective testing for these apps? Doesn't the fact that we're finding critical bugs in very basic features in these apps at the point of Fedora final release testing suggest that the real problem is that some testing is missing upstream? It seems, on the face of it, a bit odd that a "GNOME 42" "stable" release can happen which contains applications that are, let's face it, entirely broken.

At the distribution validation level - especially at Final stage - we're kinda not expecting to be hitting bugs of the "this application is just fundamentally broken" type. We're expecting to be hitting tricky integration bugs, or bugs in fairly deep functionality. When we wrote this criterion we kinda thought of it as a sanity check - I never really expected it to expose as many problems as it has the last few cycles. I was more expecting it to catch the odd case of a broken dev version of an app accidentally being included in a stable release, or some ancient unmaintained app not being dropped from an image, or something. Not cases where something that is ostensibly a maintained core application for a major desktop just has huge bugs in it.

If it's changed, we won't do that any more. I think that's kinda inevitable.

Functionally like the optical media testing. QA no longer tests it, but it's still blocking if someone finds that it's not working. Apps need users who need the app to work to file bugs, and apps need maintainers to fix the bugs, but they also need users or why bother? There's a circular dependency.

Expectation:

Still, I think this leads to a question: is testing as part of distribution validation really the best/most effective testing for these apps? Doesn't the fact that we're finding critical bugs in very basic features in these apps at the point of Fedora final release testing suggest that the real problem is that some testing is missing upstream? It seems, on the face of it, a bit odd that a "GNOME 42" "stable" release can happen which contains applications that are, let's face it, entirely broken.

Reality: The real testing happens in Fedora, and it does not begin until Fedora beta is released.

I'm exaggerating slightly. Arch and Ubuntu and Debian and Gentoo users all test GNOME and report bugs too. But a disproportionate amount of this testing comes from Fedora, and from Fedora QA in particular.

Moving the Fedora beta releases earlier in the schedule would help align things a little better IMO. (I've already pushed back GNOME releases as far as possible: one week later would cause problems for Ubuntu, which like GNOME has a very longstanding schedule it wants to adhere to, even more so than Fedora does.)

Expectation:

Still, I think this leads to a question: is testing as part of distribution validation really the best/most effective testing for these apps? Doesn't the fact that we're finding critical bugs in very basic features in these apps at the point of Fedora final release testing suggest that the real problem is that some testing is missing upstream? It seems, on the face of it, a bit odd that a "GNOME 42" "stable" release can happen which contains applications that are, let's face it, entirely broken.

Reality: The real testing happens in Fedora, and it does not begin until Fedora beta is released.

It used to happen earlier when we had Alphas. Then we said "Rawhide is Alpha", but then we didn't put it anywhere for people to easily choose to do that. We also have branched nightlies, but outside of one particularly hidden site, there's no exposure like the Beta release has. We also don't offer up on the brochure site a way for people to get composes to try. We don't even update the brochure sites with RCs either. So what are people supposed to test? Naturally, the beta.

I'm exaggerating slightly. Arch and Ubuntu and Debian and Gentoo users all test GNOME and report bugs too. But a disproportionate amount of this testing comes from Fedora, and from Fedora QA in particular.

Moving the Fedora beta releases earlier in the schedule would help align things a little better IMO. (I've already pushed back GNOME releases as far as possible: one week later would cause problems for Ubuntu, which like GNOME has a very longstanding schedule it wants to adhere to, even more so than Fedora does.)

Actually, I would like GNOME's schedule to move earlier rather than later. There's no room to get things fixed because of how late it releases.

If GNOME relies on Fedora to actually get quality engineering done, then that means things need to be earlier not later.

I don't think that's entirely accurate, or very relevant. It's not entirely accurate because the test was never marked as Alpha, and everybody hates running that test so it rarely gets run unless it has to be run. It's always been either Beta or Final; it may have always been Final, I forget. But it's never been Alpha.

It's also not true to say that nightlies don't get any exposure, because we do create validation events for them and announce those to test-announce. They are not widely publicized outside of Fedora's internal channels, but that's not the same thing as "no exposure".

There is a sub-issue here that in the 36 cycle we did run this test at Beta, but the person who did it then went with a very minimal interpretation of "basic functionality", whereas the people who ran it at Final went with a broader one. That's something we could work on. However, to me, this is all details and the broad picture is that distribution release validation testing is not supposed to be upstream functionality testing. If GNOME is relying on Fedora's distribution validation process for functional testing of its applications, to me, that is clearly not the best situation, and with my Fedora QA hat on, I can say that's not what it's meant for and we are not going to optimize the process for it unless it's very painless to do so.

Actually, I would like GNOME's schedule to move earlier rather than later. There's no room to get things fixed because of how late it releases.

If GNOME relies on Fedora to actually get quality engineering done, then that means things need to be earlier not later.

Well we tried going earlier for a couple years, but this just seemed to result in lower-quality .0 releases. The goal is for GNOME's .0 to be good. But the earlier it is relative to Fedora's beta, the worse it will be in practice.

We're now seven weeks out from the .0 release. That's plenty of time to find bugs. I don't think going earlier would have helped at all....

I've reopened the gnome-photos issue, and have created issues to evaluate gnome-calendar and gnome-contacts:

It would be good for us to look at these for F37, and use the results to help us decide what to do about this issue.

If there are any other apps that should also be evaluated, just shout.

If there are any other apps that should also be evaluated, just shout.

I'd suggest also considering gnome-maps. It hasn't been particularly broken this release (there were some issues in the past), but that's not my point. I see two major issues:

  1. I don't understand the app's existence. I need to be online in order to use it, so why wouldn't I use a web browser instead (openstreetmap or google maps or our local national maps)? What benefits does the standalone app have? I would only understand it if I had a Fedora phone or something, but I don't understand it on a PC.
  2. The major functionality, searching, is near unusable. It's not a bug, it's just... poor. See this issue and this recent comment from our QA member. I can't even find major Czech cities in it (and the results are not even deterministic, sometimes it works better, sometimes worse). The online maps don't suffer from this, of course I'll use those instead.

I don't object to the app's existence per se, I don't care (if it is useful for someone). I just don't understand why we have it in the default install, which means we need to burn time and energy on it each cycle.

Has GNOME ever considered merging some apps? Like gnome-music with totem and gnome-photos with eog, for example. This might improve the app audience, increase testing and avoid fragmentation.

EDIT: My first notion of Michael's reply was that Fedora QA is doing too little. I might have misinterpreted what it meant, because folks in my team think it meant the opposite.

Reality: The real testing happens in Fedora, and it does not begin until Fedora beta is released.

This is not just slightly exaggerated. This is quite exaggerated. Before I joined the Fedora QA team, I was a happy Xorg and Fluxbox user and never had any Gnome experience. Joining the Fedora QA, I stopped using Xorg, stopped using Fluxbox and became a sole daily user of Gnome on Wayland to be able to see possible bugs as early as possible. However, I have been facing some bug reporting obstacles that nobody had mentioned in this thread:

  • Gnome teams have left Bugzilla for their Gitlab and Gitlab issues and reporting a bug to Bugzilla gets little attention from them. This means these bugs often go unnoticed and unsolved, although reported early.

  • Gnome issues are spread all over various repositories and it often happens that reporting a bug under a certain application receives an instant closure because it is not their application problem. I would expect that Gnome developers with much more Gnome (and its libraries and components) experience should be the ones to redirect the issue instead of just closing it, leaving the reporter helpless with little energy to file the bug elsewhere hoping for better results.

Also, I am the one who performed the earlier testing of Gnome applications and did not find those late blockers. With those applications, I checked that they started, looked a little bit around, checked the menus, checked the About section, so when I added my first contact, it was long after the first 30 seconds, so I did not see the 25 seconds crash that @kparal found. Also, with Calendar, I did not attempt to delete several events quickly in a row, but I added one, edited it, and deleted it. Also, I did not try to make three events that would spread over several months, because this is not how I use the calendar.

I sort of feel like the blame ball is in my playground now, but for me, the apps were working as I wanted to use them.

I'm exaggerating slightly. Arch and Ubuntu and Debian and Gentoo users all test GNOME and report bugs too. But a disproportionate amount of this testing comes from Fedora, and from Fedora QA in particular.

It is very interesting that all those Arch, Ubuntu, Debian and Gentoo users did not find the bugs either. They could have been fixed a long time ago.

Thanks for your feedback, @lruzicka !

Gnome teams have left Bugzilla for their Gitlab and Gitlab issues and reporting a bug to Bugzilla gets little attention from them. This means these bugs often go unnoticed and unsolved, although reported early.

That's #131 - there are plans to make some improvements there.

Gnome issues are spread all over various repositories and it often happens that reporting a bug under a certain application receives an instant closure because it is not their application problem.

If you experience this again, please let the working group know about it (ideally in a new issue). Once we have specific examples we can raise the issue upstream and try and improve things.

Metadata Update from @aday:
- Issue untagged with: meeting
- Issue tagged with: qa

2 years ago

If you experience this again, please let the working group know about it (ideally in a new issue). Once we have specific examples we can raise the issue upstream and try and improve things.

On GNOME GitLab, issues can be moved from one component to another. I would expect and hope for developers to move the issue to a more appropriate component when known. This will cause the original issue to be closed in the original project, and a new issue to be opened in the new project with a copy of the comments from the original issue. Closing the original issue outright should only be required when moving to a different bugtracker is required. (E.g. half of Epiphany bugs are really WebKit bugs and need to be reported on WebKit Bugzilla for any hope of resolution, so best resolution on GNOME GitLab is to close them.)

The workstation working group discussed this issue again during Tuesday's meeting (that was 10 May), and I've created tickets for actions which I think there was support for:

  • Improving the documentation for preinstalled app acceptance tests - #310
  • Making rawhide+GNOME mainline images continuously available for testing - #311
  • Tightening up our existing testing arrangements - #312

It didn't seem that there was a huge appetite for us to review Contacts, Photos and Calendar ourselves, though I still feel that that would be a useful exercise.

I've also created a QA issue tag in this project, which can be used to track workstation QA initiatives more generally.

We can continue to use this ticket to review the release criteria for preinstalled apps. Copying those here for reference:

"for Fedora Workstation on the x86_64 architecture, all applications installed by default which can be launched from the Activities menu must meet this [basic functionality] requirement."

"Basic functionality means that the app must at least be broadly capable of its most basic expected operations, and that it must not crash without user intervention or with only basic user intervention."

Metadata Update from @chrismurphy:
- Issue tagged with: meeting

2 years ago

Metadata Update from @chrismurphy:
- Issue untagged with: meeting

2 years ago

This issue has resurfaced for F37. The following gnome-calendar bug is being treated as a release blocker, despite the working group's view that it should be ignored:

https://bugzilla.redhat.com/show_bug.cgi?id=2111003

It would therefore seem that we're going to have to investigate whether we can change the default application basic functionality requirement for workstation.

The simplest solution would be to only apply the basic functionality requirement to the list of apps that is used for the other Fedora flavours.

Changing the criterion does not make sense to me. That criterion exists to stop us from releasing horribly broken stuff. Fact is Contacts is currently horribly broken. I think it will be fixed in time because Niels is working on it -- I just chatted with him yesterday to touch base regarding that issue -- but if it's not fixed in time for release, we should remove Contacts. Removing the app is an acceptable way to avoid blocker issues if we feel the app is not important enough to truly block release on (it isn't).

Changing the criterion does not make sense to me. That criterion exists to stop us from releasing horribly broken stuff. Fact is Contacts is currently horribly broken.

So the working group discussed Contacts, Calendar and Photos recently. In all three cases, there are known issues that go against the basic functionality criterion, and in all three cases, the WG decided that it wanted to keep the apps. If we think that the basic functionality criterion should apply, we should move to remove these apps ourselves.

Well at the expense of rinse and repeat - if it's going to be imminently fixed, and sounds like it is, and also within the lifecycle of the release in question, and that too sounds like it's the case - don't block and don't remove it? I know it's more than a bit comical, because we keep kicking this can. But let's say when it becomes clear it will not be suitably fixed anytime in both the F37 and F38 lifecycles, then sure, remove it. It being any app.

Well at the expense of rinse and repeat - if it's going to be imminently fixed, and sounds like it is, and also within the lifecycle of the release in question, and that too sounds like it's the case

The main question, I think, is whether this is just a single issue, or whether it is an example of a larger set of equivalent issues that we are not tracking.

My sense is that it's the latter. We know that there are quality issues around certain apps. We know that there are issues like Photos crashing when you try to import [1], and Calendar having trouble deleting events [2, 3]. I know for certain that we shipped Photos for several releases, even though it would crash if you tried to edit the colours in a photo.

If this isn't just a single issue, then it feels like we're missing the real problem, and we're going to find ourselves in the same position again as QA identifies more problems.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=2082732
[2] https://bugzilla.redhat.com/show_bug.cgi?id=2079356
[3] https://bugzilla.redhat.com/show_bug.cgi?id=2079792

In the end it just boils down to what you (desktop SIG) want your product (Workstation) to be. All I want is for the policies to be in line.

If you want to ship broken apps, we have to change the criterion - as suggested, we can change it so only the specified set of apps is 'guaranteed' on Workstation x86_64 too.

If you want to fix broken apps or remove them, the criteria are fine, and we just need to make sure we're identifying the brokenness early enough. As I pointed out on the bug, we took that feedback from last cycle and are testing those apps earlier, which is why we're finding bugs like the Contacts one two months before final release.

In the end it just boils down to what you (desktop SIG) want your product (Workstation) to be. All I want is for the policies to be in line.

Right. The working group has recently decided to continue shipping apps with known quality issues. To be consistent we should either change the policy, or change which apps we ship.

(Assuming that the quality issues are real.)

Is there some language or test to help categorize bugs as "cosmetic" versus "non-functional"? Often there's a gray area there and it might help to just have a qualifier or test for this distinction. Cosmetic bug means some functionality may be lost but there are other aspects of the app that still work and provide function.

The working group discussed this issue yesterday. The consensus there was that we want to continue to apply the basic functionality criterion to all preinstalled apps. Testing these apps, identifying blockers, and seeking to have them fixed was judged to be an effective way to ensure quality.

We know that there are some preinstalled apps that are not heavily used, and that have quality issues. When blockers are identified in these apps, we will have to decide what to do on a case by case basis. If a blocker issue really does affect basic functionality, and we don't want to block a release on the app, then we'll have to choose between fixing the issue or removing the app from the preinstalled app set.

In this regard, the request to waive the blocker status of 2111003 was incorrect. The correct thing would have been to propose that we remove the app from the preinstalled app set if the issue wasn't fixed prior to final.

Metadata Update from @aday:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

2 years ago

Is there some language or test to help categorize bugs as "cosmetic" versus "non-functional"?

Are you thinking about importance versus severity? (Where importance is how commonly used a feature is, or how central it is to the product's role, and severity is the degree to which the issue affects the user.)

If a blocker issue really does affect basic functionality, and we don't want to block a release on the app, then we'll have to choose between fixing the issue or removing the app from the preinstalled app set.

I'll just add that "basic functionality" is a very vague definition and we'll be very happy to receive your feedback during blocker review meetings or in our discussion tickets. If you feel that something has been judged incorrectly, please raise that topic and we can reopen the discussion.

Are you thinking about importance versus severity? (Where importance is how commonly used a feature is, or how central it is to the product's role, and severity is the degree to which the issue affects the user.)

Both? I guess it depends. If a feature is seldom used, the issue probably won't affect many users. But either sounds like a useful way to make "basic functionality" less vague.

Both? I guess it depends. If a feature is seldom used, the issue probably won't affect many users. But either sounds like a useful way to make "basic functionality" less vague.

Right. In my mind, "basic functionality" is a composite of severity and importance. If each were a 5 point scale, you could say (for example) that an issue breaks basic functionality if it simultaneously gets an importance rating of ≥3 and a severity rating of ≥4.
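
As a rough illustration of that composite rule, here is a minimal Python sketch. The `Issue` class, its field names and the `breaks_basic_functionality` function are hypothetical and not part of any Fedora tooling or policy; the thresholds are only the example values mentioned above (importance ≥3 and severity ≥4 on 5 point scales).

```python
from dataclasses import dataclass

# Example thresholds from the comment above; these are illustrative
# values, not an agreed Fedora policy.
IMPORTANCE_THRESHOLD = 3
SEVERITY_THRESHOLD = 4


@dataclass(frozen=True)
class Issue:
    """A bug rated on two hypothetical 5 point scales."""
    summary: str
    importance: int  # how commonly used / central the feature is (1-5)
    severity: int    # how badly the issue affects the user (1-5)

    def __post_init__(self):
        # Reject ratings outside the assumed 1-5 range.
        for name in ("importance", "severity"):
            value = getattr(self, name)
            if not 1 <= value <= 5:
                raise ValueError(f"{name} must be between 1 and 5, got {value}")


def breaks_basic_functionality(issue: Issue) -> bool:
    """Return True only if both ratings meet their thresholds."""
    return (issue.importance >= IMPORTANCE_THRESHOLD
            and issue.severity >= SEVERITY_THRESHOLD)
```

Under these assumed thresholds, a severe crash in a rarely used feature (importance 2, severity 5) would not count as breaking basic functionality, while a severe issue in a central feature (importance 4, severity 4) would. Requiring both ratings at once keeps minor glitches in core features and major breakage in obscure features out of the blocker set.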
