Can we get MariaDB GitHub CI to consistently be green?
Hi!

Thanks to everyone who has worked on polishing the CI tests and integrations. Looking at e.g. the 10.11 branch[1] I see that commit ccb7a1e[2] has a green checkmark next to it and all 15 CI jobs passed.

I just wanted to check if everyone developing the MariaDB Server is committed to getting the CI to be consistently green? This means that GitHub rules need to be a bit more strict, not allowing any failing test jobs at all, and developers and managers need to agree to stop adding new commits on any release branch if it can't be done without a fully passing CI run.

Thanks to Daniel's review of the latest failures we know these are recurring at the moment:

* MDEV-25614 Galera test failure on GCF-354 included in MDEV-33073 always green buildbot
* MDEV-33785 aarch64 macos encryption.create_or_replace
* MDEV-33601 galera_3nodes.galera_safe_to_bootstrap test failing
* MDEV-33786 galera_3nodes.galera_vote_rejoin_mysqldump mysql_shutdown failed at line 85:

I see two approaches to get to consistently green CI:

1) Stop all development and focus on just fixing these, don't continue until CI is fully green, and once it is fully green make the GitHub branch protection settings one notch stricter to not allow any new commits unless the CI is fully green so it never regresses again.

2) Disable these tests and make the rules in GitHub branch protection one notch stricter right away, and not allow any new commits unless the CI is fully green, ensuring no new recurring failures are introduced.

- Otto

[1] https://github.com/MariaDB/server/commits/10.11
[2] https://github.com/MariaDB/server/commit/ccb7a1e9a15e6a47aba97f9bdbfab2e4bf6...
CI bugs are being treated very seriously at the moment via MDEV-33073 always green buildbot, being a Blocker bug that includes all the CI failures that we notice.

If you notice any others, like you did last week, do mention them or include them as part of this parent task if they aren't already.

I was looking at the Fedora skip lists - https://src.fedoraproject.org/rpms/mariadb10.11/tree/rawhide - some I see are fixed, but I've yet to find their CI page to see what is currently missing.

On Wed, 3 Apr 2024 at 01:33, Otto Kekäläinen via developers <developers@lists.mariadb.org> wrote:
Hi!
Thanks to everyone who has worked on polishing the CI tests and integrations.
Looking at e.g. 10.11 branch[1] I see that commit ccb7a1e[2] has a green checkmark next to it and all 15 CI jobs passed.
I just wanted to check if everyone developing the MariaDB Server is committed to getting the CI to be consistently green?
This means that GitHub rules need to be a bit more strict, not allowing any failing test jobs at all, and developers and managers need to agree to stop adding new commits on any release branch if it can't be done without a fully passing CI run.
Thanks to Daniel's review on latest failures we know these are at the moment recurring:
* MDEV-25614 Galera test failure on GCF-354 included in MDEV-33073 always green buildbot
* MDEV-33785 aarch64 macos encryption.create_or_replace
* MDEV-33601 galera_3nodes.galera_safe_to_bootstrap test failing
* MDEV-33786 galera_3nodes.galera_vote_rejoin_mysqldump mysql_shutdown failed at line 85:
I see two approaches to get to consistently green CI:
1) Stop all development and focus on just fixing these, don't continue until CI is fully green, and once it is fully green make the GitHub branch protection settings one notch stricter to not allow any new commits unless the CI is fully green so it never regresses again.
2) Disable these tests and make the rules in GitHub branch protection one notch stricter right away, and not allow any new commits unless the CI is fully green ensuring no new recurring failures are introduced.
- Otto
[1] https://github.com/MariaDB/server/commits/10.11
[2] https://github.com/MariaDB/server/commit/ccb7a1e9a15e6a47aba97f9bdbfab2e4bf6...
_______________________________________________
developers mailing list -- developers@lists.mariadb.org
To unsubscribe send an email to developers-leave@lists.mariadb.org
Hi!
CI bugs are being treated very seriously at the moment via MDEV-33073 always green buildbot, being a Blocker bug that includes all the CI failures that we notice.
Thanks for the reply Daniel, you have always been one of those taking the CI very seriously.

The reason I wrote to the developers mailing list is that I wish to raise this for a wider audience and get input from both core contributors and other contributors.

For example Trevor (CC'd as I am not sure if he is on this list) filed https://github.com/MariaDB/server/pull/2958 which failed in CI. Since the mainline was already failing ("red") and the PR submission showed lots of failing tests, Trevor had to do a lot of extra work figuring out which tests failed due to his changes, and which ones were already broken (which led to 3 separate PRs now in #3075, #3076 and #3077).

I suspect core developers don't suffer from failing CI to the same extent as they simply bypass it, or have much more time on their hands and can spend time learning what failures can be ignored which week and month. The fact that the CI is not green seems to be a topic where the core developers are perhaps a bit blind to the bigger picture, while non-core contributors struggle with the extra work it incurs. Also in the eyes of the wider public, a constantly failing CI erodes trust in quality.

While I understand that the natural reply is "we will get to green soon" and it makes a lot of sense, I am afraid it might be overly optimistic. In the past we have repeatedly had the situation that Daniel, Sergei and Monty all say the same week they want to fix all failing tests, but it only lasts for a short while and then we are back to failures on mainline CI.

Thus, to permanently enforce a green CI on mainline branches I proposed:
I see two approaches to get to consistently green CI:
1) Stop all development and focus on just fixing these, don't continue until CI is fully green, and once it is fully green make the GitHub branch protection settings one notch stricter to not allow any new commits unless the CI is fully green so it never regresses again.
2) Disable these tests and make the rules in GitHub branch protection one notch stricter right away, and not allow any new commits unless the CI is fully green ensuring no new recurring failures are introduced.
What do other developers think about this? - Otto
Hi, Otto, On Apr 09, Otto Kekäläinen via developers wrote:
Hi!
CI bugs are being treated very seriously at the moment via MDEV-33073 always green buildbot, being a Blocker bug that includes all the CI failures that we notice.
Thanks for the reply Daniel, you have always been one of those taking the CI very seriously.
The reason I wrote to the developers mailing list is that I wish to raise this for a wider audience and get input from both core contributors and other contributors.
For example Trevor (CC'd as I am not sure if he is on this list) filed https://github.com/MariaDB/server/pull/2958 which failed in CI. Since the mainline was already failing ("red") and the PR submission showed lots of failing tests, Trevor had to do a lot of extra work figuring out which tests failed due to his changes, and which ones were already broken (which led to 3 separate PRs now in #3075, #3076 and #3077).
I suspect core developers don't suffer from failing CI to the same extent as they simply bypass it, or have much more time on their hands
Nobody can ignore CI failures except for admins. And even for them it's not easy - go to settings, disable branch protection, push, enable branch protection. I doubt they do it often.
and can spend time learning what failures can be ignored which week and month. The fact that the CI is not green seems to be a topic where the core developers are perhaps a bit blind to the bigger picture, while non-core contributors struggle with the extra work it incurs. Also in the eyes of the wider public, a constantly failing CI erodes trust in quality.
As Daniel wrote, there's MDEV-33073 "always green buildbot", and it's a blocker, which means it *will* be done before the next release. Take a look at the 10.5 branch - I've done >30 commits in the last couple of weeks specifically to fix sporadic test failures. This will be merged up soon.
While I understand that the natural reply is "we will get to green soon" and it makes a lot of sense, I am afraid it might be overly optimistic. In the past we have repeatedly had the situation that Daniel, Sergei and Monty all say the same week they want to fix all failing tests, but it only lasts for a short while and then we are back to failures on mainline CI.
This is what branch protection is for. It wasn't able to do much as tests were constantly failing. Now it can.
Thus, to permanently enforce a green CI on mainline branches I proposed:
I see two approaches to get to consistently green CI:
1) Stop all development and focus on just fixing these, don't continue until CI is fully green, and once it is fully green make the GitHub branch protection settings one notch stricter to not allow any new commits unless the CI is fully green so it never regresses again.
2) Disable these tests and make the rules in GitHub branch protection one notch stricter right away, and not allow any new commits unless the CI is fully green ensuring no new recurring failures are introduced.
What do other developers think about this?
I'm doing both: I fix what I can and disable the rest, creating MDEVs for disabled tests to have them fixed by the corresponding developer.

Regards,
Sergei
Chief Architect, MariaDB Server
and security@mariadb.org
Hi!
As Daniel wrote, there's MDEV-33073 "always green buildbot", and it's a blocker, which means it *will* be done before the next release.
Take a look at the 10.5 branch - I've done >30 commits in the last couple of weeks specifically to fix sporadic test failures.
This will be merged up soon.
Glad to hear!
While I understand that the natural reply is "we will get to green soon" and it makes a lot of sense, I am afraid it might be overly optimistic. In the past we have repeatedly had the situation that Daniel, Sergei and Monty all say the same week they want to fix all failing tests, but it only lasts for a short while and then we are back to failures on mainline CI.
This is what branch protection is for. It wasn't able to do much as tests were constantly failing. Now it can.
Cool, looking forward to seeing branch protection enforcing CI stays green. - Otto
Hi! On Thu, 11 Apr 2024 at 22:26, Otto Kekäläinen <otto@kekalainen.net> wrote:
As Daniel wrote, there's MDEV-33073 "always green buildbot", and it's a blocker, which means it *will* be done before the next release.
Take a look at the 10.5 branch - I've done >30 commits in the last couple of weeks specifically to fix sporadic test failures.
This will be merged up soon.
Glad to hear!
While I understand that the natural reply is "we will get to green soon" and it makes a lot of sense, I am afraid it might be overly optimistic. In the past we have repeatedly had the situation that Daniel, Sergei and Monty all say the same week they want to fix all failing tests, but it only lasts for a short while and then we are back to failures on mainline CI.
This is what branch protection is for. It wasn't able to do much as tests were constantly failing. Now it can.
Cool, looking forward to seeing branch protection enforcing CI stays green.
I wanted to follow up on the topic of enforcing that CI stays green by using branch protection in GitHub.

The problem I have been witnessing is that new contributors such as Trevor and ParadoxV5 end up wasting a lot of time researching the failing CI only to discover that none of the failures were caused by their submission and that the main branch already had CI failures. Essentially, they learn not to respect the CI status for MariaDB submissions.

As Daniel and Sergei responded, the CI failures are tracked via https://jira.mariadb.org/browse/MDEV-33073 "always green buildbot" as critical bugs, and fixed as soon as possible. What I tried to argue is that it is an eternal game of catch-up, and my recommendation would be to turn on the branch protection *now*, take the initial pain of having a stricter policy, and after that reap the benefits of having a consistently green CI on the main branch, and on most of the new PRs by new contributors.

When I now look at https://github.com/MariaDB/server/commits/11.6/ I see a red cross on all commits since July 8th (last green one was 44af9bf). Looking at the summary by Hashim about 10 consecutive CI runs on 25b5c63 at https://github.com/MariaDB/server/pull/3425, at least the job amd64-fedora-38-last-N-failed is always failing, and it would not have found its way into the codebase if a protected branch policy required that all commits have a passing CI status before getting pushed/merged.

Thanks for considering the proposal,

- Otto

PS. If all MariaDB core developers tried using the GitHub PR workflow for a couple of weeks you would most likely run into this and many other smaller papercuts yourself, and a positive cycle would likely follow from such "dogfooding", which could lead to an improved external contributor experience overall.
Otto Kekäläinen via developers <developers@lists.mariadb.org> writes:
least the job amd64-fedora-38-last-N-failed is always failing and it would not have found its way into the codebase if a protected branch policy would require that all commits must have a passing CI status before getting pushed/merged.
This is not correct, unfortunately.

You are assuming that CI status is consistent for a given commit. If this was the case, getting a green CI would be easy, just a matter of discipline, as you say.

The problem is testcases that fail sporadically; that is, they normally pass but fail at random in a small percentage of runs. Branch protection will do nothing to prevent these failures from entering the tree. It just makes developers waste time clicking "retry" on the builders to try and get lucky on another test run.

To get the failures fixed, someone has to spend the considerable time and effort required to debug the failure and understand and fix the issue. Either by debugging herself, or researching in git history and finding and working with the appropriate developer to solve the issue.

Is it harsh to expect this from pull request authors? Maybe. But *someone* has to do it. If not the person who sees the failure in his test run, then who?

- Kristian.
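To put rough numbers on the retry argument above, here is a minimal sketch. It assumes that sporadic failures are independent and that every flaky test fails at the same rate; neither assumption comes from the thread, and the 2% rate and the count of 10 flaky tests are purely illustrative.

```python
def pass_probability(failure_rate: float, flaky_tests: int) -> float:
    """Probability that a single CI run is fully green.

    Assumes (hypothetically) that each of `flaky_tests` tests fails
    independently with probability `failure_rate`.
    """
    return (1.0 - failure_rate) ** flaky_tests


def green_within_retries(failure_rate: float, flaky_tests: int, attempts: int) -> float:
    """Probability that at least one of `attempts` runs is fully green."""
    single_run = pass_probability(failure_rate, flaky_tests)
    return 1.0 - (1.0 - single_run) ** attempts


if __name__ == "__main__":
    # Illustrative numbers only: 10 flaky tests, each failing in 2% of runs.
    for attempts in (1, 2, 3):
        chance = green_within_retries(0.02, 10, attempts)
        print(f"{attempts} attempt(s): {chance:.3f} chance of a green run")
```

Under these assumptions a single run is green about 82% of the time, and a couple of retries push that well above 90%. That is consistent with the point that a gate alone can be passed by retrying, without the underlying sporadic failure being fixed.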
Hi!
least the job amd64-fedora-38-last-N-failed is always failing and it would not have found its way into the codebase if a protected branch policy would require that all commits must have a passing CI status before getting pushed/merged.
This is not correct, unfortunately.
You are assuming that CI status is consistent for a given commit. If this was the case, getting a green CI would be easy, just a matter of discipline, as you say.
The problem is testcases that fail sporadically; that is, they normally pass but fail at random in a small percentage of runs. Branch protection will do nothing to prevent these failures from entering the tree. It just makes developers waste time clicking "retry" on the builders to try and get lucky on another test run.
A later paragraph in my original email on August 2nd stated that it is not just a rare random thing:
When I now look at https://github.com/MariaDB/server/commits/11.6/ I see a red cross on all commits since July 8th (last green one was 44af9bf).
Some failures may be sporadic, for sure. I would still argue that applying branch protection to require CI to be green will *also* help with the random ones, because gatekeeping acts as a forcing function that raises the bar on test-related code quality.
To get the failures fixed, someone has to spend the considerable time and effort required to debug the failure and understand and fix the issue.
Yes, somebody has to, and the motivation to do so is not there if there is no reward in doing so.

Every single project I have been involved in that applied branch protection and required CI to always be green had an initial pain, but soon afterwards a radically improved CI quality. By quality I mean those projects had far fewer CI failures seep into the main branch, including fewer random failures.

Allowing failures causes alert fatigue and more failures start to seep in.

Having a clear gating policy and requiring all tests to pass forces everyone to quickly agree on what tests actually should be in the CI to begin with, and make sure they are well maintained. With the policy in place, basically every developer is motivated to participate in maintaining tests as otherwise during a red mainline event no developer can do anything.

- Otto
Hi Otto, On Tue, Aug 6, 2024 at 6:45 AM Otto Kekäläinen via developers <developers@lists.mariadb.org> wrote:
A later paragraph in my original email on August 2nd stated that it is not just a rare random thing:
When I now look at https://github.com/MariaDB/server/commits/11.6/ I see a red cross on all commits since July 8th (last green one was 44af9bf).
As far as I can tell, there are several contributing factors to this problem.

One is that there are two CI systems. The pull request workflow uses https://buildbot.mariadb.org, which is mostly based on Docker images and a somewhat newer version of the Buildbot software than the one that is managed by https://buildbot.mariadb.net and is based on virtual machines. Like you write, many core developers do not use GitHub pull requests and seem to pay attention to buildbot.mariadb.net only, ignoring any failures that would occur on buildbot.mariadb.org.

For some reason, buildbot.mariadb.net has been configured so that some platforms only build "main branches". If a development branch broke things on some platform that is not covered for "non-main branches" on buildbot.mariadb.net, then such failures would typically be ignored until the change reaches the main branch. Worse, many developers do not watch the main branch status at all, before or after the commit.

Paying attention to buildbot.mariadb.org would lead to a better result, because each change will be scheduled on each builder. However, not all builders are created equal:

* Only some builders are mandatory for branch protection.
* Only some of the mandatory and non-mandatory builders report status to GitHub.
* There are builders that are "invisible" to GitHub, mainly visible in the "grid view", say, https://buildbot.mariadb.org/#/grid?branch=10.11 or https://buildbot.mariadb.org/#/grid?branch=refs%2Fpull%2F3030%2Fmerge for https://github.com/MariaDB/server/pull/3030/.

Many reviewers and developers seem to be unaware that you should pay attention also to such "hidden failures". Some could also think that some ISA such as POWER or IBM Z (s390x) are "exotic" and not worth any attention.
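The three kinds of builders described above can be modelled with a small sketch. The `BuilderResult` type, the bucket names and the sample builder names are hypothetical and only illustrate the distinction; they do not reflect the actual configuration of buildbot.mariadb.org.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class BuilderResult:
    """One builder's outcome for a commit (hypothetical model)."""
    name: str
    mandatory: bool          # required by branch protection
    reports_to_github: bool  # status is shown on the commit or PR page
    passed: bool


def classify_failures(results: Iterable[BuilderResult]) -> Dict[str, List[str]]:
    """Group failed builders by how visible the failure is to a contributor."""
    buckets: Dict[str, List[str]] = {"gating": [], "visible": [], "hidden": []}
    for result in results:
        if result.passed:
            continue
        if result.mandatory:
            buckets["gating"].append(result.name)   # blocks the push
        elif result.reports_to_github:
            buckets["visible"].append(result.name)  # red cross, but not blocking
        else:
            buckets["hidden"].append(result.name)   # only seen in the grid view
    return buckets


if __name__ == "__main__":
    # Made-up builder names and flags, for illustration only.
    sample = [
        BuilderResult("example-mandatory", True, True, False),
        BuilderResult("example-optional", False, True, False),
        BuilderResult("example-grid-only", False, False, False),
        BuilderResult("example-green", True, True, True),
    ]
    print(classify_failures(sample))
```

In this model only the "gating" bucket would stop a push; the "hidden" bucket corresponds to the failures that never show up on the GitHub page at all.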
Allowing failures causes alert fatigue and more failures start to seep in.
Analogous to https://en.wikipedia.org/wiki/Broken_windows_theory one would tend to ignore any failures for a given platform, say, https://buildbot.mariadb.org/#/builders/588 (amd64-debian-12-asan-ubsan) always seems to fail, therefore I will ignore it. It might help if experimental builders were separated in the grid view. Based on the currently latest failure https://buildbot.mariadb.org/#/builders/588/builds/7718 this may be an issue with that particular builder. However, without a deeper investigation I would not claim so, because based on https://jira.mariadb.org/browse/MDEV-26272 and many related tickets I know that clang -fsanitize=undefined is much stricter than the corresponding GCC option.
Having a clear gating policy and requiring all tests to pass forces everyone to quickly agree on what tests actually should be in the CI to begin with, and make sure they are well maintained. With the policy in place, basically every developer is motivated to participate in maintaining tests as otherwise during a red mainline event no developer can do anything.
I agree. Marko -- Marko Mäkelä, Lead Developer InnoDB MariaDB plc
Hello Marko, Otto, Kristian, developers, I share Otto's experience that the changes are hard at first, but the rewards come quickly. Sporadic test failures get investigated and resolved, or sometimes split out into a non-gating supplemental test-suite. Developers find sensible solutions. As Marko illuminates, MariaDB server's development is unusual. That said, it is slowly shifting towards practices more typical of other FOSS codebases. It is clear that there will be challenges in implementing an "always be green" policy. With the challenges in mind, how can we get there? Cheers, -Eric
I completely agree, the end goal is definitely worth it. It will be a little difficult, because, as Marko points out, not everyone uses PRs. So, unless everyone uses the same methodology, with which the CI has voting rights for every commit (as with most other large open source projects), it will always be a tail to chase. That being said, the failing tests situation can be made a lot better. But, it requires someone to lead that effort. Kind Regards Andrew On 06/08/2024 07:09, Eric Herman via developers wrote:
Hello Marko, Otto, Kristian, developers,
I share Otto's experience that the changes are hard at first, but the rewards come quickly. Sporadic test failures get investigated and resolved, or sometimes split out into a non-gating supplemental test-suite. Developers find sensible solutions.
As Marko illuminates, MariaDB server's development is unusual. That said, it is slowly shifting towards practices more typical of other FOSS codebases.
It is clear that there will be challenges in implementing an "always be green" policy.
With the challenges in mind, how can we get there?
Cheers, -Eric
-- Andrew (LinuxJedi) Hutchings Chief Contributions Officer MariaDB Foundation
On Thu, 8 Aug 2024 at 04:54, Andrew Hutchings via developers <developers@lists.mariadb.org> wrote:
I completely agree, the end goal is definitely worth it.
It will be a little difficult, because, as Marko points out, not everyone uses PRs. So, unless everyone uses the same methodology, with which the CI has voting rights for every commit (as with most other large open source projects), it will always be a tail to chase.
Using protected branches in GitHub is a feature of the git receive hook. It does *not require using Pull Requests*. It just requires that a commit has been in GitHub on any branch and passed CI before the receive hook accepts it being pushed to the main branch.
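As a rough illustration of such a policy, the sketch below builds the request body for GitHub's "update branch protection" REST endpoint (`PUT /repos/{owner}/{repo}/branches/{branch}/protection`) with required status checks but no pull request requirement. It only constructs the URL and payload and sends nothing. The helper name is made up, and using a builder name as the status check context is an assumption; the contexts actually reported by the MariaDB builders may differ.

```python
import json
from typing import Any, Dict, Iterable, Tuple


def branch_protection_request(
    owner: str, repo: str, branch: str, required_contexts: Iterable[str]
) -> Tuple[str, Dict[str, Any]]:
    """Return the URL and JSON payload for a 'require green CI' policy.

    Hypothetical helper: it builds the request but does not send it.
    """
    url = f"https://api.github.com/repos/{owner}/{repo}/branches/{branch}/protection"
    payload: Dict[str, Any] = {
        "required_status_checks": {
            # False: the commit need not be rebased on the latest branch tip.
            "strict": False,
            # Status checks that must be green before a push is accepted.
            "contexts": sorted(required_contexts),
        },
        # True: the rules also apply to repository administrators.
        "enforce_admins": True,
        # None: no pull request review is required, so direct pushes still work.
        "required_pull_request_reviews": None,
        # None: no restriction on who may push.
        "restrictions": None,
    }
    return url, payload


if __name__ == "__main__":
    # The context name is an assumption used for illustration only.
    url, payload = branch_protection_request(
        "MariaDB", "server", "11.6", ["amd64-fedora-38-last-N-failed"]
    )
    print(url)
    print(json.dumps(payload, indent=2))
```

Leaving `required_pull_request_reviews` as `None` mirrors the point made above: the gate is on the commit's CI status, not on whether a pull request was used.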
Hi Otto, On 03/08/2024 06:29, Otto Kekäläinen via developers wrote:
When I now look at https://github.com/MariaDB/server/commits/11.6/ I see a red cross on all commits since July 8th (last green one was 44af9bf). Looking at the summary by Hashim about 10 consecutive CI runs on 25b5c63 at https://github.com/MariaDB/server/pull/3425 at least the job amd64-fedora-38-last-N-failed is always failing and it would not have found its way into the codebase if a protected branch policy would require that all commits must have a passing CI status before getting pushed/merged.
Discussion aside, this particular failure was fixed last week (and I think merged up this week). It was the unfortunate side-effect of a behaviour change in global status variables in 11.5. Kind Regards -- Andrew (LinuxJedi) Hutchings Chief Contributions Officer MariaDB Foundation
participants (7): Andrew Hutchings, Daniel Black, Eric Herman, Kristian Nielsen, Marko Mäkelä, Otto Kekäläinen, Sergei Golubchik