#9523 Nightly test failure (389ds/rawhide) in test_backup_and_restore_TestBackupAndRestoreWithReplica
Closed: fixed a year ago by frenaud. Opened a year ago by frenaud.

The nightly test test_backup_and_restore_TestBackupAndRestoreWithReplica is failing when executed with the copr repo @389ds/389-ds-base-nightly.
This copr repo contains a build of 389-ds-base main branch.

Example of test failure: PR #3346 with the following logs and report.

Test scenario:
- install a server with CA, DNS
- install a replica
- backup server
- simulate master crash: stop replica, uninstall master, start replica
- call ipa-restore on the master with the backup done above
- reinitialize replica
- install CA on the replica
- install a 2nd replica with CA, using master

The installation of the 2nd replica fails during the CA installation ipareplica-install.log:

SEVERE: Unable to generate certificate: Request 4 - Server Internal Error
java.io.IOException: Unable to generate certificate: Request 4 - Server Internal Error
    at com.netscape.certsrv.ca.CACertClient.submitRequest(CACertClient.java:260)
    at com.netscape.cmstools.ca.CACertRequestSubmitCLI.execute(CACertRequestSubmitCLI.java:315)
    at org.dogtagpki.cli.CommandCLI.execute(CommandCLI.java:58)
    at org.dogtagpki.cli.CLI.execute(CLI.java:353)
    at org.dogtagpki.cli.CLI.execute(CLI.java:353)
    at org.dogtagpki.cli.CLI.execute(CLI.java:353)
    at com.netscape.cmstools.cli.SubsystemCLI.execute(SubsystemCLI.java:79)
    at org.dogtagpki.cli.CLI.execute(CLI.java:353)
    at com.netscape.cmstools.cli.MainCLI.execute(MainCLI.java:659)
    at com.netscape.cmstools.cli.MainCLI.main(MainCLI.java:698)
ERROR: CalledProcessError: Command '['pki', '-d', '/etc/pki/pki-tomcat/alias', '-f', '/etc/pki/pki-tomcat/password.conf', '-U', 'https://master.ipa.test:443', '--ignore-banner', 'ca-cert-request-submit', '--request-type', 'pkcs10', '--csr-file', '/tmp/tmpvk3o0rt0/request.csr', '--profile', 'caInternalAuthServerCert', '--subject', 'cn=replica1.ipa.test,O=IPA.TEST', '--requestor', 'CA-replica1.ipa.test-8443', '--install-token', '/tmp/tmpvk3o0rt0/install-token', '--output-format', 'PEM', '--debug']' returned non-zero exit status 255.
  File "/usr/lib/python3.12/site-packages/pki/server/pkispawn.py", line 588, in main
    scriptlet.spawn(deployer)
  File "/usr/lib/python3.12/site-packages/pki/server/deployment/scriptlets/configuration.py", line 248, in spawn
    system_certs = deployer.setup_system_certs(nssdb, subsystem)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/site-packages/pki/server/deployment/__init__.py", line 2279, in setup_system_certs
    self.setup_system_cert(nssdb, subsystem, tag, system_cert, request)
  File "/usr/lib/python3.12/site-packages/pki/server/deployment/__init__.py", line 2198, in setup_system_cert
    pem_cert = self.request_cert(
               ^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/site-packages/pki/server/deployment/__init__.py", line 2405, in request_cert
    result = subprocess.run(cmd, stdout=subprocess.PIPE, check=True)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.12/subprocess.py", line 571, in run
    raise CalledProcessError(retcode, process.args,

The certificate server on the master fails to issue a new cert ca/debug.log:

2024-01-28 10:09:24 [ajp-nio-0:0:0:0:0:0:0:1-8009-exec-8] INFO: LDAPSession: Adding cn=4,ou=ca, ou=requests,o=ipaca
2024-01-28 10:09:24 [ajp-nio-0:0:0:0:0:0:0:1-8009-exec-8] SEVERE: RequestRepository: Unable to add LDAP entry: Already exists
Unable to add LDAP entry: Already exists

It looks like PKI is trying to reuse an existing request id.
PKI is using VLV searches to find the last request id / last serial id already used in the range and it seems that the VLV search does not return the right value.

PKI is currently working on the replacement of VLV searches with PagedResults: #4265 Replace VLV with Paged Results


There is another test which is failing and may be related to the VLV issue: test_replica_promotion_TestHiddenReplicaPromotion

The test is installing a server + a replica with CA and calling ipa-healthcheck. One of the checks is ensuring that PKI server is answering to the REST api ca/rest/certs/search?size=3 but only 2 certs are returned. The PKI server is internally making a VLV search.

With Berkeley DB we see the following logs in 389ds:

[27/Jan/2024:01:40:25.211920918 +0000] conn=8 op=9 SRCH base="ou=certificateRepository,ou=ca,o=ipaca" scope=1 filter="(certStatus=*)" attrs=ALL
[27/Jan/2024:01:40:25.213235651 +0000] conn=8 op=9 SORT serialno 
[27/Jan/2024:01:40:25.213926712 +0000] conn=8 op=9 VLV 0:3:A 18:18 (0)
[27/Jan/2024:01:40:25.214740494 +0000] conn=8 op=9 RESULT err=0 tag=101 nentries=1 wtime=0.000139212 optime=0.002822135 etime=0.002959065
[27/Jan/2024:01:40:25.222262424 +0000] conn=8 op=10 SRCH base="ou=certificateRepository,ou=ca,o=ipaca" scope=1 filter="(certStatus=*)" attrs=ALL
[27/Jan/2024:01:40:25.223354816 +0000] conn=8 op=10 SORT serialno 
[27/Jan/2024:01:40:25.224056575 +0000] conn=8 op=10 VLV 0:3:0:18 1:18 (0)
[27/Jan/2024:01:40:25.225488354 +0000] conn=8 op=10 RESULT err=0 tag=101 nentries=4 wtime=0.006663007 optime=0.003229228 etime=0.009889556

but with LMDB we see:

[28/Jan/2024:10:07:29.137423756 +0000] conn=7 op=9 SRCH base="ou=certificateRepository,ou=ca,o=ipaca" scope=1 filter="(certStatus=*)" attrs=ALL
[28/Jan/2024:10:07:29.138724095 +0000] conn=7 op=9 SORT serialno 
[28/Jan/2024:10:07:29.139525109 +0000] conn=7 op=9 VLV 0:3:A 2:2 (0)
[28/Jan/2024:10:07:29.140347124 +0000] conn=7 op=9 RESULT err=0 tag=101 nentries=1 wtime=0.000114099 optime=0.002925241 etime=0.003037443
[28/Jan/2024:10:07:29.144040784 +0000] conn=7 op=10 SRCH base="ou=certificateRepository,ou=ca,o=ipaca" scope=1 filter="(certStatus=*)" attrs=ALL
[28/Jan/2024:10:07:29.144732331 +0000] conn=7 op=10 SORT serialno 
[28/Jan/2024:10:07:29.145331170 +0000] conn=7 op=10 VLV 0:3:0:2 1:2 (0)
[28/Jan/2024:10:07:29.146374255 +0000] conn=7 op=10 RESULT err=0 tag=101 nentries=2 wtime=0.000092628 optime=0.002334183 etime=0.002424951

Logs can be seen in this repo.

After setting ipa server + ipa replica with CA,
dbscan -f "/var/lib/dirsrv/slapd-IPA-TEST/db//ipaca/vlv#allcertspkitomcatindex.db" and
dbscan -f "/var/lib/dirsrv/slapd-IPA-TEST/db//ipaca/vlv#allnonrevokedcertspkitomcatindex.db"
differs on the replica and the server (only the 2 last keys are present on the replica)

And after:
dsconf IPA-TEST backend export -r ipaca -l /var/lib/dirsrv/slapd-IPA-TEST/ldif/ipaca.ldif
dsconf IPA-TEST backend import --help # ipaca -l /var/lib/dirsrv/slapd-IPA-TEST/ldif/ipaca.ldif
There in no keys ion the vlv indexes
Still no keys after running on ipaserver:
dsconf IPA-TEST repl-agmt init --suffix o=ipaca caToipareplica.ipa.test
And still no keys after:
dsconf IPA-TEST backend vlv-index reindex --parent="allCerts-pki-tomcat" ipaca
Index task index_vlv_02122024_151813 completed successfully
Successfully reindexed VLV indexes

Something looks seriously broken in the import pipeline ...

FYI: Last Thursday I fixed https://github.com/389ds/389-ds-base/issues/6057 (several issues about vlv index when reindexing/importing/(including the replication bulk import ...)
that fixed the healthcheck issue on master + replica with CA test
But apparently in this weekend test we still see some problems about VLVs ...
Typically test_backup_and_restore_TestBackupReinstallRestoreWithKRA: issue in ipa cert-find after restore in http://freeipa-org-pr-ci.s3-website.eu-central-1.amazonaws.com/jobs/e754a7a2-d351-11ee-be09-fa163e635ad1/

The test test_ipahealthcheck.py::TestIpaHealthCheck::test_source_pki_server_clones_connectivity_and_data is also failing in rawhide. Test scenario: install master + replica with CA instance, run ipa-healthcheck --source pki.server.healthcheck.clones.connectivity_and_data --check ClonesConnectivyAndDataCheck:

Internal server error CA clone problem reading data. Host: replica.ipatest.test Port: 443
[
  {
    "source": "pki.server.healthcheck.clones.connectivity_and_data",
    "check": "ClonesConnectivyAndDataCheck",
    "result": "ERROR",
    "uuid": "8e234deb-f59d-43eb-b279-388fe4c9c5bb",
    "when": "20240403153900Z",
    "duration": "0.061090",
    "kw": {
      "status": "ERROR:  pki-tomcat : Internal error testing CA clone. Host: replica.ipatest.test Port: 443"
    }
  }
]

The issue happens because the check is retrieving the certificates from LDAP using a VLV ext and obtains only 2 results while 3 are expected. This can be seen in 389ds logs on the replica:

[03/Apr/2024:11:39:00.888097557 -0400] conn=22 op=42 SRCH base="ou=certificateRepository,ou=ca,o=ipaca" scope=1 filter="(certStatus=*)" attrs=ALL
[03/Apr/2024:11:39:00.888347105 -0400] conn=22 op=42 SORT serialno 
[03/Apr/2024:11:39:00.888354479 -0400] conn=22 op=42 VLV 0:3:A 2:2 (0)
[03/Apr/2024:11:39:00.888478918 -0400] conn=22 op=42 RESULT err=0 tag=101 nentries=1 wtime=0.000139867 optime=0.000383293 etime=0.000522043
[03/Apr/2024:11:39:00.890132719 -0400] conn=22 op=43 SRCH base="ou=certificateRepository,ou=ca,o=ipaca" scope=1 filter="(certStatus=*)" attrs=ALL
[03/Apr/2024:11:39:00.890276647 -0400] conn=22 op=43 SORT serialno 
[03/Apr/2024:11:39:00.890282696 -0400] conn=22 op=43 VLV 0:3:0:2 1:2 (0)
[03/Apr/2024:11:39:00.890432893 -0400] conn=22 op=43 RESULT err=0 tag=101 nentries=2 wtime=0.001636593 optime=0.000301123 etime=0.001936703

FYI: There is still a known issue about vlv whose fix is waiting for upstream review
it should fix test_backup_and_restore_TestBackupAndRestoreWithKRA
review but what worries me is that the symptoms are different.
(vlv index with one level scope is empty after reindex/import so vlv search returns no entry at all)
That said, it is possible that the two entries are created after the last import and that would explain that only these data are returned.
So let see if https://github.com/389ds/389-ds-base/pull/6137 will fix the problem but I would not be surprised if there is yet another issue in vlv area. Looks like never ending story ... Sigh !

As I was afraid, there was another issue: https://github.com/389ds/389-ds-base/issues/6141, but luckily, the troubleshooting and the fix were simple,
now the freeipa tests are passing.
FYI we plan to do very shortly a rawhide build and an f40 update including all the vlv issues.

Scenario backup-uninstall-restore-install-replica-with-CA tested on f40 with https://bodhi.fedoraproject.org/updates/FEDORA-2024-eca2c959b1, works well.
Also tested on rawhide with https://bodhi.fedoraproject.org/updates/FEDORA-2024-ad349ed2cc, works well.

Closing as fixed.

Metadata Update from @frenaud:
- Issue close_status updated to: fixed
- Issue status updated to: Closed (was: Open)

a year ago

Log in to comment on this ticket.

Metadata