https://bugzilla.redhat.com/show_bug.cgi?id=813964 (Red Hat Enterprise Linux 6)
***Test Type: IPA System Longevity Test

***Issue:
I'm running an IPA longevity test against a RHEL 6.3 nightly build. The IPA master directory server generated a seg fault, resulting in IPA client test load failures after approximately 81 hours of runtime.

***Actual results:
1. The IPA master directory server generated a seg fault.
2. After an IPA restart, the systems were left in a state where users were not in sync between master and slave when searching for users with 'ipa user-find lname'. The IPA client also became unable to run ipa commands.
3. At light vuser thread loads the admin throughput was easily saturated.
4. No fail-over to the slave occurred, though I am not sure how fail-over is intended to work.

***Expected results:
Expected IPA to continue executing the virtual test client admin and authn load for an extended period of time, at a level of throughput consistent with the user population in the directory server.

****Server Symptoms:

IPA Master:
Status indicating Directory Server had STOPPED
kinit admin - failure getting initial credentials
ipa user-find - failure
UI - https://sti-high-1.testrelm.com/ipa/ui/ - Server Not Found (Not Accessible)
No core files available
No abrt crashes detected
Nothing under /var/spool/abrt other than abrt-db

IPA Slave:
Status indicating all is RUNNING
kinit admin - working
ipa user-find - functional
UI - https://sti-high-2.testrelm.com/ipa/ui/ - Server Not Found (Not Accessible)

IPA Client:
kinit admin - failure
ipa user-find - failures
UI - https://sti-high-X.testrelm.com/ipa/ui/ - Server Not Found (Not Accessible)
IPA admin test clients began failing load
IPA authn test clients began failing load

For this run I increased the load from 5 to 10 virtual user threads for both the authn and ipa admin use cases (a light load). Increasing the admin load, however, never increased transaction throughput; it hit a saturation level instead. The systems running the test environment are high-end machines, listed below. After the failure and restart of the IPA servers, the systems were left in a strange state: the IPA master and slave user data were not in sync, and my IPA client had issues with basic kinit / ipa commands.

***Next Steps:
To reproduce the issue and get the test environment back to a known state, I re-provisioned the test environment, built a 1k user population, and enabled debugging on the systems. No adjustments were made out of the box after installation to accommodate any performance issues. I am running the load again; so far so good at 24 hours. I have been conversing with dev (Rich Megginson) on the issues at hand.

***Repeatability:
With the released version of IPA on RHEL 6.2 I had successfully caused seg faults. Those issues had been seen before, and defects were written against them by others. Those defects have been resolved as far as I know, so the intent now was to test specifically against the RHEL 6.3 IPA nightly. This seg fault issue has happened once so far with this version (this defect). The current plan is to enable debugging and rerun the tests once more to collect core files for dev to use in debugging the issue.

****System Test Env: IPA Master, Slave and Client
Red Hat Enterprise Linux Server release 6.3 Beta (Santiago)

Component:
rpm -qif ipa
Name         : ipa-server                      Relocations: (not relocatable)
Version      : 2.2.0                           Vendor: Red Hat, Inc.
Release      : 9.el6                           Build Date: Tue 10 Apr 2012 08:39:54 PM EDT
Install Date : Tue 17 Apr 2012 11:02:23 AM EDT  Build Host: hs20-bc2-5.build.redhat.com
Group        : System Environment/Base         Source RPM: ipa-2.2.0-9.el6.src.rpm
Size         : 3771583                         License: GPLv3+

rpm -qi 389-ds-base
Name         : 389-ds-base                     Relocations: (not relocatable)
Version      : 1.2.10.2                        Vendor: Red Hat, Inc.
Release      : 6.el6                           Build Date: Tue 10 Apr 2012 04:31:17 PM EDT
Install Date : Tue 17 Apr 2012 11:02:23 AM EDT  Build Host: hs20-bc2-5.build.redhat.com
Group        : System Environment/Daemons      Source RPM: 389-ds-base-1.2.10.2-6.el6.src.rpm
Size         : 4850666

Steps to Reproduce:
1. Provision IPA nightly RHEL 6.3 Master
2. Provision IPA nightly RHEL 6.3 Slave
3. Provision IPA nightly RHEL 6.3 Client
4. Apply and run kerb authn and ipa admin load through STI to collect system test data and drive the tests at defined schedules
5. Increase load to 10 vusers for ipa admin
6. Increase load to 10 vusers for authn

***Additional info:

****Longevity Test Failure:
Run Identifier: run1-J216284
Start Date: 2012-04-11 15:49:00
Failure Date: 2012-04-16 01:50:00

****Test load:
Test failures after 81 hours of load. Both the Kerberos authn and ipa administrative load are now failing. The IPA master indicates the directory server is in a stopped state. The system logs show:
"ns-slapd[14134]: segfault at 7fac485d40cb ip 00007fab6f83f93d sp 00007fab485d4010 error 4 in libback-ldbm.so"

****Logs:

/var/log/messages snip:
Apr 16 00:42:01 sti-high-1 httpd: GSSAPI Error: Unspecified GSS failure. Minor code may provide more information (Request is a replay)
Apr 16 00:42:07 sti-high-1 logger: 2012-04-16 00:42:07 /usr/bin/rhts-test-runner.sh 1210506 400080 hearbeat...
Apr 16 00:46:22 sti-high-1 httpd: GSSAPI Error: Unspecified GSS failure. Minor code may provide more information (Request is a replay)
Apr 16 00:48:08 sti-high-1 logger: 2012-04-16 00:48:07 /usr/bin/rhts-test-runner.sh 1210506 400440 hearbeat...
Apr 16 00:48:50 sti-high-1 kernel: ns-slapd[14134]: segfault at 7fac485d40cb ip 00007fab6f83f93d sp 00007fab485d4010 error 4 in libback-ldbm.so[7fab6f80f000+8e000]
Apr 16 00:48:50 sti-high-1 httpd: GSSAPI Error: Unspecified GSS failure. Minor code may provide more information (KDC returned error string: PROCESS_TGS)
Apr 16 00:48:50 sti-high-1 named[16591]: LDAP error: Can't contact LDAP server
Apr 16 00:48:50 sti-high-1 named[16591]: connection to the LDAP server was lost
Apr 16 00:48:50 sti-high-1 named[16591]: bind to LDAP server failed: Can't contact LDAP server

/var/log/DirSvr/slapd-TestRelm-COM/errors snip:
[13/Apr/2012:13:26:36 -0400] slapd_ldap_sasl_interactive_bind - Error: could not perform interactive bind for id [] mech [GSSAPI]: LDAP error -2 (Local error) (SASL(-1): generic failure: GSSAPI Error: An invalid name was supplied (Hostname cannot be canonicalized)) errno 110 (Connection timed out)
[13/Apr/2012:13:26:36 -0400] slapi_ldap_bind - Error: could not perform interactive bind for id [] mech [GSSAPI]: error -2 (Local error)
[13/Apr/2012:13:26:36 -0400] NSMMReplicationPlugin - agmt="cn=meTosti-high-2.testrelm.com" (sti-high-2:389): Replication bind with GSSAPI auth failed: LDAP error -2 (Local error) (SASL(-1): generic failure: GSSAPI Error: An invalid name was supplied (Hostname cannot be canonicalized))
[13/Apr/2012:13:26:40 -0400] NSMMReplicationPlugin - agmt="cn=meTosti-high-2.testrelm.com" (sti-high-2:389): Replication bind with GSSAPI auth resumed
[15/Apr/2012:07:33:33 -0400] entryrdn-index - _entryrdn_put_data: Adding the parent link (P28354) failed: DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock (-30994)
[16/Apr/2012:00:45:46 -0400] ipapwd_setPasswordHistory - [file ipapwd_common.c, line 926]: failed to generate new password history!
[16/Apr/2012:00:46:53 -0400] ipapwd_setPasswordHistory - [file ipapwd_common.c, line 926]: failed to generate new password history!
[16/Apr/2012:00:47:56 -0400] ipapwd_setPasswordHistory - [file ipapwd_common.c, line 926]: failed to generate new password history!

****Test Load:
IPA Admin Use Case:
Test 1:
- Positive test scenario
- Reached 10 virtual users
- ipa use case - (find, delete), add, find, disable, enable, modify, then delete
- 1 sec delay per thread
- Total users cycling in test is 30

Kerberos Authn:
Test 1:
- Positive test scenario
- Reached 10 virtual users
- 1 sec delay per thread
- Total users in test is 1000

****Beaker Provisioned:
Job J216284

****Hardware in System Test Environment:
Dell PowerEdge M710 Blade Servers, 2 socket, 8 core, 16 processors

sti-high-1.testrelm.com 10.16.24.27 IPA Master - Dell PowerEdge M710 Blade Server, 2 Socket, 8 Core, 16 processors, 16GB RAM, x86-64
Filesystem                       Size  Used Avail Use% Mounted on
/dev/mapper/vg_stihigh1-lv_root   50G  5.1G   42G  11% /
tmpfs                            7.8G  288K  7.8G   1% /dev/shm
/dev/sda1                        485M   37M  423M   8% /boot
/dev/mapper/vg_stihigh1-lv_home  1.8T  196M  1.7T   1% /home

sti-high-2.testrelm.com 10.16.24.29 IPA Slave - Dell PowerEdge M710 Blade Server, 2 Socket, 8 Core, 16 processors, 16GB RAM, x86-64
Filesystem                       Size  Used Avail Use% Mounted on
/dev/mapper/vg_stihigh2-lv_root   50G  4.5G   43G  10% /
tmpfs                            7.8G   61M  7.8G   1% /dev/shm
/dev/sda1                        485M   37M  423M   8% /boot
/dev/mapper/vg_stihigh2-lv_home  1.8T  196M  1.7T   1% /home

sti-high-3.testrelm.com 10.16.24.31 IPA Client - Dell PowerEdge M710 Blade Server, 2 Socket, 8 Core, 16 processors, 16GB RAM, x86-64
Filesystem                       Size  Used Avail Use% Mounted on
/dev/mapper/vg_stihigh3-lv_root   50G  3.3G   44G   7% /
tmpfs                            7.8G  260K  7.8G   1% /dev/shm
/dev/sda1                        485M   37M  423M   8% /boot
/dev/mapper/vg_stihigh3-lv_home  1.8T  197M  1.7T   1% /home
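For context on the DB_LOCK_DEADLOCK entry in the errors log above: Berkeley DB returns that code when a transaction is chosen as the victim of a deadlock, and the caller is expected to abort the transaction and retry the operation. Below is a minimal sketch of that generic retry pattern; it is illustrative only, is not the actual _entryrdn_put_data code, and the function name put_with_retry and the retry limit are invented for the example.

```c
/* Minimal sketch of the standard Berkeley DB deadlock-retry pattern.
 * NOT the actual _entryrdn_put_data code; names are invented. */
#include <db.h>

static int put_with_retry(DB_ENV *env, DB *db, DBT *key, DBT *data, int max_retries)
{
    int rc = 0;

    for (int attempt = 0; attempt < max_retries; attempt++) {
        DB_TXN *txn = NULL;

        if ((rc = env->txn_begin(env, NULL, &txn, 0)) != 0)
            return rc;                  /* could not start a transaction */

        rc = db->put(db, txn, key, data, 0);
        if (rc == 0)
            return txn->commit(txn, 0); /* success */

        /* On a deadlock, abort the transaction and try again from scratch;
         * any other error is passed straight back to the caller. */
        txn->abort(txn);
        if (rc != DB_LOCK_DEADLOCK)
            return rc;
    }
    return rc;                          /* still deadlocking after max_retries */
}
```

The follow-up comment below deals with a related consequence of this kind of retry: duplicate records coming back from a restarted index scan.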
Now I'm getting duplicates - seems that doing the retry upon the -1 index gets the last record again as the first record of the next batch - going to have to screen out dups
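A rough illustration of the duplicate problem described above, assuming a Berkeley DB cursor scan that restarts after an error by re-positioning on the last key already processed: the re-positioned cursor returns that key again as the first record of the next batch, so it has to be screened out. The function name scan_with_dedup and the process callback are invented for the sketch; this is not the actual 389-ds-base fix.

```c
/* Illustrative sketch only -- not the actual 389-ds-base fix. */
#include <db.h>
#include <stdlib.h>
#include <string.h>

static int scan_with_dedup(DB *db, int (*process)(DBT *key, DBT *data))
{
    DBC *cursor = NULL;
    DBT key = {0}, data = {0};
    char *last = NULL;            /* copy of the last key we processed */
    u_int32_t last_len = 0;
    int rc;

retry:
    if (cursor) {
        cursor->c_close(cursor);
        cursor = NULL;
    }
    if ((rc = db->cursor(db, NULL, &cursor, 0)) != 0)
        goto done;

    if (last) {
        /* Re-position at the last processed key after a restart. */
        key.data = last;
        key.size = last_len;
        rc = cursor->c_get(cursor, &key, &data, DB_SET_RANGE);
    } else {
        rc = cursor->c_get(cursor, &key, &data, DB_FIRST);
    }

    while (rc == 0) {
        /* Screen out the duplicate returned by the re-positioned cursor. */
        if (!(last && key.size == last_len &&
              memcmp(key.data, last, last_len) == 0)) {
            process(&key, &data);
            free(last);
            last = malloc(key.size);
            memcpy(last, key.data, key.size);
            last_len = key.size;
        }
        rc = cursor->c_get(cursor, &key, &data, DB_NEXT);
        if (rc == DB_LOCK_DEADLOCK)
            goto retry;           /* restart the scan from the last key */
    }

done:
    if (cursor)
        cursor->c_close(cursor);
    free(last);
    return (rc == DB_NOTFOUND) ? 0 : rc;
}
```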
0001-Ticket-347-IPA-dirsvr-seg-fault-during-system-longev.patch
To ssh://git.fedorahosted.org/git/389/ds.git
   e689473..0cc661f  master -> master
commit changeset:0cc661f/389-ds-base
Author: Rich Megginson <rmeggins@redhat.com>
Date: Fri Apr 20 20:15:21 2012 -0600
To ssh://git.fedorahosted.org/git/389/ds.git
   987edff..5d45dd8  389-ds-base-1.2.10 -> 389-ds-base-1.2.10
commit changeset:5d45dd8/389-ds-base
Author: Rich Megginson <rmeggins@redhat.com>
Date: Fri Apr 20 20:15:21 2012 -0600
Added initial screened field value.
Metadata Update from @rmeggins: - Issue assigned to rmeggins - Issue set to the milestone: 1.2.10.7
389-ds-base is moving from Pagure to GitHub. This means that new issues and pull requests will be accepted only in 389-ds-base's GitHub repository.

This issue has been cloned to GitHub and is available here:
- https://github.com/389ds/389-ds-base/issues/347

If you want to receive further updates on the issue, please navigate to the GitHub issue and click on the Subscribe button.
Thank you for understanding. We apologize for any inconvenience.
Metadata Update from @spichugi: - Issue close_status updated to: wontfix (was: Fixed)