#347 IPA dirsvr seg-fault during system longevity test
Closed: Fixed None Opened 7 years ago by rmeggins.

https://bugzilla.redhat.com/show_bug.cgi?id=813964 (Red Hat Enterprise Linux 6)

***Test Type:
IPA System Longevity Test

***Issue:
I'm running a Rhel6.3 nightly build IPA longevity test. The IPA Master dirsvr
generated a seg-fault, resulting in IPA client test load failures after
approximately 81 hours of runtime.

***Actual results:
1. The IPA Master DirSvr generated a seg-fault.
2. After an IPA restart the systems were left in a state where users were not in
sync between master and slave when searching for users using 'ipa user-find lname'
commands.  The IPA client also became unable to run ipa commands.
3. At light vuser thread loads the admin throughput was easily saturated.
4. No fail-over to the Slave occurred, though I'm not sure how fail-over is
implemented to work.

***Expected results:
Expected IPA to continue executing the virtual test client admin and authn load
for the extended period of time, at a certain level of throughput based on the
user population in the DirSvr.

****Servers Symptoms:
IPA Master:
        Status indicating Directory Server had STOPPED
        kinit admin - failure getting initial credentials
        ipa user-find  - failure
        UI - https://sti-high-1.testrelm.com/ipa/ui/ - Server Not Found (Not Accessible)
        No core files available
        No abrt crashes detected
        Nothing under /var/spool/abrt other than abrt-db

IPA Slave:
        Status indicating all is RUNNING
        kinit admin - working
        ipa user-find  - functional
        UI - https://sti-high-2.testrelm.com/ipa/ui/ - Server Not Found (Not Accessible)

IPA Client:
        kinit admin - failure
        ipa user-find  - failures
        UI - https://sti-high-X.testrelm.com/ipa/ui/ - Server Not Found (Not Accessible)
        IPA admin test clients began failing load
        IPA authn test clients began failing load


For the run, I increased the load from 5 to 10 virtual user threads for both
authn and ipa admin use cases (light load).  Increasing the admin load, however,
never increased the transaction throughput, which had hit a saturation level.  The
systems running the test environment are high end machines listed below.  After
the failure and restart of the IPA servers, the systems were left in a strange
state where the IPA master and slave users were not in sync and my IPA client
had issues with basic kinit / ipa commands.

***Next Steps:
To reproduce the issue and get the test environment back to a known state, I
re-provisioned the test environment, built a 1k user population and enabled
debugging on the systems.  No out-of-the-box settings were adjusted after the
system was installed to accommodate any performance issues.  I'm running the
load again; so far so good at 24hrs.  I have been conversing with dev, Rich
Megginson, on the issues at hand.

***Repeatability:
With the released version of IPA on Rhel6.2 I had successfully caused seg
faults.  These issues had been seen before and others had written defects
against them.  Those defects had been resolved as far as I know, so the intent
now was to test against the Rhel6.3 IPA nightly specifically.  This seg fault
has happened once so far with this version (this defect).  The plan now is to
enable debug and rerun the tests to collect core files so dev can debug the
issue.

****System Test Env:
IPA Master, Slave and Client
Red Hat Enterprise Linux Server release 6.3 Beta (Santiago)

Component:
rpm -qif ipa
Name        : ipa-server                   Relocations: (not relocatable)
Version     : 2.2.0                             Vendor: Red Hat, Inc.
Release     : 9.el6                         Build Date: Tue 10 Apr 2012 08:39:54 PM EDT
Install Date: Tue 17 Apr 2012 11:02:23 AM EDT      Build Host: hs20-bc2-5.build.redhat.com
Group       : System Environment/Base       Source RPM: ipa-2.2.0-9.el6.src.rpm
Size        : 3771583                          License: GPLv3+

rpm -qi 389-ds-base
Name        : 389-ds-base                  Relocations: (not relocatable)
Version     : 1.2.10.2                          Vendor: Red Hat, Inc.
Release     : 6.el6                         Build Date: Tue 10 Apr 2012 04:31:17 PM EDT
Install Date: Tue 17 Apr 2012 11:02:23 AM EDT      Build Host: hs20-bc2-5.build.redhat.com
Group       : System Environment/Daemons    Source RPM: 389-ds-base-1.2.10.2-6.el6.src.rpm
Size        : 4850666

Steps to Reproduce:
1. Provision IPA Nightly Rhel6.3 Master
2. Provision IPA Nightly Rhel6.3 Slave
3. Provision IPA Nightly Rhel6.3 Client
4. Apply and run kerb authn and ipa admin load through STI to collect system
test data and drive the tests at defined schedules
5. Increase load to 10 vusers for ipa admin
6. Increase load to 10 vusers for authn

***Additional info:
****Longevity Test Failure:
Run Identifier: run1-J216284
Start Date:         2012-04-11 15:49:00
Failure Date:        2012-04-16 01:50:00

****Test load:
Test failures after 81 hours of load. Both Kerberos authn and ipa
administrative load are now failing. The IPA Master indicates the Dir Server is
in a Stopped state. System logs indicate "ns-slapd[14134]: segfault at 7fac485d40cb
ip 00007fab6f83f93d sp 00007fab485d4010 error 4 in libback-ldbm.s"

****Logs:
/var/log/messages snip
Apr 16 00:42:01 sti-high-1 httpd: GSSAPI Error: Unspecified GSS failure.  Minor code may provide more information (Request is a replay)
Apr 16 00:42:07 sti-high-1 logger: 2012-04-16 00:42:07 /usr/bin/rhts-test-runner.sh 1210506 400080 hearbeat...
Apr 16 00:46:22 sti-high-1 httpd: GSSAPI Error: Unspecified GSS failure.  Minor code may provide more information (Request is a replay)
Apr 16 00:48:08 sti-high-1 logger: 2012-04-16 00:48:07 /usr/bin/rhts-test-runner.sh 1210506 400440 hearbeat...
Apr 16 00:48:50 sti-high-1 kernel: ns-slapd[14134]: segfault at 7fac485d40cb ip 00007fab6f83f93d sp 00007fab485d4010 error 4 in libback-ldbm.so[7fab6f80f000+8e000]
Apr 16 00:48:50 sti-high-1 httpd: GSSAPI Error: Unspecified GSS failure.  Minor code may provide more information (KDC returned error string: PROCESS_TGS)
Apr 16 00:48:50 sti-high-1 named[16591]: LDAP error: Can't contact LDAP server
Apr 16 00:48:50 sti-high-1 named[16591]: connection to the LDAP server was lost
Apr 16 00:48:50 sti-high-1 named[16591]: bind to LDAP server failed: Can't contact LDAP server
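
When triaging a kernel segfault line like the one above without a core file, it can help to turn the instruction pointer into an offset inside the shared library and to decode the page-fault error code. The following is a minimal sketch, assuming the standard x86 page-fault error bits (bit 0 = page present, bit 1 = write, bit 2 = user mode); the function name `parse_segfault` and the regular expression are illustrative, not part of any 389-ds tooling.

```python
import re

# Kernel segfault line copied from the /var/log/messages excerpt in this ticket.
LINE = ("ns-slapd[14134]: segfault at 7fac485d40cb ip 00007fab6f83f93d "
        "sp 00007fab485d4010 error 4 in libback-ldbm.so[7fab6f80f000+8e000]")

# Illustrative pattern for the "segfault at ... ip ... sp ... error N in lib[base+size]" format.
PATTERN = re.compile(
    r"(?P<proc>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<err>\d+) "
    r"in (?P<lib>\S+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def parse_segfault(line):
    """Extract the library-relative offset and decoded error bits from a segfault line."""
    m = PATTERN.search(line)
    if m is None:
        raise ValueError("not a kernel segfault line")
    ip = int(m["ip"], 16)
    base = int(m["base"], 16)
    size = int(m["size"], 16)
    err = int(m["err"])
    return {
        "process": m["proc"],
        "pid": int(m["pid"]),
        "library": m["lib"],
        "offset": ip - base,                   # offset of the faulting instruction in the library
        "in_mapping": 0 <= ip - base < size,   # sanity check against the mapping size
        "page_present": bool(err & 1),         # False: the page was not mapped
        "write": bool(err & 2),                # False: the access was a read
        "user_mode": bool(err & 4),            # True: the fault happened in user space
    }


info = parse_segfault(LINE)
print(f"{info['library']} + {info['offset']:#x}")  # prints "libback-ldbm.so + 0x3093d"
```

For this line the result is a user-mode read of an unmapped page at offset 0x3093d in libback-ldbm.so. With matching debuginfo installed, that offset could be passed to a tool such as `addr2line -e` against the same build of the library to find the source line.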

/var/log/DirSvr/slapd-TestRelm-COM/errors snip
[13/Apr/2012:13:26:36 -0400] slapd_ldap_sasl_interactive_bind - Error: could not perform interactive bind for id [] mech [GSSAPI]: LDAP error -2 (Local error) (SASL(-1): generic failure: GSSAPI Error: An invalid name was supplied (Hostname cannot be canonicalized)) errno 110 (Connection timed out)
[13/Apr/2012:13:26:36 -0400] slapi_ldap_bind - Error: could not perform interactive bind for id [] mech [GSSAPI]: error -2 (Local error)
[13/Apr/2012:13:26:36 -0400] NSMMReplicationPlugin - agmt="cn=meTosti-high-2.testrelm.com" (sti-high-2:389): Replication bind with GSSAPI auth failed: LDAP error -2 (Local error) (SASL(-1): generic failure: GSSAPI Error: An invalid name was supplied (Hostname cannot be canonicalized))
[13/Apr/2012:13:26:40 -0400] NSMMReplicationPlugin - agmt="cn=meTosti-high-2.testrelm.com" (sti-high-2:389): Replication bind with GSSAPI auth resumed
[15/Apr/2012:07:33:33 -0400] entryrdn-index - _entryrdn_put_data: Adding the parent link (P28354) failed: DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock (-30994)
[16/Apr/2012:00:45:46 -0400] ipapwd_setPasswordHistory - [file ipapwd_common.c, line 926]: failed to generate new password history!
[16/Apr/2012:00:46:53 -0400] ipapwd_setPasswordHistory - [file ipapwd_common.c, line 926]: failed to generate new password history!
[16/Apr/2012:00:47:56 -0400] ipapwd_setPasswordHistory - [file ipapwd_common.c, line 926]: failed to generate new password history!

****Test Load:

IPA Admin- Use Case:
Test 1:
- Positive Test Scenario
- Reached 10 Virtual Users
- Ipa usecase - (find, delete), add, find, disable, enable, modify, then delete
- 1 sec delay per thread
- Total users cycling in test is 30
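
The admin use case above can be sketched as a sequence of `ipa` CLI calls per user. This is only an illustration of the cycle, not the STI harness used in the test (which is not shown in this ticket). The names `admin_cycle` and `run_cycle`, the `--first`/`--last`/`--title` values, the reading of "(find, delete)" as a cleanup of leftovers, and the placement of the 1 sec delay between commands are all assumptions.

```python
import subprocess
import time


def admin_cycle(login):
    """Return the ipa CLI commands for one pass of the admin use case."""
    return [
        ["ipa", "user-find", login],        # (find, delete): assumed cleanup of a
        ["ipa", "user-del", login],         # leftover user from a previous pass
        ["ipa", "user-add", login, "--first=Test", "--last=User"],
        ["ipa", "user-find", login],
        ["ipa", "user-disable", login],
        ["ipa", "user-enable", login],
        ["ipa", "user-mod", login, "--title=longevity"],  # placeholder modification
        ["ipa", "user-del", login],
    ]


def run_cycle(login, delay=1.0, runner=subprocess.call):
    """Run one pass for a user and return the exit code of each command.

    runner is injectable so the cycle can be dry-run without an IPA server.
    """
    results = []
    for cmd in admin_cycle(login):
        results.append(runner(cmd))
        time.sleep(delay)  # assumed interpretation of "1 sec delay per thread"
    return results
```

A dry run that does not touch a server would look like `run_cycle("vuser01", delay=0, runner=lambda cmd: 0)`. Each of the 10 virtual user threads would call `run_cycle` in a loop over its share of the 30 cycling users.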

Kerberos Authn:
Test 1:
- Positive Test Scenario
- Reached 10 Virtual Users
- 1 sec delay per thread
- Total users in test 1000

****Beaker Provisioned:
Beaker Provisioned: Job J216284

****Hardware in System Test Environment:
Dell PowerEdge M710 Blade Server, 2 Socket, 8 Core, 16 processors

sti-high-1.testrelm.com 10.16.24.27
IPA Master - Dell PowerEdge M710 Blade Server, 2 Socket, 8 Core, 16 processors,
16GB Ram, x86-64
        Filesystem                            Size  Used Avail Use% Mounted on
        /dev/mapper/vg_stihigh1-lv_root        50G  5.1G   42G  11% /
        tmpfs                                 7.8G  288K  7.8G   1% /dev/shm
        /dev/sda1                             485M   37M  423M   8% /boot
        /dev/mapper/vg_stihigh1-lv_home        1.8T  196M  1.7T   1% /home

sti-high-2.testrelm.com 10.16.24.29
IPA Slave- Dell PowerEdge M710 Blade Server, 2 Socket, 8 Core, 16 processors,
16GB Ram, x86-64
        Filesystem                            Size  Used Avail Use% Mounted on
        /dev/mapper/vg_stihigh2-lv_root        50G  4.5G   43G  10% /
        tmpfs                                 7.8G   61M  7.8G   1% /dev/shm
        /dev/sda1                             485M   37M  423M   8% /boot
        /dev/mapper/vg_stihigh2-lv_home        1.8T  196M  1.7T   1% /home

sti-high-3.testrelm.com 10.16.24.31
IPA Client - Dell PowerEdge M710 Blade Server, 2 Socket, 8 Core, 16 processors,
16GB Ram, x86-64
        Filesystem                            Size  Used Avail Use% Mounted on
        /dev/mapper/vg_stihigh3-lv_root        50G  3.3G   44G   7% /
        tmpfs                                 7.8G  260K  7.8G   1% /dev/shm
        /dev/sda1                             485M   37M  423M   8% /boot
        /dev/mapper/vg_stihigh3-lv_home        1.8T  197M  1.7T   1% /home

Now I'm getting duplicates. It seems that doing the retry upon the -1 index gets the last record again as the first record of the next batch, so I'm going to have to screen out dups.
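
The duplicate screening described in the comment above can be sketched generically. This is not the code from the attached patch. It assumes a hypothetical `fetch_batch(start_key)` callable that returns records at or after `start_key`, so a resumed fetch hands back the previous batch's last record as its first record.

```python
def read_all(fetch_batch):
    """Collect records from successive batches, dropping a record repeated at a batch boundary.

    fetch_batch(start_key) is a hypothetical callable returning a list of
    (key, value) records at or after start_key (None means "from the beginning"), or an
    empty list when nothing is left.
    """
    records = []
    last_key = None
    while True:
        batch = fetch_batch(last_key)
        # Screen out the dup: the resumed fetch may return the last record of
        # the previous batch again as the first record of this one.
        if batch and last_key is not None and batch[0][0] == last_key:
            batch = batch[1:]
        if not batch:
            break
        records.extend(batch)
        last_key = records[-1][0]
    return records


# Small in-memory stand-in for a batched source (illustrative data only).
DATA = [(1, "a"), (2, "b"), (3, "c"), (4, "d"), (5, "e")]


def fake_fetch(start_key, size=2):
    """Return up to `size` records starting at start_key, inclusive."""
    start = 0 if start_key is None else [k for k, _ in DATA].index(start_key)
    return DATA[start:start + size]


result = read_all(fake_fetch)
```

Comparing only against the last key seen is enough here because the repeat can occur only at the batch boundary. A source that could repeat records elsewhere would need a set of seen keys instead.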

0001-Ticket-347-IPA-dirsvr-seg-fault-during-system-longev.patch

To ssh://git.fedorahosted.org/git/389/ds.git
e689473..0cc661f master -> master
commit changeset:0cc661f/389-ds-base
Author: Rich Megginson rmeggins@redhat.com
Date: Fri Apr 20 20:15:21 2012 -0600

To ssh://git.fedorahosted.org/git/389/ds.git
987edff..5d45dd8 389-ds-base-1.2.10 -> 389-ds-base-1.2.10
commit changeset:5d45dd8/389-ds-base
Author: Rich Megginson rmeggins@redhat.com
Date: Fri Apr 20 20:15:21 2012 -0600

Added initial screened field value.

Metadata Update from @rmeggins:
- Issue assigned to rmeggins
- Issue set to the milestone: 1.2.10.7
