#48925 nunc-stans: ns-slapd crashes during startup with SIGILL on AMD Opteron 280
Closed: wontfix. Opened 5 years ago by nhosoi.

Description of problem:
With nunc-stans enabled ns-slapd crashes at startup on a machine with AMD
Opteron 280 CPU:
==28269== Process terminating with default action of signal 4 (SIGILL): dumping core
==28269==  Illegal opcode at address 0x516F04A
==28269==    at 0x516F04A: abstraction_dcas (abstraction_dcas.c:179)
==28269==    by 0x516EF56: freelist_push (freelist_pop_push.c:84)
==28269==    by 0x516EC7C: freelist_new_elements (freelist_new.c:67)
==28269==    by 0x516ED40: freelist_new (freelist_new.c:31)
==28269==    by 0x516E48C: stack_new (stack_new.c:21)
==28269==    by 0x516CEB6: ns_thrpool_new (ns_thrpool.c:900)
==28269==    by 0x128EA9: slapd_daemon (daemon.c:1222)
==28269==    by 0x119ACB: main (main.c:1117)

nunc-stans uses liblfds, which has this in its source:
src/abstraction/abstraction_dcas.c:183

   179      __asm__ __volatile__
   180      (
   181        "xchg %%rsi, %%rbx;"  // swap RSI and RBX
   182        "lock;"               // make cmpxchg16b atomic
   183        "cmpxchg16b %0;"      // cmpxchg16b sets ZF on success
   184        "setz       %3;"      // if ZF set, set cas_result to 1
   185        "xchg %%rbx, %%rsi;"  // re-swap RSI and RBX
   186
   187        // output
   188        : "+m" (*(volatile atom_t (*)[2]) destination), "+a" (*compare), "+d" (*(compare+1)), "=q" (cas_result)
   189
   190        // input
   191        : "S" (*exchange), "c" (*(exchange+1))
   192
   193        // clobbered
   194        : "cc", "memory"
   195      );

cmpxchg16b is not supported by some AMD processors. From
https://en.wikipedia.org/wiki/X86-64:
Early AMD64 processors (typically on Socket 939 and 940) lacked the CMPXCHG16B
instruction, which is an extension of the CMPXCHG8B instruction present on most
post-80486 processors. Similar to CMPXCHG8B, CMPXCHG16B allows for atomic
operations on octa-words. This is useful for parallel algorithms that use
compare and swap on data larger than the size of a pointer, common in lock-free
and wait-free algorithms. Without CMPXCHG16B one must use workarounds, such as
a critical section or alternative lock-free approaches.
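
The "critical section" workaround mentioned in the quote can be sketched roughly as follows. This is only an illustration, not liblfds code: the names `dwcas_lock` and `dwcas_locked` are hypothetical, and a single global pthread mutex stands in for the atomic instruction, so the result is no longer lock-free.

```c
#include <pthread.h>
#include <stdint.h>

/* Hypothetical fallback, not liblfds code: a double-word compare-and-swap
 * emulated with one global mutex for CPUs without CMPXCHG16B. */
static pthread_mutex_t dwcas_lock = PTHREAD_MUTEX_INITIALIZER;

/* Mirrors the CMPXCHG16B contract: if destination[0..1] equals
 * compare[0..1], store exchange[0..1] and return 1; otherwise copy the
 * current destination value into compare and return 0. */
static int dwcas_locked(volatile uint64_t destination[2],
                        const uint64_t exchange[2],
                        uint64_t compare[2])
{
    int swapped = 0;

    pthread_mutex_lock(&dwcas_lock);
    if (destination[0] == compare[0] && destination[1] == compare[1]) {
        destination[0] = exchange[0];
        destination[1] = exchange[1];
        swapped = 1;
    } else {
        /* Report the value actually found, as the instruction does. */
        compare[0] = destination[0];
        compare[1] = destination[1];
    }
    pthread_mutex_unlock(&dwcas_lock);

    return swapped;
}
```

Such a fallback is only correct if every access to the shared pair goes through the same lock, which is part of why it is not a drop-in replacement inside a lock-free library.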

Version-Release number of selected component (if applicable):
389-ds-base-1.3.5.10-3.el7.x86_64

How reproducible:
always

Steps to Reproduce:
0. Use a machine with AMD Opteron 280
1. Enable nunc-stans
2. Start ns-slapd

Actual results:
Server crashes with SIGILL

Expected results:
Server should start up successfully.

Additional info:
processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 15
model           : 33
model name      : Dual Core AMD Opteron(tm) Processor 280
stepping        : 2
microcode       : 0x4d
cpu MHz         : 2405.487
cache size      : 1024 KB
physical id     : 0
siblings        : 2
core id         : 0
cpu cores       : 2
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt lm 3dnowext 3dnow art rep_good nopl extd_apicid pni lahf_lm cmp_legacy
bogomips        : 4810.97
TLB size        : 1024 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp

Hi William,

I'm setting you as an owner since you own the corresponding bug.

Thanks!
--noriko

This is a rare and obscure platform. The author of liblfds and I think this is probably difficult to solve, because gcc doesn't expose a macro that indicates when the system doesn't support DWCAS (double-width compare-and-swap).

As a result, we want to mark this as a "known issue"; we cannot easily fix it, and won't attempt to unless there is significant demand.

Considering that 64-bit MS Windows 8.1 and later require DWCAS, and that most users are on newer hardware than this, I think we won't fix this issue.

We won't support slapd running with nunc-stans on those platforms.

We need some sort of customer-facing documentation that says "if you run command X and it tells you Y, you cannot enable nunc-stans".

cat /proc/cpuinfo | grep cx16

That should do it.
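
The same check can also be made from C at run time instead of reading /proc/cpuinfo. A minimal sketch, assuming GCC or Clang on x86 (`__get_cpuid` comes from their `<cpuid.h>`; the function name `cpu_has_cx16` is made up for illustration and is not from the ds source):

```c
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

/* Hypothetical helper: returns 1 if CPUID reports CMPXCHG16B support,
 * 0 otherwise (including on non-x86 builds). */
static int cpu_has_cx16(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;

    /* Leaf 1 holds the feature flags; __get_cpuid returns 0 if the
     * leaf is not available on this CPU. */
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
        return 0;
    }
    /* CMPXCHG16B is ECX bit 13, shown as "cx16" in /proc/cpuinfo. */
    return (int)((ecx >> 13) & 1U);
#else
    return 0;
#endif
}
```

On the Opteron 280 from this report the function would be expected to return 0, since its flags line has no cx16 entry.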

Looks good. I have a question: can the DS still run without nunc-stans if we set
"nsslapd-enable-nunc-stans: off"? Or is it always enabled, with no way to disable it?

If the administrator is allowed to choose, we could mention it here:
{{{
printf("ERROR: This system does not support CMPXCHG16B instruction (cpuflag cx16).\n");
printf(" Nunc-stans must NOT be used on this system. In a future release of\n");
printf(" Directory Server this platform will NOT be supported.\n\n");
}}}

Correct. It can run with nunc-stans set to off.

I will update the message accordingly!
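
For illustration only, the decision discussed above (refuse nunc-stans on CPUs without cx16, but let the server run with it set to off) might look like the sketch below. It is not the code from the commits in this ticket: the function name `nunc_stans_allowed`, its parameters, and the message wording are all assumptions.

```c
#include <stdio.h>

/* Hypothetical helper, not the code committed for this ticket.
 * configured_on: value of nsslapd-enable-nunc-stans (1 = on, 0 = off).
 * has_cx16:      1 if the CPU supports CMPXCHG16B, 0 otherwise.
 * Returns 1 if nunc-stans may be started, 0 if the server should
 * continue without it. */
static int nunc_stans_allowed(int configured_on, int has_cx16)
{
    if (!configured_on) {
        /* The administrator turned it off; nothing to check. */
        return 0;
    }
    if (!has_cx16) {
        /* Assumed wording, pointing at the config switch as a way out. */
        fprintf(stderr,
                "ERROR: This system does not support CMPXCHG16B (cpuflag cx16).\n"
                "       Set nsslapd-enable-nunc-stans: off on this system.\n");
        return 0;
    }
    return 1;
}
```

The point of returning 0 instead of aborting is that the caller can fall back to the non-nunc-stans code path, which matches the answer above that the server can run with the setting off.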

commit 5eb1977
Writing objects: 100% (5/5), 1.25 KiB | 0 bytes/s, done.
Total 5 (delta 4), reused 0 (delta 0)
To ssh://git.fedorahosted.org/git/389/ds.git
590e2fb..5eb1977 master -> master

Additional one line fix:
5eb1977..975e0fa master -> master
commit 975e0fa

Metadata Update from @firstyear:
- Issue assigned to firstyear
- Issue set to the milestone: 1.3.5.11

4 years ago

389-ds-base is moving from Pagure to Github. This means that new issues and pull requests
will be accepted only in 389-ds-base's github repository.

This issue has been cloned to Github and is available here:
- https://github.com/389ds/389-ds-base/issues/1984

If you want to receive further updates on the issue, please navigate to the github issue
and click on subscribe button.

Thank you for understanding. We apologize for all inconvenience.

Metadata Update from @spichugi:
- Issue close_status updated to: wontfix (was: Fixed)

a year ago
