#1335 pungi-gather is segfaulting
Opened 4 years ago by kevin. Modified 3 years ago

This issue is sporadic and hard to track down. ;(

Since we moved bodhi-backend01 to Fedora 30, we have been hitting segfaults in the pungi runs during updates pushes.

[Fri Jan 24 05:26:42 2020] pungi-gather[128135]: segfault at 73252d ip 000000000073252d sp 00007ffc2e880f48 error 14 in python3.7[562d2bd75000+1000]
[Fri Jan 24 05:26:42 2020] Code: Bad RIP value.
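
For reading the kernel line above: the fault address and the instruction pointer are the same value (73252d), and the error code can be decoded bit by bit. The following is a minimal sketch, assuming an x86-64 host and assuming the kernel prints the code in hexadecimal; `decode_pf_error` is a hypothetical helper name, not something from pungi or the kernel tooling.

```python
# x86 page-fault error-code bits (architectural definition).
# Assumption: the "error NN" field in the kernel log is hexadecimal.
PF_BITS = {
    0x01: "protection violation (page was present)",
    0x02: "write access",
    0x04: "user mode",
    0x08: "reserved bit set in page table",
    0x10: "instruction fetch",
}


def decode_pf_error(code_hex):
    """Return a list of human-readable flags for a page-fault error code."""
    code = int(code_hex, 16)
    flags = [desc for bit, desc in PF_BITS.items() if code & bit]
    # Bit 0 clear means the page was not mapped at all.
    if not code & 0x01:
        flags.insert(0, "page not present")
    return flags


print(decode_pf_error("14"))
```

Under those assumptions, error 14 reads as a user-mode instruction fetch from an unmapped page, which would be consistent with the process jumping to a bogus address (for example through a corrupted function pointer) rather than reading or writing bad data.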

           PID: 135273 (pungi-gather)
           UID: 48 (apache)
           GID: 48 (apache)
        Signal: 11 (SEGV)
     Timestamp: Fri 2020-01-24 08:59:48 UTC (7min ago)
  Command Line: /usr/bin/python3 -s /usr/bin/pungi-gather --config=/mnt/koji/compose/updates/Fedora-31-updates-20200124.>
    Executable: /usr/bin/python3.7
 Control Group: /system.slice/bodhi-celery.service
          Unit: bodhi-celery.service
         Slice: system.slice
       Boot ID: 505062b487c54339b0551cabf4239d41
    Machine ID: 5592ae867ee441c282a5f9f6b6fcd4da
      Hostname: bodhi-backend01.phx2.fedoraproject.org
       Storage: /var/lib/systemd/coredump/core.pungi-gather.48.505062b487c54339b0551cabf4239d41.135273.1579856388000000.>
       Message: Process 135273 (pungi-gather) of user 48 dumped core.

                Stack trace of thread 135273:
                #0  0x000000000073252d n/a (n/a)

I have been trying to get more info via strace without much luck.

Ideas for gathering more info welcome.
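
One low-effort option might be Python's standard `faulthandler` module, which writes the Python-level traceback when the interpreter receives SIGSEGV. A minimal sketch, assuming we can add a few lines near the top of the script; the helper name and the log path are hypothetical, not anything pungi ships:

```python
import faulthandler


def enable_crash_tracebacks(path):
    """Write Python tracebacks for all threads to `path` on a fatal signal.

    The returned file object must stay open for as long as the handler
    is enabled, because faulthandler writes to its file descriptor.
    """
    log = open(path, "w")
    faulthandler.enable(file=log, all_threads=True)
    return log


# Hypothetical usage near the start of the program:
# crash_log = enable_crash_tracebacks("/var/tmp/pungi-gather-crash.log")
```

The same handler can be switched on without touching the code by running the interpreter with `-X faulthandler` or setting `PYTHONFAULTHANDLER=1` in the environment, in which case the traceback goes to stderr.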


Here's the end of an strace of the segfaulting pungi-gather. It looks like it's reading comps?

...
105687 read(4, "groupid>\n      <groupid>kde-soft"..., 4096) = 4096
105687 read(4, "erare i servizi di infrastruttur"..., 4096) = 4096
105687 read(4, "KDE Plasma \345\267\245\344\275\234\347\251\272\351\226\223</name>\n "..., 4096) = 4096
105687 read(4, " panel, pulpit, ikony systemowe,"..., 4096) = 4096
105687 read(4, "roupid>\n    </optionlist>\n  </en"..., 4096) = 4096
105687 read(4, "\303\255ta\304\215e.</description>\n    <desc"..., 4096) = 4096
105687 read(4, "cription>\n    <description xml:l"..., 4096) = 4096
105687 read(4, "eve desenhado para computadores "..., 4096) = 4096
105687 read(4, "esktop-agents</groupid>\n      <g"..., 4096) = 4096
105687 read(4, "pl\">LXQt jest lekkim \305\233rodowiski"..., 4096) = 4096
105687 read(4, "die MATE</name>\n    <name xml:la"..., 4096) = 4096
105687 read(4, " no GNOME 2 e oferece uma podero"..., 4096) = 4096
105687 read(4, "e>\n    <name xml:lang=\"nb\">Minim"..., 4096) = 4096
105687 read(4, "N\">Fedora Server \347\211\210\346\234\254</name>\n "..., 4096) = 4096
105687 read(4, "ngan Desktop Sugar</name>\n    <n"..., 4096) = 4096
105687 read(4, "\333\214\330\261\333\214 \330\257\330\261\330\250\330\247\330\261\331\207\331\224 \333\214\330\247\330\257\332\257\333\214"..., 4096) = 4096
105687 read(4, "\260\225 \340\260\270\340\260\276\340\260\253\340\261\215\340\260\237\340\261\201\340\260\265\340\261\207\340\260\260\340\261"..., 4096) = 4096
105687 read(4, "Servidor Web</name>\n    <name xm"..., 4096) = 4096
105687 read(4, "ion>\n    <description xml:lang=\""..., 4096) = 4096
105687 read(4, "\321\200\321\201\320\276\320\275\320\260\320\273\321\214\320\275\320\270\321\205 \320\272\320\276\320\274\320\277\342\200\231"..., 4096) = 4096
105687 read(4, "cription xml:lang=\"ca\">Un entorn"..., 4096) = 4096
105687 read(4, "\340\245\211\340\244\252 \340\244\265\340\244\276\340\244\244\340\244\276\340\244\265\340\244\260\340\244\243 \340\244\234"..., 4096) = 4096
105687 read(4, "<description xml:lang=\"zh_CN\">\344\270"..., 4096) = 2605
105687 read(4, "", 4096)                = 0
105687 close(4)                         = 0
105687 --- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_ACCERR, si_addr=0x5593deb16050} ---
105687 +++ killed by SIGSEGV (core dumped) +++

What was it on before? F29? Trying to figure out the delta. If it's crashing while reading comps, the issue may be in libcomps. That maxed out at 0.1.11 in F29. In F30 it started at 0.1.12 and is now 0.1.14.

It was f29, now f30.

So, I see this:

https://github.com/rpm-software-management/libcomps/pull/56

that change went into 0.1.12. It's supposed to fix a segfault, but I guess for us it could possibly cause one?

ok, downgraded to libcomps-0.1.11-1.fc30. We should see what happens with the next push at 00:00UTC

OK. However, sad news - poked about a bit more and found/remembered that only python3-dnf uses libcomps (via python3-libcomps). libdnf has a bunch of its own code for handling comps. So we might also be in that, in which case it's time for some hard drinkin'.

At a quick glance, the only libdnf change which seems to touch comps code is this one. But, might be missing something.

I managed to get a python stack trace with GDB:

(gdb) py-bt
Traceback (most recent call first):
  Garbage-collecting
  File "/usr/lib/python3.7/site-packages/pungi/dnf_wrapper.py", line 130, in get_langpacks
    result.append({"name": name, "install": install})
  File "/usr/bin/pungi-gather", line 127, in main
    gather_opts.langpacks = dnf_obj.comps_wrapper.get_langpacks()
  File "/usr/bin/pungi-gather", line 179, in <module>
    main(persistdir, cachedir)

The problem seems to be that swig is generating invalid code for libdnf Python bindings.
There is an update with fixed swig.
https://bodhi.fedoraproject.org/updates/FEDORA-2020-ba4b52e9ff
And I asked for libdnf to be rebuilt with it:
https://bugzilla.redhat.com/show_bug.cgi?id=1798389

Since no-one else seemed to be doing it, I did the rebuild and sent an update.

This looks to have been resolved by the swig/dnf update. I'll close this ticket.

Metadata Update from @lsedlar:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

4 years ago

Metadata Update from @mohanboddu:
- Issue status updated to: Open (was: Closed)

3 years ago

This started happening again from rawhide's 28.n.0 compose.

Things that changed on the box before the compose:

Packages Altered:
    Upgrade  libarchive-3.4.3-1.fc32.x86_64                 @updates
    Upgraded libarchive-3.4.2-1.fc32.x86_64                 @@System
    Upgrade  python-unversioned-command-3.8.3-1.fc32.noarch @updates
    Upgraded python-unversioned-command-3.8.2-2.fc32.noarch @@System
    Upgrade  python3-3.8.3-1.fc32.x86_64                    @updates
    Upgraded python3-3.8.2-2.fc32.x86_64                    @@System
    Upgrade  python3-libs-3.8.3-1.fc32.x86_64               @updates
    Upgraded python3-libs-3.8.2-2.fc32.x86_64               @@System
    Upgrade  python3-tkinter-3.8.3-1.fc32.x86_64            @updates
    Upgraded python3-tkinter-3.8.2-2.fc32.x86_64            @@System

I am also attaching the coredumps from the first one to the recent one.

I get a 403 trying to download those coredumps.

There are of course a whole lot of potential causes for this. Ultimately, I think it would be more robust for pungi to run dnf out of process, for one thing.
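
To illustrate what "out of process" buys: if the crashing work runs in a child interpreter, a segfault kills only the child and the parent can see which signal ended it. This is a minimal sketch with the standard `subprocess` module on a POSIX system; `run_isolated` is a hypothetical helper and none of this is pungi's or dnf's actual API.

```python
import signal
import subprocess
import sys


def run_isolated(code):
    """Run a Python snippet in a child interpreter and report how it ended."""
    proc = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
    )
    if proc.returncode < 0:
        # On POSIX, a negative return code means "killed by that signal".
        return ("killed", signal.Signals(-proc.returncode).name, proc.stdout)
    return ("exited", proc.returncode, proc.stdout)


# The parent survives even when the child dies with SIGSEGV.
print(run_isolated("import os, signal; os.kill(os.getpid(), signal.SIGSEGV)"))
```

The parent could then retry or fail the compose phase with a clear error instead of dying with a bare core dump. The cost is that results have to be passed back through stdout or files rather than as live Python objects.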

I was tagged into this issue relating to the ostree embed templates; do we think this is somehow related to that? I am doubtful (note it's running ostree out of process) but it could somehow relate to running through new codepaths in lorax.

> I get a 403 trying to download those coredumps.

This should be fixed now.

> I was tagged into this issue relating to the ostree embed templates; do we think this is somehow related to that? I am doubtful (note it's running ostree out of process) but it could somehow relate to running through new codepaths in lorax.

I am not sure.

@walters if you're referring to my IRC message, that was me rolling straight out of bed and into this issue without coffee, so it's possible I may have alighted on the wrong thing :) I just saw it was failing on those Lorax templates...

OK so installing debuginfo,

(gdb) t a a bt

Thread 1 (Thread 0x7fbfcbcae740 (LWP 1751577)):
#0  0x0000000000000031 in  ()
#1  0x00007fbfcbf26089 in visit_decref (op=<unknown at remote 0x562eca28ad60>, parent=<optimized out>) at /usr/src/debug/python3-3.8.3-1.fc32.x86_64/Objects/typeobject.c:3610
#2  0x00007fbfcbf25d74 in subtract_refs (containers=containers@entry=0x7fbfcc13a610 <_PyRuntime+368>) at /usr/src/debug/python3-3.8.3-1.fc32.x86_64/Modules/gcmodule.c:406
#3  0x00007fbfcbf25487 in collect (generation=0, n_collected=0x7fff9bcb9d38, n_uncollectable=0x7fff9bcb9d40, nofail=0, state=0x7fbfcc13a5f8 <_PyRuntime+344>) at /usr/src/debug/python3-3.8.3-1.fc32.x86_64/Modules/gcmodule.c:1054
#4  0x00007fbfcbfa425e in collect_with_callback (generation=generation@entry=0, state=0x7fbfcc13a5f8 <_PyRuntime+344>) at /usr/src/debug/python3-3.8.3-1.fc32.x86_64/Modules/gcmodule.c:1240
#5  0x00007fbfcbf1ee65 in collect_generations (state=<optimized out>) at /usr/src/debug/python3-3.8.3-1.fc32.x86_64/Modules/gcmodule.c:1978
#6  _PyObject_GC_Alloc (use_calloc=<optimized out>, basicsize=<optimized out>) at /usr/src/debug/python3-3.8.3-1.fc32.x86_64/Modules/gcmodule.c:1977
#7  0x00007fbfcbf1f535 in _PyObject_GC_Malloc (basicsize=<optimized out>) at /usr/src/debug/python3-3.8.3-1.fc32.x86_64/Modules/gcmodule.c:1999
#8  _PyObject_GC_New (tp=0x7fbfcc134c60 <PyDict_Type>) at /usr/src/debug/python3-3.8.3-1.fc32.x86_64/Modules/gcmodule.c:1999
#9  0x00007fbfcbf1f4cc in new_dict (values=0x7fbfcc156dd0 <empty_values>, keys=0x7fbfcc11e9a0 <empty_keys_struct>) at /usr/src/debug/python3-3.8.3-1.fc32.x86_64/Objects/dictobject.c:609
#10 PyDict_New () at /usr/src/debug/python3-3.8.3-1.fc32.x86_64/Objects/dictobject.c:702
#11 0x00007fbfcbf334cb in _PyEval_EvalFrameDefault (f=<optimized out>, throwflag=<optimized out>) at /usr/src/debug/python3-3.8.3-1.fc32.x86_64/Python/ceval.c:2877
#12 0x00007fbfcbf3d0e7 in PyEval_EvalFrameEx (throwflag=0, f=Frame 0x7fbfc7b05240, for file /usr/lib/python3.8/site-packages/pungi/dnf_wrapper.py, line 141, in get_langpacks (self=<CompsWrapper(dnf=<DnfWrapper(_closed=False, _conf=<Conf(_config=<ConfigMain(this=<SwigPyObject at remote 0x7fbfc7c0c040>) at remote 0x7fbfc7c08f50>, _section='main', substitutions=<Substitutions at remote 0x7fbfcb2ade50>, tempfiles=[]) at remote 0x7fbfc7c08c80>, _goal=<Goal(group_members=set()) at remote 0x7fbfc7b596b0>, _repo_persistor=None, _sack=<Sack at remote 0x7fbfc7bf9dd0>, _transaction=None, _priv_ts=None, _comps=<Comps(_i=<_libpycomps.Comps at remote 0x7fbfc7b59a60>, _langs=<_Langs(last_locale=None, cache=None) at remote 0x7fbfc7c0c0a0>) at remote 0x7fbfc7b5e6e0>, _comps_trans=<TransactionBunch(_install=set(), _install_opt=set(), _remove=set(), _upgrade=set()) at remote 0x7fbfc7b5e690>, _history=None, _tempfiles=set(), _trans_tempfiles=set(), _ds_callback=<Depsolve at remote 0x7fbfc7c0c190>, _logging=<Logging(stdout_handler=None, stderr_handler=None) at remote 0x7fbfc7c0c460>, _repo...(truncated)) at /usr/src/debug/python3-3.8.3-1.fc32.x86_64/Python/ceval.c:738
#13 function_code_fastcall (co=<optimized out>, args=<optimized out>, nargs=1, globals=<optimized out>) at /usr/src/debug/python3-3.8.3-1.fc32.x86_64/Objects/call.c:283

Definitely some libcomps objects involved there. I only briefly glanced at that libcomps commit, but it does look like a potential issue.

The problem appears to be in python3-3.8.3-1.fc32, since composes work again after downgrading to python3-3.8.2-2.fc32.

https://kojipkgs.fedoraproject.org/compose/rawhide/Fedora-Rawhide-20200529.n.2/STATUS

Keeping the ticket open to check whether something needs to be changed on the pungi side. If it's a Python bug, please close this ticket with a reference to that bug so that we can follow it.
