#3679 vm: Retry libvirt connection
Merged a year ago by tkopecek. Opened a year ago by tkopecek.
tkopecek/koji issue985  into  master

file modified
+39 -1
@@ -59,6 +59,44 @@ 

  )

  

  

+ class LibvirtCallable(object):

+     # Wrapper class for libvirt call, used in LibvirtConnection

+     def __init__(self, libvirt_conn, method):

+         self.conn = libvirt_conn

+         self.method = method

+ 

+     def __call__(self, *args, **kwargs):

+         max_retries = 3

+         retry = 0

+         for retry in range(max_retries):

+             try:

+                 return getattr(self.conn._conn, self.method)(*args, **kwargs)

+             except libvirt.libvirtError as ex:

+                 if ex.get_error_code() == libvirt.VIR_ERR_INTERNAL_ERROR and retry < max_retries:

+                     logging.getLogger("koji.vm").warning(

+                         'Internal libvirt error: %s, retry #%d/%d' % (ex, retry, max_retries))

+                     time.sleep(5)

+                     try:

+                         self.conn._conn.close()

+                     except Exception:

+                         pass

+                     self.conn._conn = libvirt.open(None)

+                 else:

+                     raise

+ 

+ 

+ class LibvirtConnection(object):

+     # Wrapper class for libvirt connection retrying calls failed on libvirt connection

+     # e.g. libvirtd restart. Without it kojivmd stucks in main loop

+     def __init__(self) -> None:

+         self._conn = libvirt.open(None)

+ 

+     def __getattr__(self, name):

+         if name == '_conn':

+             return self._conn

+         return LibvirtCallable(self, name)

+ 

+ 

  # Register libvirt handler

  def libvirt_callback(ignore, err):

      if err[3] != libvirt.VIR_ERR_ERROR:
@@ -916,7 +954,7 @@ 

  class VMTaskManager(TaskManager):

      def __init__(self, options, session):

          super(VMTaskManager, self).__init__(options, session)

-         self.libvirt_conn = libvirt.open(None)

+         self.libvirt_conn = LibvirtConnection()

          self.macaddrs = {}

          self.macaddr_lock = threading.Lock()

          self.expired_vms = {}

Simple retries, it will allow kojivmd to continue to work. Nevertheless, it can potentially leave running VMs.

Overall approach seems fine.

I had a slight concern about whether the underlying connection object could have any non-callable attributes we might care about, but it appears not.

def __getattr__(self, __name: str):

I would avoid using type hints in Koji code for now.

Also, why not just name. I believe this is the expected signature.

while True:
    try:
        return getattr(self.conn._conn, self.method)(*args, **kwargs)
    except libvirt.libvirtError as ex:
        if ex.get_error_code() == libvirt.VIR_ERR_INTERNAL_ERROR:
            if retry > max_retries:
                raise
            retry += 1
            logging.getLogger("koji.vm").warning(
                'Internal libvirt error: %s, retry #%d/%d' % (ex, retry, max_retries))
            time.sleep(5)
            self.conn._conn = libvirt.open(None)

This is ignoring errors that don't match the code. Perhaps you wanted an unconditional raise at the end?

The original issue was concerned with stale connections after libvirt restarts. It looks like this could catch a somewhat broader class. Is there any reasonable way to pare it down?

If we're just going to have a fixed number of retries, it might be better to just have for retry in range(N)

We can catch a string in the exception. "Internal libvirt error: internal error: client socket is closed" on the other hand, maybe it shouldn't hurt to retry on all libvir internal errors?

rebased onto b8ea345

a year ago

I'm cautious about broad retries unless I'm sure that the calls involved are idempotent.

Also, I note that that a lot of what we do with libvirt is via the vm objects rather than the connection objects, which are not wrapped here.

It's been a several years since I filed this, and unfortunately the issue is light on detail, but I believe I was mainly concerned about a pattern of continual failure (or task skipping) in kojivmd due to the fact that VMTaskManager uses a single connection (defined in its init) that persists for the life of the daemon. So, main thing is to address reconnection there. Might be safer was well.

If we continue with the wrapper, we probably need to do something special with close.

OTOH, we could just force a reconnect when we catch an error in main

Whichever way we go, we should probably attempt to call close() on the failing connection before we throw it away.

1 new commit added

  • try to close failing connection
a year ago

It is intentional, that vm is not wrapped. This wrapper catches only main loop problems. Underlying VM will still run which I've mentioned in the first comment. So, main loop will continue, current task will fail, VM still runs (if it exists).

It can be easily tested - just restart libvirtd while task is running. Place of failure is in checkMem which uses stale connection in that point.

Next option is to catch the problem just there - it is like a watchog, as it is called every while. Maybe better place?

This wrapper catches only main loop problems

If we only want to catch main loop problems, we shouldn't use LibvirtConnection in the VMExecTask handler. Granted the calls there are mostly informational, but it's a little unclear to me how safe it is retry conn.defineXML(clone_xml).

The main loop calls appear to be all informational, at the libvirt_conn level. listDomainsID, lookupByID, lookupByName, listDefinedDomains.

1 new commit added

  • use wrapper only in main thread
a year ago

Metadata Update from @tkopecek:
- Pull-request tagged with: testing-ready

a year ago

Metadata Update from @tkopecek:
- Pull-request untagged with: testing-ready
- Pull-request tagged with: no_qe

a year ago

Commit 6d80b3b fixes this pull-request

Pull-Request has been merged by tkopecek

a year ago
Metadata