#2 syscall bypass in io_getevents when the timeout is zero
Opened 5 years ago by avik. Modified 5 years ago

Currently io_getevents() bypasses the system call only when the timeout is zero and there are no events waiting, but it could just as well bypass the system call when events are waiting.

Hi, Avi,

I'm not sure I understand why that would make sense. A timeout of zero indicates that you want to reap available events--it's a non-blocking call, in other words. Could you provide more information on your anticipated use case?

The use case is an application that multiplexes compute with I/O on a single thread. If the application has outstanding I/O and outstanding compute tasks, then it will issue io_getevents() with a zero timeout. If no events were returned, it will go ahead and process its compute tasks. If events were received, it will process them, possibly scheduling new compute tasks to continue computation.

Another reason is that there is no support for IOCB_CMD_POLL, so you are obliged to call both io_getevents() and epoll_wait().

See https://github.com/scylladb/seastar for an application that does this. It uses io_getevents() with a zero timeout, which works, but performs an unnecessary system call.

btw, the description was not precise: we can elide the system call if the number of events in the ring is greater than or equal to min_nr. Often, min_nr == , so we can elide the system call if either of the conditions is true:
- number of events >= min_nr
- timeout is zero

btw2, QEMU implemented this privately. /cc @bonzini

Avi Kivity pagure@pagure.io writes:

   btw, the description was not precise: we can elide the system call if
   the number of events in the ring is greater than or equal to

Aha! Now I understand your request. Thanks for the clarification!
I'll get this fixed up.


After thinking about this some more, I believe the library function is correct as-is. When a timeout is specified, the system call will return whatever events are available when the timeout occurs, without regard for min_nr. See the man page's "RETURN VALUE" description:

   On  success,  io_getevents() returns the number of events read: 0 if no
   events are available, or less than min_nr if the timeout  has  elapsed.
   For the failure return, see NOTES.

I worry that if I were to special case a timeout of zero to function as you described, then there would be buggy applications that would never make progress because the number of events available is less than min_nr and they never issue more I/O.

You can work around this issue on your own by looking at the ring directly (which I guess you already know). I didn't see any code in qemu to do this, however.

Please let me know if you have any questions, or if I haven't swayed your opinion. :-)

The current code is not wrong, just inefficient. It issues a system call when it could do all processing in user space. In the same way that it optimizes the no-events zero-timeout case, it could also optimize events-exist zero-timeout, as well as the sufficient-events-exist any-timeout case.

Of course an application could do this by itself, but libraries are for encapsulating this kind of code, not for forcing applications to duplicate the ring processing code.

The ring processing code in qemu is in linux-aio.c, see struct aio_ring and its users.

I don't understand why you think application behavior would change. The library function would behave the same way; it just wouldn't call the kernel any more.

OK, I think I finally understand what you were getting at. Your proposal is to issue the system call only when waiting would actually be involved.

Yes, we could do this. I will put it back to you, though, to show me a workload where this matters. Do you have performance numbers that show that this is an issue?

Since I did not implement the optimization, I can't show you hard numbers. QEMU implemented the optimization (outside libaio), and they do have numbers, IIRC.
