#955 Improve libnl integration
Closed: Fixed None Opened 9 years ago by dpal.

This has come up a couple of times in the past.
The use case:
1. Had a laptop connected to a network via VPN
2. Screen locked (or machine was suspended)
3. VPN connection dropped and Kerberos tickets expired (a usual weekend case)
4. SSSD does not know that the connection is lost
5. Unlock the screen, wait for the configured delay for SSSD to detect that it is in fact offline
6. Start VPN
7. Go check your mail
8. Mail client tries to acquire the service ticket but can't because the TGT has not been renewed yet.
9. It takes about a couple of minutes for SSSD to detect that the connection was established and that the KDC might be accessible once again.
10. Currently SSSD just checks periodically and tries to re-establish the connection.

Suggestion: instead of periodic polling, subscribe to Netlink RTMGRP_LINK to receive messages about interface changes. This would allow SSSD to react faster both to lost connectivity and to re-established connections.


Replying to [ticket:955 dpal]:

Suggestion: instead of periodic polling, subscribe to Netlink RTMGRP_LINK to receive messages about interface changes. This would allow SSSD to react faster both to lost connectivity and to re-established connections.

We have been doing so since SSSD 1.3, albeit only for going online, not offline. We are subscribed to RTNLGRP_LINK and try to go online if we detect IFF_LOWER_UP, which is the indicator of carrier status.

The reason we have been only detecting going online is that we have no idea which interface went down. The only thing we can reasonably do is reset the offline flag if we detect link up.

This requires libnl at least 1.1, so this feature is not available on RHEL5.
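
For illustration, the link subscription described above could look roughly like the sketch below. It uses the raw rtnetlink socket interface from the kernel headers rather than libnl, so it is not SSSD's actual code; the function names (open_link_monitor, msg_reports_carrier, wait_for_carrier) are hypothetical.

```c
#include <errno.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <linux/if.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>

/* Open a NETLINK_ROUTE socket subscribed to link notifications.
 * Returns the socket fd, or -1 on error. */
int open_link_monitor(void)
{
    struct sockaddr_nl addr;
    int fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);

    if (fd < 0)
        return -1;

    memset(&addr, 0, sizeof(addr));
    addr.nl_family = AF_NETLINK;
    addr.nl_groups = RTMGRP_LINK;   /* legacy bitmask form of the link group */

    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Return 1 if the message is an RTM_NEWLINK whose interface flags
 * include IFF_LOWER_UP (carrier present), 0 otherwise. */
int msg_reports_carrier(const struct nlmsghdr *nh)
{
    const struct ifinfomsg *ifi;

    if (nh->nlmsg_type != RTM_NEWLINK)
        return 0;
    if (nh->nlmsg_len < NLMSG_LENGTH(sizeof(*ifi)))
        return 0;   /* too short to carry an ifinfomsg */

    ifi = NLMSG_DATA(nh);
    return (ifi->ifi_flags & IFF_LOWER_UP) != 0;
}

/* Block until some interface reports carrier.
 * Returns 1 when that happens, -1 on a receive error. */
int wait_for_carrier(int fd)
{
    /* Declared as an nlmsghdr array so the buffer is suitably aligned. */
    struct nlmsghdr buf[8192 / sizeof(struct nlmsghdr)];

    for (;;) {
        struct nlmsghdr *nh;
        int len = (int)recv(fd, buf, sizeof(buf), 0);

        if (len < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        for (nh = buf; NLMSG_OK(nh, len); nh = NLMSG_NEXT(nh, len)) {
            if (msg_reports_carrier(nh))
                return 1;   /* caller would reset the offline flag here */
        }
    }
}
```

Note that the message only says that some interface has carrier; as mentioned above, it does not say whether that interface is the one that matters for reaching the server.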

I suggest we track any problems you have in individual tickets. I would still like to keep this ticket open because, when I was analyzing the logs from IT, I noticed that the resetOffline was delivered too often and, what is more suspicious, sometimes periodically (every two minutes). I think that looks like a bug in our netlink code.

I think we are talking about slightly different logic here. In the current situation, when we know that we are offline, we just periodically (every 2 min AFAIR) check whether we are back online.
Instead of this, I suggest we subscribe to the events that inform us about new network interfaces being added and try to connect as soon as a notification is sent to us, rather than polling periodically. Even if it is not the right interface, this approach would reduce the number of attempts, since it does not make sense to try to connect if no interface was added since the last time you tried.

Replying to [comment:2 dpal]:

Instead of this, I suggest we subscribe to the events that inform us about new network interfaces being added and try to connect as soon as a notification is sent to us, rather than polling periodically.

This is what we should be doing even now. For some reason, we seem to be getting resetOffline notifications from netlink periodically in some cases. This is what I've noticed in the IT logs and what I need to investigate.

Replying to [comment:2 dpal]:

I think we are talking about slightly different logic here. In the current situation, when we know that we are offline, we just periodically (every 2 min AFAIR) check whether we are back online.
Instead of this, I suggest we subscribe to the events that inform us about new network interfaces being added and try to connect as soon as a notification is sent to us, rather than polling periodically. Even if it is not the right interface, this approach would reduce the number of attempts, since it does not make sense to try to connect if no interface was added since the last time you tried.

This is exactly what we are doing (as jhrozek points out). The two-minute check is only a fallback to handle the case where the outage is on the other end of the connection (e.g. the LDAP server was being rebooted), in which case no change would occur on OUR end to indicate that we needed to go online.

Jakub, I have my suspicions about the extra resetOffline notifications. I think that NetworkManager is causing them when it scans the various WiFi networks in the area.
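
As a rough sketch of the behavior described in this comment (reconnect on a netlink event, with a timed retry only as a fallback), the decision could be written as below. The struct, the function name and the 120-second constant are assumptions for illustration, the last one taken from the "every 2 min AFAIR" remark; this is not SSSD's code.

```c
#include <time.h>

/* Assumption: fallback interval of two minutes, per the discussion. */
#define FALLBACK_INTERVAL_SEC 120

/* Hypothetical state kept by the monitor. */
struct offline_state {
    int offline;          /* non-zero while the backend is marked offline */
    time_t last_attempt;  /* time of the last reconnection attempt */
};

/* Decide whether a reconnection attempt should be made now.
 * netlink_event is non-zero if a relevant netlink notification
 * arrived since the last attempt. */
int should_try_reconnect(const struct offline_state *st, time_t now,
                         int netlink_event)
{
    if (!st->offline)
        return 0;   /* already online, nothing to do */
    if (netlink_event)
        return 1;   /* local network changed: try immediately */

    /* No local change: only the remote side can have recovered,
     * so fall back to the periodic retry. */
    return (now - st->last_attempt) >= FALLBACK_INTERVAL_SEC;
}
```

With this shape, spurious netlink events translate directly into extra reconnection attempts, which is why filtering them matters.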

Replying to [comment:4 sgallagh]:

Jakub, I have my suspicions about the extra resetOffline notifications. I think that NetworkManager is causing them when it scans the various WiFi networks in the area.

Right, that's what I suspected as well.

My plan is to catch dcbw online and ask him whether we can filter these events out. I looked at the flags libnl passes us but nothing struck me immediately. I think it would be nice to filter the extra messages because the number of online resets would decrease and so would the number of times we try to resolve hostnames etc.

When I am working from home, I have a reproducible delay of at least a minute between getting on the VPN and the machine refreshing my Kerberos ticket. I have to wait until this time elapses before I can use my mail client. This behavior is consistent and easily reproducible. Let me know if you need me to provide logs for this scenario. It is really annoying that SSSD can't detect that the VPN has already been up for a minute and renew the ticket right away.

Replying to [comment:5 jhrozek]:

My plan is to catch dcbw online and ask him whether we can filter these events out. I looked at the flags libnl passes us but nothing struck me immediately. I think it would be nice to filter the extra messages because the number of online resets would decrease and so would the number of times we try to resolve hostnames etc.

I finally managed to discuss the issue with Dan. His response was:

LOWER_UP has no meaning on wifi interfaces at this time, because it may
only mean that there's an association with an access point, but
certainly doesn't mean that addressing or anything else has been
completed. So my recommendation is to ignore LOWER_UP/UP events for
wifi interfaces completely; the semantics of wifi are just different and
the same procedures don't apply.

So we should only use LOWER_UP for non-wireless interfaces.

In general, Dan suggested resetting our offline status when any of the following happens:
- routing table changes
- interface flags change (sans the LOWER_UP for wireless)
- IP addresses change

Translated into code, we need to add ourselves to two more libnl groups:
one for the routing table and one for the IP address.

I already have a WIP patch.
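
A minimal sketch of the rules listed above, written against the kernel rtnetlink constants rather than libnl, and not taken from the WIP patch. The function names are hypothetical, and how an interface is determined to be wireless is left to the caller (the is_wireless parameter is an assumption of this sketch).

```c
#include <sys/socket.h>
#include <linux/if.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>

/* Multicast groups to join: links, plus routing table and address
 * changes for both IPv4 and IPv6 (legacy bitmask form). */
unsigned int monitor_groups(void)
{
    return RTMGRP_LINK
         | RTMGRP_IPV4_ROUTE  | RTMGRP_IPV6_ROUTE
         | RTMGRP_IPV4_IFADDR | RTMGRP_IPV6_IFADDR;
}

/* Decide whether a notification should reset the offline status.
 *
 * msg_type    - nlmsg_type of the received message
 * old_flags   - interface flags seen previously (link messages only)
 * new_flags   - interface flags in this message (link messages only)
 * is_wireless - non-zero if the caller knows the interface is wireless
 */
int should_reset_offline(unsigned short msg_type,
                         unsigned int old_flags,
                         unsigned int new_flags,
                         int is_wireless)
{
    unsigned int changed;

    switch (msg_type) {
    case RTM_NEWROUTE:
    case RTM_DELROUTE:
        return 1;                   /* routing table changed */
    case RTM_NEWADDR:
    case RTM_DELADDR:
        return 1;                   /* IP addresses changed */
    case RTM_NEWLINK:
        changed = old_flags ^ new_flags;
        if (is_wireless) {
            /* LOWER_UP carries no useful meaning on wifi: ignore it. */
            changed &= ~(unsigned int)IFF_LOWER_UP;
        }
        return changed != 0;        /* interface flags changed */
    default:
        return 0;
    }
}
```

The sketch masks only IFF_LOWER_UP for wireless interfaces, following the list above; Dan's quoted reply goes further and recommends ignoring UP events on wifi as well, which would mean masking IFF_UP too.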

milestone: SSSD Deferred => NEEDS_TRIAGE

Fields changed

milestone: NEEDS_TRIAGE => SSSD 1.7.0

Fields changed

owner: somebody => jhrozek
summary: Monitor the network changes in real time => Improve libnl integration

Fields changed

status: new => assigned

Fields changed

patch: 0 => 1

Fixed by:
- 4745ac5
- 172bf27
- 4e3495b

resolution: => fixed
status: assigned => closed

Fields changed

rhbz: => 0

Metadata Update from @dpal:
- Issue assigned to jhrozek
- Issue set to the milestone: SSSD 1.7.0

3 years ago

SSSD is moving from Pagure to Github. This means that new issues and pull requests
will be accepted only in SSSD's github repository.

This issue has been cloned to Github and is available here:
- https://github.com/SSSD/sssd/issues/1997

If you want to receive further updates on the issue, please navigate to the github issue
and click on subscribe button.

Thank you for understanding. We apologize for all inconvenience.
