#2355 Requests queued during transition from offline to online mode
Closed: Fixed None Opened 6 years ago by endzone.

Using SSSD with "cache_credentials = true", users may experience periodic blocking for up to 6 seconds if SSSD is switching from offline to online mode and the LDAP server is unreachable.

The first request to SSSD after it has been offline for more than 60 seconds is immediately answered from the cache, but then triggers a reconnection trial to the LDAP server.

All subsequent requests reaching SSSD during the connection phase are queued and answered once the connection succeeds or fails. In case the LDAP server is unreachable, SSSD waits for 6 seconds before the connection trial is aborted. This means that the user may experience a delay of up to 6 seconds every 60 seconds (in the worst case).

See the following debug logs where the LDAP server is not responding, starting off in offline mode:

The first request to SSSD (which is triggering the reconnection trial) gets answered right away in offline mode:

(Tue Jun 3 08:16:42 2014) [sssd[be[default]]] [be_get_account_info]
(0x0100): Got request for [4097][1][idnumber=10011]
(Tue Jun 3 08:16:42 2014) [sssd[be[default]]] [be_get_account_info]
(0x0100): Request processed. Returned 1,11,Fast reply - offline
(Tue Jun 3 08:16:42 2014) [sssd[be[default]]] [sdap_id_op_connect_step]
(0x4000): beginning to connect
...
SSSD is now trying to reconnect to the LDAP server.

Only the subsequent requests that are received while SSSD is trying to (re-)connect to the LDAP server are queued until the connection times out (for at most 6 seconds). These pending requests are causing the system to block:

...

(Tue Jun 3 08:16:42 2014) [sssd[be[default]]] [be_get_account_info]
(0x0100): Got request for [4097][1][name=brauchle]
(Tue Jun 3 08:16:42 2014) [sssd[be[default]]] [sdap_id_op_connect_step]
(0x4000): waiting for connection to complete
...
(Tue Jun 3 08:16:42 2014) [sssd[be[default]]] [be_get_account_info]
(0x0100): Got request for [4097][1][idnumber=10011]
(Tue Jun 3 08:16:42 2014) [sssd[be[default]]] [sdap_id_op_connect_step]
(0x4000): waiting for connection to complete
...

--> this is the time where the system may be unresponsive for 6 seconds <--

...

(Tue Jun 3 08:16:48 2014) [sssd[be[default]]] [sdap_id_op_connect_done]
(0x0020): Failed to connect, going offline (5 [Input/output error])
(Tue Jun 3 08:16:48 2014) [sssd[be[default]]] [be_mark_offline]
(0x2000): Going offline!
(Tue Jun 3 08:16:48 2014) [sssd[be[default]]] [be_run_offline_cb]
(0x0080): Going offline. Running callbacks.
(Tue Jun 3 08:16:48 2014) [sssd[be[default]]] [sdap_id_op_connect_done]
(0x4000): notify offline to op #1
(Tue Jun 3 08:16:48 2014) [sssd[be[default]]] [sdap_id_op_connect_done]
(0x4000): notify offline to op #2
(Tue Jun 3 08:16:48 2014) [sssd[be[default]]] [acctinfo_callback]
(0x0100): Request processed. Returned 1,11,Offline
(Tue Jun 3 08:16:48 2014) [sssd[be[default]]] [sdap_id_op_connect_done]
(0x4000): notify offline to op #3
(Tue Jun 3 08:16:48 2014) [sssd[be[default]]] [acctinfo_callback]
(0x0100): Request processed. Returned 1,11,Offline


After the connection times out, the queued request are answered with cached entries.

So why not keep the "offline" flag set to "true" until the LDAP connection trial returns (positive or negative) and only if positive, switch to online mode?

As the first request (triggering the reconnection) is answered from the cache anyway, there is no point to keep the subsequent ones pending until the connection is established successfully.

Possibly one needs to consider that start up phase (with cold caches) as a special case and actually do queue incoming request in this case?


Repost of the log files in readable format:

The *first* request to SSSD, answered from cache. Triggers reconnect afterwards:
>> (Tue Jun  3 08:16:42 2014) [sssd[be[default]]] [be_get_account_info]
>> (0x0100): Got request for [4097][1][idnumber=10011]
>> (Tue Jun  3 08:16:42 2014) [sssd[be[default]]] [be_get_account_info]
>> (0x0100): Request processed. Returned 1,11,Fast reply - offline
>> (Tue Jun  3 08:16:42 2014) [sssd[be[default]]] [sdap_id_op_connect_step]
>> (0x4000): beginning to connect
...
SSSD is now trying to reconnect to the LDAP server.

*Subsequent* requests are queued until the connection times out.
These pending requests are causing the system to block:

...
>> (Tue Jun  3 08:16:42 2014) [sssd[be[default]]] [be_get_account_info]
>> (0x0100): Got request for [4097][1][name=brauchle]
>> (Tue Jun  3 08:16:42 2014) [sssd[be[default]]] [sdap_id_op_connect_step]
>> (0x4000): waiting for connection to complete
...
>> (Tue Jun  3 08:16:42 2014) [sssd[be[default]]] [be_get_account_info]
>> (0x0100): Got request for [4097][1][idnumber=10011]
>> (Tue Jun  3 08:16:42 2014) [sssd[be[default]]] [sdap_id_op_connect_step]
>> (0x4000): waiting for connection to complete
...

--> this is the time where the system may be unresponsive for 6 seconds <--

...
>> (Tue Jun  3 08:16:48 2014) [sssd[be[default]]] [sdap_id_op_connect_done]
>> (0x0020): Failed to connect, going offline (5 [Input/output error])
>> (Tue Jun  3 08:16:48 2014) [sssd[be[default]]] [be_mark_offline]
>> (0x2000): Going offline!
>> (Tue Jun  3 08:16:48 2014) [sssd[be[default]]] [be_run_offline_cb]
>> (0x0080): Going offline. Running callbacks.
>> (Tue Jun  3 08:16:48 2014) [sssd[be[default]]] [sdap_id_op_connect_done]
>> (0x4000): notify offline to op #1
>> (Tue Jun  3 08:16:48 2014) [sssd[be[default]]] [sdap_id_op_connect_done]
>> (0x4000): notify offline to op #2
>> (Tue Jun  3 08:16:48 2014) [sssd[be[default]]] [acctinfo_callback]
>> (0x0100): Request processed. Returned 1,11,Offline
>> (Tue Jun  3 08:16:48 2014) [sssd[be[default]]] [sdap_id_op_connect_done]
>> (0x4000): notify offline to op #3
>> (Tue Jun  3 08:16:48 2014) [sssd[be[default]]] [acctinfo_callback]
>> (0x0100): Request processed. Returned 1,11,Offline

Fields changed

milestone: NEEDS_TRIAGE => SSSD 1.11.7

Fields changed

owner: somebody => mzidek

Fields changed

patch: 0 => 1

Pushed to sssd-1-11:
- f65efda
- e552a21
- e9ca61c

_comment0: Pushed to sssd-1-11:
f65efda
e552a21
e9ca61c
=> 1406807721234389
resolution: => fixed
status: new => closed

Metadata Update from @endzone:
- Issue assigned to mzidek
- Issue set to the milestone: SSSD 1.11.7

3 years ago

SSSD is moving from Pagure to Github. This means that new issues and pull requests
will be accepted only in SSSD's github repository.

This issue has been cloned to Github and is available here:
- https://github.com/SSSD/sssd/issues/3397

If you want to receive further updates on the issue, please navigate to the github issue
and click on subscribe button.

Thank you for understanding. We apologize for all inconvenience.

Login to comment on this ticket.

Metadata