#51096 Issue 51095 - Check the return code from the time functions
Closed 3 years ago by spichugi. Opened 3 years ago by mreynolds.
mreynolds/389-ds-base issu51095  into  master

file modified
+1
@@ -234,3 +234,4 @@ 
  ldap/servers/slapd/rust-slapi-private.h
  vendor
  vendor.tar.gz
+ /build/

@@ -185,6 +185,12 @@ 
 
      /* check if the time should be adjusted */
      delta = cur_time - gen->state.sampled_time;
+     if (delta > _SEC_PER_DAY || delta < (-1 * _SEC_PER_DAY)) {
+         /* We had a jump larger than a day */
+         slapi_log_err(SLAPI_LOG_REPL, "csngen_new_csn",
+                 "Detected large jump in CSN time.  Delta: %d (current time: %ld  vs  previous time: %ld)\n",
+                 delta, cur_time, gen->state.sampled_time);
+     }
      if (delta > 0) {
          rc = _csngen_adjust_local_time(gen, cur_time);
          if (rc != CSN_SUCCESS) {

file modified
+49 -13
@@ -61,6 +61,42 @@ 
      return 0;
  }
 
+ /*
+  * On some cloud platforms the time functions don't always work as expected.
+  * We should be checking for errors.  Since getting the time is so important
+  * to the server, if we do get an error we should retry a few times before
+  * giving up.
+  */
+ static int32_t
+ slapi_clock_gettime(int64_t clock_id, struct timespec *tp)
+ {
+     int32_t count = 0;
+ 
+     while (count < 5) {
+         if (clock_gettime(clock_id, tp) == 0) {
+             /* Success, we got the time */
+             if (count > 0) {
+                 /* This succeeded on a retry, log a follow up message */
+                 slapi_log_err(SLAPI_LOG_NOTICE, "slapi_clock_gettime",
+                         "Successfully retrieved system time\n");
+             }
+             return 0;
+         } else {
+             /* Failed to get the system time */
+             slapi_log_err(SLAPI_LOG_ERR, "slapi_clock_gettime",
+                     "Getting system time failed!  Errno (%d) (%s)  Trying again ...\n",
+                     errno, slapd_system_strerror(errno));
+             DS_Sleep(PR_MillisecondsToInterval(100));
+             count++;
+         }
+     }
+     slapi_log_err(SLAPI_LOG_CRIT, "slapi_clock_gettime",
+             "Getting system time failed!  Errno (%d) (%s)  Giving up.\n",
+             errno, slapd_system_strerror(errno));
+     return -1;
+ }
+ 
  time_t
  current_time(void)
  {
@@ -69,8 +105,8 @@ 
       * but this should be removed in favour of the
       * more accurately named slapi_current_utc_time
       */
-     struct timespec now;
-     clock_gettime(CLOCK_REALTIME, &now);
+     struct timespec now = {0};
+     slapi_clock_gettime(CLOCK_REALTIME, &now);
      return now.tv_sec;
  }
 
@@ -83,24 +119,24 @@ 
  struct timespec
  slapi_current_rel_time_hr(void)
  {
-     struct timespec now;
-     clock_gettime(CLOCK_MONOTONIC, &now);
+     struct timespec now = {0};
+     slapi_clock_gettime(CLOCK_MONOTONIC, &now);
      return now;
  }
 
  struct timespec
  slapi_current_utc_time_hr(void)
  {
-     struct timespec ltnow;
-     clock_gettime(CLOCK_REALTIME, &ltnow);
+     struct timespec ltnow = {0};
+     slapi_clock_gettime(CLOCK_REALTIME, &ltnow);
      return ltnow;
  }
 
  time_t
  slapi_current_utc_time(void)
  {
-     struct timespec ltnow;
-     clock_gettime(CLOCK_REALTIME, &ltnow);
+     struct timespec ltnow = {0};
+     slapi_clock_gettime(CLOCK_REALTIME, &ltnow);
      return ltnow.tv_sec;
  }
 
@@ -108,9 +144,9 @@ 
  slapi_timestamp_utc_hr(char *buf, size_t bufsize)
  {
      PR_ASSERT(bufsize >= SLAPI_TIMESTAMP_BUFSIZE);
-     struct timespec ltnow;
+     struct timespec ltnow = {0};
      struct tm utctm;
-     clock_gettime(CLOCK_REALTIME, &ltnow);
+     slapi_clock_gettime(CLOCK_REALTIME, &ltnow);
      gmtime_r(&(ltnow.tv_sec), &utctm);
      strftime(buf, bufsize, "%Y%m%d%H%M%SZ", &utctm);
  }
@@ -252,7 +288,7 @@ 
          expire->tv_sec = 0;
          expire->tv_nsec = 0;
      } else {
-         clock_gettime(CLOCK_MONOTONIC, expire);
+         slapi_clock_gettime(CLOCK_MONOTONIC, expire);
          expire->tv_sec += timeout;
      }
  }
@@ -278,8 +314,8 @@ 
      if (expire->tv_sec == 0 && expire->tv_nsec == 0) {
          return TIMER_CONTINUE;
      }
-     struct timespec now;
-     clock_gettime(CLOCK_MONOTONIC, &now);
+     struct timespec now = {0};
+     slapi_clock_gettime(CLOCK_MONOTONIC, &now);
      if (now.tv_sec > expire->tv_sec ||
          (expire->tv_sec == now.tv_sec && now.tv_sec > expire->tv_nsec)) {
          return TIMER_EXPIRED;

Bug Description:

On some cloud platforms we see issues with replication time skew. It's not clear if the time functions are failing or just returning the wrong value. We do not check the result code from any of the time functions.

Fix Description:

Add a wrapper function for clock_gettime() that will check for errors, log a message, and retry a few times. We retry because getting the time is so critical to things like replication and logging.

Also added a check for a large jump in CSN generation time, and we log a message if one is detected.

relates: https://pagure.io/389-ds-base/issue/51095

rebased onto 6002936e668e606542041cbb4e0cf609c7a3faf7

3 years ago

rebased onto 2fe3660d7ee296678c3cd1bb968814c4c31bce84

3 years ago

rebased onto b8fc4bcfa40a1ab8d7ae2e727dbed0fc2704873a

3 years ago

So there are some problems with this, I think. Most of the get-time functions are vDSOs, so they are able to cache and provide the time multiple times quickly in the caller, which is how they are "fast".

So I think that this isn't really checking that the time hasn't skewed.

As well, the monotonic clock is guaranteed to only advance; it's only realtime that may move forward and backward.

Really, the issue is in the CSN code: if the time does go backwards, it should be ignoring that and using the last timestamp, which I believe it already has support for, as it's a Lamport clock.

I think that this isn't really the solution to the problem, and we need to check that the CSN generator is correctly functioning as a Lamport clock.
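A minimal sketch of the Lamport-clock behaviour being described might look like this (illustrative only; csngen_effective_time is a made-up name, not the actual csngen code):

#include <time.h>

/* A Lamport-style clock never steps backwards: if the wall clock goes
 * backwards, keep using the last sampled time so generated CSNs stay
 * monotonic, and only adopt the wall clock when it moves forward. */
static time_t
csngen_effective_time(time_t cur_time, time_t sampled_time)
{
    if (cur_time <= sampled_time) {
        /* Wall clock stalled or went backwards: ignore it. */
        return sampled_time;
    }
    return cur_time;
}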

This isn't exactly fixing anything, it's providing more information. Really it's for debug purposes. We know that somehow the time is coming back corrupted on AWS. We need to log something when that happens. Second, I don't know if calling clock_gettime() multiple times will fix it, but we have to try something, because CSNs need the correct time or we end up in a f**ked situation and have to take drastic steps to fix it (see the "how to fix time skew" wiki doc). We just went through two major escalations a few weeks ago on this (which you were not aware of). This patch is to help narrow down the root cause. It will tell us if the bad "time" simply comes from a failed clock call, or from a "valid" returned value. Please don't dismiss this patch so quickly.

But it doesn't get us more information, because that's not how clocks work. If the problem is the clock jumping about, it's a problem in how the Lamport clock in CSN is working, and we should be looking there.

This patch won't have any effect on monotonic, because monotonic can not go backwards, and if it does, that's a glibc/kernel bug.

And because the clock is a vDSO, each thread can have a different time and a per-thread cache.

http://man7.org/linux/man-pages/man7/vdso.7.html

I'm saying that the patch provided will not give you the information you need to solve this, and if there is a problem in CSN, then we should harden that part of the code and inspect it to ensure that its Lamport clock is correct, instead of this.

Also, CLOCK_REALTIME moves about by design. So this is not a bug in gettime or anything else; it's pointing to an issue in CSN.

I think there are two aspects in that patch.
clock_gettime is a libc system call (even if it is not a real system call thanks to the vDSO). If it returns a failure, it is a good idea to log an error and to retry. Possibly the retry will fail as well, but at least we will have some logs that a failure happened. IMHO it looks important for diagnosing CSN time skew.

Then the detection of the time jump. I think we have an issue basing it on sampled_time. It is refreshed on demand, so if no CSN is generated locally for more than a day we will have a false positive. We could implement a dedicated thread that periodically (e.g. every 10s) updates it, for example via a CSN generation.
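A rough illustration of that idea, using plain pthreads and hypothetical names (the real server would use its own thread machinery, take the csngen lock, or simply generate a throwaway CSN):

#include <pthread.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical refresher: keep the generator's sampled time current so a
 * long-idle replica does not trip the "jump larger than a day" check.
 * It would be started once with pthread_create (or the server's thread API). */
static void *
csngen_refresh_thread(void *arg)
{
    volatile time_t *sampled_time = (volatile time_t *)arg;

    for (;;) {
        struct timespec now = {0};
        if (clock_gettime(CLOCK_REALTIME, &now) == 0) {
            *sampled_time = now.tv_sec;
        }
        sleep(10); /* refresh roughly every 10 seconds */
    }
    return NULL;
}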

We were seeing swings in years, not seconds. We do not know what is happening on these AWS systems, so we need more information. Is clock_gettime() returning a bad value, or is it simply failing and returning an error that we are not checking for? We don't know. So I thought maybe the time function was failing and the timespec struct, which was previously uninitialized, was leading to these huge swings in time (as garbage values were being read). This patch will at least tell us if the time function returns an error; previously we didn't check anything and were potentially returning garbage times to the caller.

As for the "time jumps", that's in the replication logging, and it will not be displayed in the default logs. It is just an informative message in the replication debug logs. It's not alarming. I think it's fine how it is, and it's a way to track the time when things go wrong so further investigation can be done on the system (check ntp logs, syslogs, etc.). For example, let's say clock_gettime() does not return an error, but the value is "bad". Well, we won't log anything because clock_gettime() did not return an error, and the only symptom is a massive jump in time. So we need to catch this scenario as well. Like I keep saying, this is a debug patch (maybe the first of many), and it is certainly better than doing nothing!

We have customers running into this "right now", and I want to get this into RHEL 7.9. If we debate this much longer it will miss RHEL 7.9, and it's the customer who will suffer due to our inaction.

Even if there are concerns that this does not fix the issue, I think it does no harm (small overhead by having one more function call; it could also be directly in current_time).
And we need to add some possibility to diagnose what goes on in AWS.

So I vote to include this patch

Shouldn't it be (delta > _SEC_PER_DAY) || (delta < (-1 * _SEC_PER_DAY))

rebased onto 0cb8eabf620a9657b1e7501807ace360612101f5

3 years ago

SLAPI_LOG_INFO? So that if a time skew is discovered later we will know for sure that this host was responsible for a huge jump.

Shouldn't it be (delta > _SEC_PER_DAY) || (delta < (-1 * _SEC_PER_DAY))

Fixed

SLAPI_LOG_INFO? So that if a time skew is discovered later we will know for sure that this host was responsible for a huge jump.

Well, if a replica is left idle for a day, which happens, then it will log this by default. I was hoping to reduce false alarms by keeping it under the replication logging. I hear what you are saying: if repl logging is not enabled, then we will miss this event.

Even if there are concerns that this does not fix the issue, I think it does no harm (small overhead by having one more function call; it could also be directly in current_time).
And we need to add some possibility to diagnose what goes on in AWS.
So I vote to include this patch

It's not that it does harm; it just doesn't do what you think. The clock result will be cached, so polling it rapidly only proves the cache works, not that the clock is moving backwards.

There is also no reason to check this on monotonic.

So no, I don't think we should include it, because this is not going to actually help highlight what you think it will highlight.

If anything, if you want this to be done correctly, you need to harden the checks on CSN about what happens if it gets a time that's lower than the previous time.

I'm really not comfortable with this patch being included.

@firstyear, thanks for this interesting discussion.

Googling, I have not found any cache around gettimeofday. Also, a loop calling gettimeofday or clock_gettime(REALTIME) always shows different usec/nsec. In my test the caching seems to have no impact.

If the cache returns the same value, I agree the loop is useless, but it is safe and it will log the failure of the syscall as a heads up.

I agree we need to improve the robustness/checks on CSN, but this ticket is just to raise a warning when those syscalls fail.

Even if there are concerns that this does not fix the issue, I think it does no harm (small overhead by having one more function call; it could also be directly in current_time).
And we need to add some possibility to diagnose what goes on in AWS.
So I vote to include this patch

It's not that it does harm; it just doesn't do what you think. The clock result will be cached, so polling it rapidly only proves the cache works, not that the clock is moving backwards.
There is also no reason to check this on monotonic.
So no, I don't think we should include it, because this is not going to actually help highlight what you think it will highlight.
If anything, if you want this to be done correctly, you need to harden the checks on CSN about what happens if it gets a time that's lower than the previous time.
I'm really not comfortable with this patch being included.

I'm still dumbfounded by your response to this patch. You really don't want us to log a message if clock_gettime() fails?!?! Seriously, why not? The function can fail, so we should check it. Sure the retry loop might be useless (maybe?), but that is not the point of this patch. This patch is to check if that function is part of the problem or not. Like I've said from the beginning this is a debug logging patch, not a fix (sure it takes a stab at maybe fixing it, but it's not the main purpose).

But are you saying that if clock_gettime() returns an error, retrying will also "always" result in an error? Are you so sure of that? If the function returns an error it will leave the timespec in an uninitialized state, but you are saying that it is better to not try clock_gettime() again to see if it succeeds? We do not retry clock_gettime() based on the time value it returns, but only if the function call itself returns an error. I am really having a hard time trying to understand why you have a problem with this patch :-( The points you keep making have nothing to do with what this patch is trying to do.

If it errors, then just bail; checking that result is fine, but don't loop on it. That's my point.

But also, if it's failing and the timespec is 0, or the clock is moving around, and that triggers an issue in CSN, then we should be putting the logging/checks there to harden CSN.

If you remove this line and the related count, then it's an ack from me.

If it errors, then just bail; checking that result is fine, but don't loop on it. That's my point.

Why not try it again if the function fails? Here's the thing: if we don't get the time when generating a CSN, then that CSN is corrupted. We can't just ignore it, as it potentially breaks replication, which requires a very difficult recovery process (see the wiki doc on fixing time skew). We should do all we can to prevent an invalid CSN from being created. So if the time function fails, the only option is to try it again in hopes that the function succeeds. Otherwise we basically need to abruptly stop the server, or halt replication, to prevent the replication state data from becoming corrupted. I don't see how that is a good option.

IMHO we should actually keep retrying forever until the time function succeeds - this of course assumes that the time function failing is the problem. We don't know if it is, but before this patch, if it did fail, then the uninitialized timespec struct was set to garbage values, which explains jumps that went forward and backward in time by hundreds of years. So the one thing this patch does is at least initialize the timespec struct, and if the call fails the CSN times are set to zero, but that will still break replication.

I still don't understand why you think that if the function call fails it will never recover. I'm not backing down off the retry part of this patch. Now, if we apply this patch and a customer shows that indeed the function was failing, and indeed retrying doesn't help, then I'll remove it, but I see absolutely no reason not to retry, because the potential benefits (not breaking replication!!!) far outweigh the cost of calling a system function a few extra times. I still can't believe we are arguing over this :-(

Maybe we need to take a vote here among the developers about this patch? @lkrispen, @tbordaz, @mhonek, @spichugi thoughts?

Guess what, I agree with @mreynolds and @firstyear :)

@mreynolds, I think your fix is helpful for diagnosis and is safe. Even if the loop is useless, it has a minimal and rare performance impact, so it is acceptable. If caching makes the loop useless, I would even suggest adding a short sleep between the retries just to make it more robust to transient failure.

@firstyear, I agree we have something to fix in CSN. At the moment I do not have a clear understanding of how things get broken, so I cannot identify what fix we need. However, I think we can improve the robustness of csngen_new_csn. If slapi_current_utc_time fails while generating a new CSN, we could make csngen_new_csn fail, which will abort the operation. Likely that means we need csngen_new_csn to call another function than slapi_current_utc_time to catch the error.
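Something along those lines could look like the sketch below; slapi_current_utc_time_checked is a hypothetical name, and csngen_new_csn would abort (so the operation would be rejected) whenever it returns non-zero:

#include <errno.h>
#include <time.h>

/* Hypothetical helper: propagate clock failures to the caller instead of
 * silently returning garbage, so csngen_new_csn (or its caller) can abort
 * the operation rather than generate a corrupted CSN. */
static int
slapi_current_utc_time_checked(time_t *out)
{
    struct timespec now = {0};

    if (clock_gettime(CLOCK_REALTIME, &now) != 0) {
        return errno ? errno : -1; /* caller decides whether to log/abort */
    }
    *out = now.tv_sec;
    return 0;
}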

Guess what, I agree with @mreynolds and @firstyear :)
@mreynolds, I think your fix is helpful for diagnosis and is safe. Even if the loop is useless, it has a minimal and rare performance impact, so it is acceptable. If caching makes the loop useless,

But if the function fails, would that "failure" be cached? This isn't about the value returned by the time function, but about the time function itself failing and setting no value. I do not see how caching would impact a failure result, since we don't know why it's failing (or if it's even failing).

When we saw this with customers, the "jump" seemed to be a one-time incident, not a constantly repeating one.

I would even suggest adding a short sleep between the retries just to make it more robust to transient failure.

We could do that...

@firstyear, I agree we have something to fix in CSN. At the moment I do not have a clear understanding of how things get broken, so I cannot identify what fix we need. However, I think we can improve the robustness of csngen_new_csn. If slapi_current_utc_time fails while generating a new CSN, we could make csngen_new_csn fail, which will abort the operation. Likely that means we need csngen_new_csn to call another function than slapi_current_utc_time to catch the error.

This seems like a big code change, and I'm not sure the code path exists to return this error all the way back up to the "modify" function where it can be aborted. It needs more investigation, and a separate ticket.

My 2c... for what this patch is supposed to do, it LGTM. From my point of view, if this may help to pinpoint the issue or show the issue to be unrelated, it is still worth including. This patch is not harmful but potentially helpful, so +1 from me. And the 5-retry count should stay; it might show useful data if anything.

Just curious, shouldn't timespec be initialized with {0,0}?

I agree with @mhonek; keeping this ticket focused on helping diagnose the issue looks good even if the loop is useless. (I would just add a small delay on retry.) I vote for pushing this patch upstream.

rebased onto c3fa64c0ff887a5239731771c89f0d3bf125ac65

3 years ago

I agree with @mhonek; keeping this ticket focused on helping diagnose the issue looks good even if the loop is useless. (I would just add a small delay on retry.) I vote for pushing this patch upstream.

I added 100ms sleep between retries...

Just curious, shouldn't timespec be initialized with {0,0}?

{0} is C99 syntax to zero the whole struct.
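For example, both of these zero the entire struct in C99 and later:

#include <time.h>

struct timespec a = {0};    /* first member explicitly 0, the rest implicitly 0 */
struct timespec b = {0, 0}; /* tv_sec = 0, tv_nsec = 0 */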

I added 100ms sleep between retries...

If the clock is failing, it's going to keep failing, sleeps or not. I don't like the while loop - if we get a failure, we should fail fast and return. Polling proves only that the vDSO cache works, and that if we get errors we'll keep getting errors.

I really don't want a while loop in this section of the code, especially when you consider that it's in the logging path, which is under a mutex that every thread contends on; this could add extra performance delays.

Even more so on cloud infra, the problem is really likely to be not that the clock is failing, but that the VM was teleported to a new host which has a subtly different UTC time, for example, and that could cause CLOCK_REALTIME to report forward/backward steps. Then you get Google, which time-smears, and that also has similar effects and more.

Thinking even more about it, your patch should not be limited to showing backward jumps of a day, but should show all backward jumps, and if what you are saying is real, it makes me anxious that there is an issue in the csngen that means it's not functioning as a true Lamport clock.

Also, I think the logic in csngen.c could have a 2038 bug, because if the clock ever goes to 0, the diff to cur_time will exceed a 32-bit int. So it should be:

if (cur_time <= gen->state.sampled_time) {
    // we've stepped backwards, log as such.
} else {
    // update time
}

Anyway, I will only accept this if there is no loop to poll the clock.

Actually, you can't log from within the clock gettime wrapper anyway; that's going to deadlock.

EDIT: because we call gettime from slapi_log_err, which holds a lock that's not recursive. So you can't log from within this; you have to not loop, return the error, and have the caller log that something went bang.
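In other words, something shaped like this (a sketch only; slapi_clock_gettime_quiet is a made-up name): the wrapper does no logging and no retry loop, and the caller, outside the logging path, decides what to log:

#include <errno.h>
#include <stdint.h>
#include <time.h>

/* Sketch: report failure without logging, since slapi_log_err itself
 * calls the time functions under a non-recursive lock. */
static int32_t
slapi_clock_gettime_quiet(clockid_t clock_id, struct timespec *tp)
{
    if (clock_gettime(clock_id, tp) != 0) {
        return errno ? errno : -1;
    }
    return 0;
}

/* Caller side (illustrative):
 *     struct timespec now = {0};
 *     if (slapi_clock_gettime_quiet(CLOCK_REALTIME, &now) != 0) {
 *         slapi_log_err(SLAPI_LOG_ERR, "caller", "clock_gettime failed\n");
 *     }
 */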

Actually, you can't log from within the clock gettime wrapper anyway; that's going to deadlock.
EDIT: because we call gettime from slapi_log_err, which holds a lock that's not recursive. So you can't log from within this; you have to not loop, return the error, and have the caller log that something went bang.

Damn it, you're right. I saw this before. That is a problem from the reporting side of things.

Well I started looking into what it will take to catch a clock error in the replication code and reject the operation (not ideal but really the only option to ensure replication does not get hosed). We could log the error there, but I need to see if I can get that working correctly and not break replication in the process.

I'm closing this PR as I'm working on an actual fix instead of a debug patch...

Pull-Request has been closed by mreynolds

3 years ago

389-ds-base is moving from Pagure to Github. This means that new issues and pull requests
will be accepted only in 389-ds-base's github repository.

This pull request has been cloned to Github as issue and is available here:
- https://github.com/389ds/389-ds-base/issues/4149

If you want to continue to work on the PR, please navigate to the github issue,
download the patch from the attachments and file a new pull request.

Thank you for understanding. We apologize for any inconvenience.

Pull-Request has been closed by spichugi

3 years ago