It is critical for Nunc-Stans to maintain consistent, and correct state of jobs. Without this, many unknown and complex side effects can occur in an async event library such as this which would be nearly impossible to debug.
To assert the correctness of this we should implement nunc-stans jobs to follow a state machine. When a job is created it is "WAITING" further instruction and setup. When the job is set to rearm, or it is created as an ns_add_*job, it is moved to ARMED. While ARMED tasks such as setting data, or modifying the job should not be possible. When the job is triggered, it is moved to the WAITING state for re-arming, or in the case of persistent jobs, it remains ARMED.
When a job is deleted it moves from WAITING to NEEDS_DELETE. The event api then collects this and removes the job, moving it in the interim to DELETED, then freeing it.
A job that is ARMED cannot be set to NEEDS_DELETE, because if the event api frees the job while a call back in processing, horrible race conditions can occur. Additionally, it may cause the valid states of the callback to be violated. A job must be WAITING to be set to NEEDS_DELETE.
This however does not hold during server shutdown. During shutdown all jobs regardless of state are moved to NEEDS_DELETE, then removed. In this case, the event_queue is not processing which means the race conditions cannot occur.
The enforcement of this state machine is enabled with -DDEBUG due to the use of PR_ASSERT. It's extraordinarily valuable to run development applications with this to find faults in their usage of nunc-stans.
Additionally, due to this, this discovered a number of logic, heap use after free, and double free issues. These are fixed with this ticket.
attachment 0002-Ticket-57-Update-the-configure-and-autotools-files.patch
There is another param -DDEBUG_FSM that gives extra logging of the state transitions and decisions.
May not apply without #51, #54, #56
attachment 0001-Ticket-57-Implement-a-strict-state-machine-for-nunc-.patch
This builds and runs correctly with Directory Server master, under ldclt also.
Fix for server shutdown job done 0001-Ticket-57-Implement-a-strict-state-machine-for-nunc-.2.patch
Don't need to re-run the ds reliab test, they don't use this code. Passes my echo server test.
Looks good to me.
How do you disarm a persistent job?
At this point, the job is not threaded? {{{
466 } else if (NS_JOB_IS_PERSIST(job->job_type)) { 467 #ifdef DEBUG_FSM 468 ns_log(LOG_DEBUG, "PERSIST, leaving as NS_JOB_ARMED\n"); 469 #endif 470 job->func(job);
}}}
What if the job is ARMED here? {{{
715 862 ns_job_type_t 716 863 ns_job_get_output_type(ns_job_t *job) 717 864 { 865 PR_ASSERT(job); 866 PR_ASSERT(job->state == NS_JOB_WAITING); 718 867 return job->output_job_type; }}}
Replying to [comment:7 rmeggins]:
How do you disarm a persistent job? At this point, the job is not threaded? {{{ 466 } else if (NS_JOB_IS_PERSIST(job->job_type)) { 467 #ifdef DEBUG_FSM 468 ns_log(LOG_DEBUG, "PERSIST, leaving as NS_JOB_ARMED\n"); 469 #endif 470 job->func(job); }}}
466 } else if (NS_JOB_IS_PERSIST(job->job_type)) { 467 #ifdef DEBUG_FSM 468 ns_log(LOG_DEBUG, "PERSIST, leaving as NS_JOB_ARMED\n"); 469 #endif 470 job->func(job); }}}
If the job is not a threaded one but persistent we can probably disarm it here.
However, other persistent, but threaded jobs we can not know if they are currently being executed in another thread or not. Thus due to the lack of locks / sync mechanisms we cannot change the job, because we just don't know if that cpu will ever see the change or if it will ever become consistent.
What if the job is ARMED here? {{{ 715 862 ns_job_type_t 716 863 ns_job_get_output_type(ns_job_t *job) 717 864 { 865 PR_ASSERT(job); 866 PR_ASSERT(job->state == NS_JOB_WAITING); 718 867 return job->output_job_type; }}}
If the job is ARMED, it may be in the event_q, or the event framework, or anywhere else. We cannot guarantee that the job->output_job_type synchronised across the cores. If the worker that is handling the job is the same thread as the one trying to read the output type this is probably okay, so I think I can change this assertion.
I just meant, in general, how do you disarm a persistent job? For example, if you want to modify a persistent job - you have to somehow disarm it first (e.g. in the job callback)?
persistent jobs are always armed? That is, do persistent jobs always have job->state == NS_JOB_ARMED? Or is it implicit that if a job is marked as PERSISTENT it is armed?
My question about output type is related - if in the job callback for a persistent job, can you call ns_job_get_output_type()? The job is PERSISTENT, so it isn't really WAITING, it is ARMED (explicitly or implicitly).
I think I have solved how to do this.
{{{ 466 } else if (NS_JOB_IS_PERSIST(job->job_type)) { 467 #ifdef DEBUG_FSM 468 ns_log(LOG_DEBUG, "PERSIST, leaving as NS_JOB_ARMED\n"); 469 #endif 470 job->func(job); 471 } else { 472 #ifdef DEBUG_FSM 473 ns_log(LOG_DEBUG, "moving to NS_JOB_WAITING\n", job->state); 474 #endif 475 (void)PR_AtomicSet(&job->state, NS_JOB_WAITING); 476 job->func(job); 477 } }}}
I think that if we changed this to;
{{{
466 } else { 467 #ifdef DEBUG_FSM 468 ns_log(LOG_DEBUG, "moving to NS_JOB_WAITING\n", job->state); 469 #endif 470 (void)PR_AtomicSet(&job->state, NS_JOB_WAITING); 471 job->func(job); 472 473 / If the job was marked done, this will be NEED_DELETE, so won't trigger. / 474 if (job->state == NS_JOB_WAITING && NS_JOB_IS_PERSIST(job->job_type)) { 475 (void)PR_AtomicSet(&job->state, NS_JOB_ARMED); 476 } 477 478 }
Something like that. This way, during the execution of the callback, the job is "waiting", but we automatically move to ARMED again afterwards if it's persist. This way the other functions you mention all work, we can make the job done, and we still get the persistent behavior.
What do you think? We would need to duplicate something like this for threaded persist jobs, but there are some race conditions involved with that ....
If you are happy with that, do you mind giving an ack, then I'll add another patch for the persist fix up?
I would really like for someone else to review this and understand what's going on.
OTOH, nhosoi has already given her endorsement.
ACK
commit 8826e03 commit 2934871 Compressing objects: 100% (25/25), done. Writing objects: 100% (27/27), 50.14 KiB | 0 bytes/s, done. Total 27 (delta 20), reused 0 (delta 0) To ssh://git.fedorahosted.org/git/nunc-stans.git 7ef7ec1..7a4ce78 master -> master
Leaving open to resolve the ns_job_done on a persist non-threaded job.
attachment 0001-Ticket-57-Ability-to-disarm-a-persistent-job-from-wi.patch
commit 6e0a496 Writing objects: 100% (17/17), 8.21 KiB | 0 bytes/s, done. Total 17 (delta 12), reused 0 (delta 0) To ssh://git.fedorahosted.org/git/nunc-stans.git 7a4ce78..a33c41f master -> master
Login to comment on this ticket.