#57 Implement a strict state machine for NuncStans to validate the library
Closed: Fixed None Opened 7 years ago by firstyear.

It is critical for Nunc-Stans to maintain consistent, and correct state of jobs. Without this, many unknown and complex side effects can occur in an async event library such as this which would be nearly impossible to debug.

To assert the correctness of this we should implement nunc-stans jobs to follow a state machine. When a job is created it is "WAITING" further instruction and setup. When the job is set to rearm, or it is created as an ns_add_*job, it is moved to ARMED. While ARMED tasks such as setting data, or modifying the job should not be possible. When the job is triggered, it is moved to the WAITING state for re-arming, or in the case of persistent jobs, it remains ARMED.

When a job is deleted it moves from WAITING to NEEDS_DELETE. The event api then collects this and removes the job, moving it in the interim to DELETED, then freeing it.

A job that is ARMED cannot be set to NEEDS_DELETE, because if the event api frees the job while a call back in processing, horrible race conditions can occur. Additionally, it may cause the valid states of the callback to be violated. A job must be WAITING to be set to NEEDS_DELETE.

This however does not hold during server shutdown. During shutdown all jobs regardless of state are moved to NEEDS_DELETE, then removed. In this case, the event_queue is not processing which means the race conditions cannot occur.

The enforcement of this state machine is enabled with -DDEBUG due to the use of PR_ASSERT. It's extraordinarily valuable to run development applications with this to find faults in their usage of nunc-stans.

Additionally, due to this, this discovered a number of logic, heap use after free, and double free issues. These are fixed with this ticket.


There is another param -DDEBUG_FSM that gives extra logging of the state transitions and decisions.

This builds and runs correctly with Directory Server master, under ldclt also.

Don't need to re-run the ds reliab test, they don't use this code. Passes my echo server test.

How do you disarm a persistent job?

At this point, the job is not threaded?
{{{

466     } else if (NS_JOB_IS_PERSIST(job->job_type)) { 
467 #ifdef DEBUG_FSM 
468         ns_log(LOG_DEBUG, "PERSIST, leaving as NS_JOB_ARMED\n"); 
469 #endif 
470         job->func(job);

}}}

What if the job is ARMED here?
{{{

715 862 ns_job_type_t
716 863 ns_job_get_output_type(ns_job_t *job)
717 864 {
865 PR_ASSERT(job);
866 PR_ASSERT(job->state == NS_JOB_WAITING);
718 867 return job->output_job_type;
}}}

Replying to [comment:7 rmeggins]:

How do you disarm a persistent job?

At this point, the job is not threaded?
{{{

466 } else if (NS_JOB_IS_PERSIST(job->job_type)) {
467 #ifdef DEBUG_FSM
468 ns_log(LOG_DEBUG, "PERSIST, leaving as NS_JOB_ARMED\n");
469 #endif
470 job->func(job);
}}}

If the job is not a threaded one but persistent we can probably disarm it here.

However, other persistent, but threaded jobs we can not know if they are currently being executed in another thread or not. Thus due to the lack of locks / sync mechanisms we cannot change the job, because we just don't know if that cpu will ever see the change or if it will ever become consistent.

What if the job is ARMED here?
{{{

715 862 ns_job_type_t
716 863 ns_job_get_output_type(ns_job_t *job)
717 864 {
865 PR_ASSERT(job);
866 PR_ASSERT(job->state == NS_JOB_WAITING);
718 867 return job->output_job_type;
}}}

If the job is ARMED, it may be in the event_q, or the event framework, or anywhere else. We cannot guarantee that the job->output_job_type synchronised across the cores. If the worker that is handling the job is the same thread as the one trying to read the output type this is probably okay, so I think I can change this assertion.

I just meant, in general, how do you disarm a persistent job? For example, if you want to modify a persistent job - you have to somehow disarm it first (e.g. in the job callback)?

persistent jobs are always armed? That is, do persistent jobs always have job->state == NS_JOB_ARMED? Or is it implicit that if a job is marked as PERSISTENT it is armed?

My question about output type is related - if in the job callback for a persistent job, can you call ns_job_get_output_type()? The job is PERSISTENT, so it isn't really WAITING, it is ARMED (explicitly or implicitly).

I think I have solved how to do this.

{{{
466 } else if (NS_JOB_IS_PERSIST(job->job_type)) {
467 #ifdef DEBUG_FSM
468 ns_log(LOG_DEBUG, "PERSIST, leaving as NS_JOB_ARMED\n");
469 #endif
470 job->func(job);
471 } else {
472 #ifdef DEBUG_FSM
473 ns_log(LOG_DEBUG, "moving to NS_JOB_WAITING\n", job->state);
474 #endif
475 (void)PR_AtomicSet(&job->state, NS_JOB_WAITING);
476 job->func(job);
477 }
}}}

I think that if we changed this to;

{{{

466 } else {
467 #ifdef DEBUG_FSM
468 ns_log(LOG_DEBUG, "moving to NS_JOB_WAITING\n", job->state);
469 #endif
470 (void)PR_AtomicSet(&job->state, NS_JOB_WAITING);
471 job->func(job);
472
473 / If the job was marked done, this will be NEED_DELETE, so won't trigger. /
474 if (job->state == NS_JOB_WAITING && NS_JOB_IS_PERSIST(job->job_type)) {
475 (void)PR_AtomicSet(&job->state, NS_JOB_ARMED);
476 }
477
478 }

}}}

Something like that. This way, during the execution of the callback, the job is "waiting", but we automatically move to ARMED again afterwards if it's persist. This way the other functions you mention all work, we can make the job done, and we still get the persistent behavior.

What do you think? We would need to duplicate something like this for threaded persist jobs, but there are some race conditions involved with that ....

If you are happy with that, do you mind giving an ack, then I'll add another patch for the persist fix up?

I would really like for someone else to review this and understand what's going on.

OTOH, nhosoi has already given her endorsement.

ACK

commit 8826e03
commit 2934871
Compressing objects: 100% (25/25), done.
Writing objects: 100% (27/27), 50.14 KiB | 0 bytes/s, done.
Total 27 (delta 20), reused 0 (delta 0)
To ssh://git.fedorahosted.org/git/nunc-stans.git
7ef7ec1..7a4ce78 master -> master

Leaving open to resolve the ns_job_done on a persist non-threaded job.

commit 6e0a496
Writing objects: 100% (17/17), 8.21 KiB | 0 bytes/s, done.
Total 17 (delta 12), reused 0 (delta 0)
To ssh://git.fedorahosted.org/git/nunc-stans.git
7a4ce78..a33c41f master -> master

Login to comment on this ticket.

Metadata