#663 Repeat actions on failure
Closed 4 years ago by frostyx. Opened 4 years ago by frostyx.
copr/ frostyx/copr action-repeat-on-failure  into  master

file modified
+1 -5
@@ -513,11 +513,7 @@ 

              if result.result == ActionResult.SUCCESS and \

                      not getattr(result, "job_ended_on", None):

                  result.job_ended_on = time.time()

- 

-             try:

-                 self.frontend_client.update({"actions": [result]})

-             except RequestException as e:

-                 self.log.exception(e)

+         return result

  

  

  # TODO: sync with ActionTypeEnum from common

@@ -7,7 +7,7 @@ 

  

  from backend.frontend import FrontendClient

  

- from ..actions import Action

+ from ..actions import Action, ActionResult

  from ..helpers import get_redis_logger

  

  
@@ -56,6 +56,12 @@ 

          self.log.info("Got new action_task %s of type %s", action_task['id'], action_task['action_type'])

          return Action(self.opts, action_task, frontend_client=self.frontend_client)

  

+     def update_action(self, result):

+         try:

+             self.frontend_client.update({"actions": [result]})

+         except RequestException as e:

+             self.log.exception(e)

+ 

      def run(self):

          """

          Executes action dispatching process.
@@ -66,7 +72,11 @@ 

          while True:

              action = self.load_action()

              try:

-                 action.run()

+                 result = action.run()

+                 if result.result == ActionResult.FAILURE:

+                     self.log.error("Action failed, but we want to repeat it: {}".format(action))

+                     continue

+                 self.update_action(result)

Well, this part makes it really far from obvious :-( the action.run() method itself is able to post FAILURE state to frontend (unless there's a traceback which prevents it from doing so). I'd much prefer to take care exceptions and failures withing action.run() directly, so from the outside it has absolutely clear semantics -- "call action.run()". Ideally, that method shouldn't even need to be wrapped by action.run().

he action.run() method itself is able to post FAILURE state to frontend

Well, it shouldn't be anymore. I've moved the code to the action dispatcher, in the first commit.

Sorry, I can't even read my previous commit now :-) this version looks better .. but it still feels like we should rather deal with this inside action.run().

Are you sure that ActionResult.FAILURE: implies that we want to repeat it...?

but it still feels like we should rather deal with this inside action.run().

I can put it back, no problem. I just felt that updating frontend shouldn't be part of the action.run(), but rather a separate method. The same way, we don't load action definition within action.run(), I wouldn't even update it. But mainly, my motivation was to handle FAILURE actions and exceptions from actions in the same place.

Are you sure that ActionResult.FAILURE: implies that we want to repeat it...?

I am not, actually. But what is the status quo? See the commit message for detailed description, ... but If an action fails because of an exception, it is repeated over and over again and blocks the queue until it is finished. if an action ends with ActionResult.FAILURE, we mark it as failed in the database and that's it. I am convinced, that we want to handle those two kinds of errors the same way. I don't particularly care how, but when someone says "An action failed", I wanted to be sure what happened and not ask "well and failed it with an exception or the action resulted in failure?" and have two possible scenarios how to deal with the issue.

Now, what should be the common behavior? Since we don't have any frontend UI for actions and don't indicate users, that their actions failed, I have modified the ActionResult.FAILURE actions to behave just like those what ended with traceback (i.e. repeat again and block the queue). If you prefer it having the other way around, which I in fact do, we would have to brainstorm how to show failed actions to a user. It may take some time to implement though.

              except Exception as e: # dirty

                  self.log.exception(str(e))

              msg = "Started new action {} of type {}"\

See #652

We currently deal with action errors in two different ways. Let's
say, that we want to delete a project and shutil.rmtree raises
OSError because for example #636, then action is not updated
on the frontend (i.e. its result is still WAITING), unhandled
exception bubbles out to the action dispatcher and then the same
action is fetched again and again and blocks the action queue
until it is successfully finished.

On the other hand, action can expectedly fail and result in
ActionResult.FAILURE, like for example in #652. Then the action
is finished with failure, thrown away and next action is popped
from the queue.

What is the difference between those two cases from the user
perspective? IMHO there is none and both should be handled the
same way. This change allows actions that resulted in FAILURE
to be processed again.

Well, this part makes it really far from obvious :-( the action.run() method itself is able to post FAILURE state to frontend (unless there's a traceback which prevents it from doing so). I'd much prefer to take care exceptions and failures withing action.run() directly, so from the outside it has absolutely clear semantics -- "call action.run()". Ideally, that method shouldn't even need to be wrapped by action.run().

he action.run() method itself is able to post FAILURE state to frontend

Well, it shouldn't be anymore. I've moved the code to the action dispatcher, in the first commit.

Sorry, I can't even read my previous commit now :-) this version looks better .. but it still feels like we should rather deal with this inside action.run().

Are you sure that ActionResult.FAILURE: implies that we want to repeat it...?

but it still feels like we should rather deal with this inside action.run().

I can put it back, no problem. I just felt that updating frontend shouldn't be part of the action.run(), but rather a separate method. The same way, we don't load action definition within action.run(), I wouldn't even update it. But mainly, my motivation was to handle FAILURE actions and exceptions from actions in the same place.

Are you sure that ActionResult.FAILURE: implies that we want to repeat it...?

I am not, actually. But what is the status quo? See the commit message for detailed description, ... but If an action fails because of an exception, it is repeated over and over again and blocks the queue until it is finished. if an action ends with ActionResult.FAILURE, we mark it as failed in the database and that's it. I am convinced, that we want to handle those two kinds of errors the same way. I don't particularly care how, but when someone says "An action failed", I wanted to be sure what happened and not ask "well and failed it with an exception or the action resulted in failure?" and have two possible scenarios how to deal with the issue.

Now, what should be the common behavior? Since we don't have any frontend UI for actions and don't indicate users, that their actions failed, I have modified the ActionResult.FAILURE actions to behave just like those what ended with traceback (i.e. repeat again and block the queue). If you prefer it having the other way around, which I in fact do, we would have to brainstorm how to show failed actions to a user. It may take some time to implement though.

But what is the status quo?

:/ not good, unfortunately, and I agree with your POV ..., though ...

If an action fails because of an exception, it is repeated over and over
again

... because nobody decided yet what to do with it ...

if an action ends with ActionResult.FAILURE, we mark it as failed in the
database and that's it.

... and in this case we explicitly decided so.

I am convinced, that we want to handle those two kinds of errors the
same way.

In case of unhandled exception of any kind (which doesn't explicitly say
neither that we want to FAILURE the task, nor that we want to re-try) I'm
rather tend to disagree with you and I think we should manually resolve
the problems.

Even though we don't have "transactional behavior", or something which
allows us to track that some actions did not happen - I feel that solving
this generally isn't possible and we should go the case-by-case (per
action) way:

  • make sure that the action (e.g. gpg key gen) returns FAILURE only if
    we really don't want to give it another try, and otherwise continue
    with retry...
  • make sure that the action doesn't raise exceptions, but if it does ..
    take it really seriously, and ...
  • report copr administrators that there are exceptions which needs
    manual intervention (e.g. new state Action.NEEDSACTION).

This way we would be sure that we do not loose any action... For example,
as long as there's no unresolved action (NEEDSACTION flag), the repository
is consistent with FE DB (aka no more undeleted builds on backend, but
deleted from database).

I don't particularly care how, but when someone says "An action failed",

Agreed, would Action.NEEDSACTION (or similar) help with this?

I wanted to be sure what happened and not ask "well and failed it with
an exception or the action resulted in failure?"

Agreed, this really needs to be solved!

Metadata Update from @frostyx:
- Pull-request tagged with: needs-work

4 years ago

I've created an RFE #1108 and closing this PR.

Pull-Request has been closed by frostyx

4 years ago