#8684 docs.stg.fp.o doesn't build, openshift cronjob confused
Closed: Fixed 4 years ago by asamalik. Opened 4 years ago by asamalik.

Describe what you would like us to do:

The staging fedora docs site is not rebuilding. There is the following error:

"Cannot determine if job needs to be started: Too many missed start time (> 100). Set or decrease .spec.startingDeadlineSeconds or check clock skew."

https://console.app.os.stg.fedoraproject.org/k8s/ns/docsbuilding/cronjobs/build/events

I believe that deleting this job:

https://console.app.os.stg.fedoraproject.org/k8s/ns/docsbuilding/jobs/build-1573408800

... would fix the problem, but I don't have permissions for that.
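For context on the error quoted above: as far as the Kubernetes CronJob documentation describes it, the controller counts how many scheduled start times were missed since the last scheduled run and refuses to start a job once that count exceeds 100. Setting `.spec.startingDeadlineSeconds` shrinks the window it looks back over. The sketch below is only an approximation of that rule, not the controller's code. The hourly interval and the five-day gap are assumptions made for illustration; the ticket does not state the real schedule.

```python
from datetime import datetime, timedelta, timezone

MAX_MISSED = 100  # threshold quoted in the error message


def missed_starts(last_schedule, now, interval, starting_deadline_seconds=None):
    """Approximate how many start times fall in the controller's look-back window."""
    window_start = last_schedule
    if starting_deadline_seconds is not None:
        # With a deadline set, only the most recent N seconds are considered.
        deadline_start = now - timedelta(seconds=starting_deadline_seconds)
        window_start = max(window_start, deadline_start)
    return max(0, (now - window_start) // interval)


# Timestamp taken from the job name build-1573408800; everything else is assumed.
last = datetime.fromtimestamp(1573408800, tz=timezone.utc)
now = last + timedelta(days=5)   # assumed: five days without a successful schedule
hourly = timedelta(hours=1)      # assumed schedule interval

without_deadline = missed_starts(last, now, hourly)
with_deadline = missed_starts(last, now, hourly, starting_deadline_seconds=7200)

print(without_deadline, without_deadline > MAX_MISSED)  # 120 True
print(with_deadline, with_deadline > MAX_MISSED)        # 2 False
```

Under these assumptions, a little over four days of missed hourly runs is enough to trip the limit, while a two-hour deadline keeps the count far below it.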


When do you need this to be done by? (YYYY/MM/DD)

Of course ASAP :) but I don't have any hard deadline



You can actually delete an object, but you have to use Ansible for that. You can also make the Ansible task run only when a tag is specified.

You can find an example here --> https://infrastructure.fedoraproject.org/cgit/ansible.git/tree/playbooks/openshift-apps/coreos-koji-tagger.yml#n65

You can adapt the object type in that example to delete the cronjob, or just reuse it as is to delete the whole project and then recreate it by running the playbook again.

Would that work for you?

Metadata Update from @cverna:
- Issue priority set to: Waiting on External (was: Needs Review)

4 years ago

Metadata Update from @smooge:
- Issue priority set to: Waiting on Assignee (was: Waiting on External)

4 years ago

Metadata Update from @cverna:
- Issue priority set to: Waiting on External (was: Waiting on Assignee)
- Issue tagged with: OpenShift

4 years ago

@asamalik @jibecfed could you make the changes in Ansible and reopen if that is not enough?

Metadata Update from @cverna:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

4 years ago

Clément, thanks for your answer.
It would be greatly appreciated if you could solve it.
You'll probably save Adam some pain.

I personally have no knowledge in Ansible and did not understand your solution. Maybe he didn't catch it either?

The issue isn't fixed, our users still are impacted. Users are community members testing the new i18n system.
Please reopen.

Ok this was done in https://infrastructure.fedoraproject.org/cgit/ansible.git/commit/?id=a29d21317fc211ef93a7a9a71e3be84530d72f8e

You can now delete the cronjob by running

sudo rbac-playbook -l os_masters_stg[0] -t delete openshift-apps/docsbuilding.yml

for staging or

sudo rbac-playbook -l os_masters[0] -t delete openshift-apps/docsbuilding.yml

for production.

To redeploy the cronjob you can run the playbook

sudo rbac-playbook openshift-apps/docsbuilding.yml

I have also opened 2 PRs to update the Fedora version of the base image (moving to F31):
- https://pagure.io/fedora-docs/docs-fp-o/pull-request/131
- https://pagure.io/fedora-docs/docs-fp-o/pull-request/130

Hope that helps.

Thanks for the PRs, I'll look at those asap, but not today.

It looks like the cronjob has been deleted instead of the job — so the site is not building. I don't have time to look into this today, but I believe putting the cronjob back, and making sure there is no job stuck as it was before will fix it.

Sorry, I don't know how or where to run the 'rbac-playbook' utility, and have no time to learn this at this very moment.

Metadata Update from @asamalik:
- Issue status updated to: Open (was: Closed)

4 years ago

> Thanks for the PRs, I'll look at those asap, but not today.
>
> It looks like the cronjob has been deleted instead of the job, so the site is not building. I don't have time to look into this today, but I believe putting the cronjob back, and making sure there is no job stuck as it was before will fix it.

I'll investigate. My understanding is that a Job is created by the cronjob every time it runs.

> Sorry, I don't know how or where to run the 'rbac-playbook' utility, and have no time to learn this at this very moment.

A bit more info available here (https://fedora-infra-docs.readthedocs.io/en/latest/sysadmin-guide/sops/ansible.html).

If you want something more self-service (i.e. not going through Ansible), maybe communishift would be a better fit for this project.

OK, so the cronjob is back in place and a job was triggered:

oc -n docsbuilding get jobs
NAME              DESIRED   SUCCESSFUL   AGE
cron-1583319600   1         0            24m

A pod is running and building the docs:

oc -n docsbuilding get pods
NAME                     READY     STATUS      RESTARTS   AGE
builder-build-35-build   0/1       Completed   0          8d
builder-build-36-build   0/1       Completed   0          3h
builder-build-37-build   0/1       Completed   0          59m
cron-1583319600-vrhvh    1/1       Running     0          25m

I am not sure how long it takes to build the docs. It also seems that there are a few errors in the logs. I can't do much here to help investigate.

Let me know if you need anything else or if we can close this.
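As a side note on the job names seen in this ticket (build-1573408800, cron-1583319600): the numeric suffix looks like the scheduled run time in Unix seconds, which matches how CronJob-created Jobs were named in Kubernetes versions of that era. Treat that naming scheme as an assumption, since the ticket does not confirm it. Under that assumption, a small sketch like this decodes when each job was scheduled; `scheduled_time` is a hypothetical helper, not part of any tool mentioned here.

```python
from datetime import datetime, timezone


def scheduled_time(job_name):
    """Parse the trailing Unix timestamp from a name like 'build-1573408800'."""
    _, _, suffix = job_name.rpartition("-")
    return datetime.fromtimestamp(int(suffix), tz=timezone.utc)


for name in ("build-1573408800", "cron-1583319600"):
    print(name, scheduled_time(name).isoformat())
# build-1573408800 2019-11-10T18:00:00+00:00
# cron-1583319600 2020-03-04T11:00:00+00:00
```

If the assumption holds, the job the reporter suspected of being stuck dates from November 2019, months before the new cron-1583319600 job was triggered.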

Thanks!

The build itself is running, so we should be good.

(It takes some time, but I can take it from there.)

Cheers!

Metadata Update from @asamalik:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

4 years ago
