#11264 noc.yml playbook fails for noc02.fedoraproject.org on ibiblio-hosts-ipv6.cfg template
Closed: Fixed a year ago by praiskup. Opened a year ago by praiskup.

TASK [nagios_server : Build out nagios host templates (production)] ************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************
Monday 24 April 2023  06:22:20 +0000 (0:06:03.868)       0:25:52.000 **********
Monday 24 April 2023  06:22:20 +0000 (0:06:03.868)       0:25:52.000 **********
skipping: [noc01.iad2.fedoraproject.org] => (item=iad2-external.cfg)      
skipping: [noc01.iad2.fedoraproject.org] => (item=ibiblio-hosts-ipv6.cfg)                                                                                                                                                                                                                                                                                                                                                                                                                   
ok: [noc02.fedoraproject.org] => (item=iad2-external.cfg)                      
An exception occurred during task execution. To see the full traceback, use -vvv. The error was: ansible.errors.AnsibleUndefinedVariable: 'ansible.vars.hostvars.HostVarsVars object' has no attribute 'ansible_hostname'                                                                                                                                                                                                                                                                   
failed: [noc02.fedoraproject.org] (item=ibiblio-hosts-ipv6.cfg) => {"ansible_loop_var": "item", "changed": false, "item": "ibiblio-hosts-ipv6.cfg", "msg": "AnsibleUndefinedVariable: 'ansible.vars.hostvars.HostVarsVars object' has no attribute 'ansible_hostname'"}

TASK [nagios_server : Build out nagios hostgroup templates (iad2)] *************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************
Monday 24 April 2023  06:22:49 +0000 (0:00:29.215)       0:26:21.216 **********
Monday 24 April 2023  06:22:49 +0000 (0:00:29.215)       0:26:21.215 **********
...

Seems like facts are not gathered for that host (which seems OK, because we tend to run against noc?). But how could that ever have worked?

I'm getting the same issue trying to run the torrent.yml playbook.

Metadata Update from @zlopez:
- Issue priority set to: Waiting on Assignee (was: Needs Review)
- Issue tagged with: Needs investigation

a year ago

The torrent playbook is complaining about the ansible_python variable, which is not the same one, but it seems that something is not getting populated for the VM playbooks. I'm trying to find the root cause, but no luck so far.

Usually this error masks a different one: a template expects a set of variables, but a specific host does not have one of them set. Finding it generally requires going through the template and/or running the playbook with a significant amount of verbosity to see which host it died on. My apologies for this, as my nagios jinja templates have caused these regularly over the years.

I am not sure what is going on with the torrent.yml one.

After the investigation it seems that the issue was with /root/.ansible_facts_cache/ibiblio05.fedoraproject.org which was just an empty file. @smooge fixed that with

rm /root/.ansible_facts_cache/ibiblio05.fedoraproject.org ; ansible ibiblio05.fedoraproject.org -m ansible.builtin.setup

This command will remove the empty file and create a new one by gathering facts from ibiblio05.

Metadata Update from @zlopez:
- Issue close_status updated to: Fixed with Explanation
- Issue status updated to: Closed (was: Open)

a year ago

Can this be the same problem for ibiblio01.fedoraproject.org?

Metadata Update from @praiskup:
- Issue status updated to: Open (was: Closed)

a year ago

I mean this:
https://pagure.io/fedora-infra/ansible/blob/dc0b0f1d7eaa7ed45d7c60a5fd04766e894aa44a/f/inventory/host_vars/noc02.fedoraproject.org#_52

The error looks a bit different now (not sure what has changed the behavior). But the problem appears the same (facts not gathered):

Now it fails here:

TASK [get vm list] *************************************************************************************************************************************************************************************************************************
Tuesday 25 April 2023  13:20:07 +0000 (0:00:00.106)       0:00:00.278 ********* 
Tuesday 25 April 2023  13:20:07 +0000 (0:00:00.106)       0:00:00.278 ********* 
fatal: [noc02.fedoraproject.org -> ibiblio01.fedoraproject.org]: FAILED! => {"msg": "'ansible.vars.hostvars.HostVarsVars object' has no attribute 'ansible_python'"}

Yes something failed in the cache gathering and so the following hosts only had 'minimal' settings:

-rw-------. 1 root root     71 Apr 23 04:19 autosign01.stg.iad2.fedoraproject.org
-rw-------. 1 root root     58 Apr 25 13:01 ibiblio01.fedoraproject.org
-rw-------. 1 root root     59 Apr 23 04:20 kernel01.iad2.fedoraproject.org
-rw-------. 1 root root     58 Apr 24 15:32 os-control01.stg.iad2.fedoraproject.org
-rw-------. 1 root root     71 Apr 25 05:02 vmhost-x86-07.iad2.fedoraproject.org
-rw-------. 1 root root     58 Apr 25 05:02 vmhost-x86-08.iad2.fedoraproject.org

autosign and kernel01 are always like that due to limited logins, but the other systems should have more.
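
A rough sketch of how such empty or 'minimal' cache entries could be spotted automatically instead of by eyeballing file sizes. It assumes the cache holds one JSON file per host, named after the host (which is what the listing above and the fix with ansible.builtin.setup suggest, but is not confirmed here). The function name and the list of required facts are made up for illustration.

```python
import json
from pathlib import Path

# Assumption: these are the facts the failing playbooks needed in this ticket.
REQUIRED_FACTS = ("ansible_hostname", "ansible_python")


def find_incomplete_caches(cache_dir, required=REQUIRED_FACTS):
    """Return {hostname: reason} for cache files that look unusable.

    Assumes one JSON file per host, named after the host.
    """
    problems = {}
    for path in sorted(Path(cache_dir).iterdir()):
        if not path.is_file():
            continue
        raw = path.read_text(encoding="utf-8")
        if not raw.strip():
            # Same situation as the empty ibiblio05 file earlier in the ticket.
            problems[path.name] = "empty file"
            continue
        try:
            facts = json.loads(raw)
        except json.JSONDecodeError:
            problems[path.name] = "not valid JSON"
            continue
        if not isinstance(facts, dict):
            problems[path.name] = "not a JSON object"
            continue
        missing = [key for key in required if key not in facts]
        if missing:
            problems[path.name] = "missing: " + ", ".join(missing)
    return problems
```

Calling `find_incomplete_caches("/root/.ansible_facts_cache")` would return a dict of suspicious hosts. Hosts that are expected to stay minimal (autosign and kernel01 here) would still show up and need to be ignored by the caller.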

Thanks, it helped (at least partly) because I was able to start the playbook now. Currently being run...

I have cleared out and redone the cache for those systems.

Thank you, closing then!

Metadata Update from @praiskup:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

a year ago

Hm, sorry for reopening yet again, but we are back to the previous error message:

TASK [nagios_server : Build out nagios host templates (production)] ************************************************************************************************************************************************************************
Tuesday 25 April 2023  14:21:56 +0000 (0:04:50.018)       0:22:23.351 ********* 
Tuesday 25 April 2023  14:21:56 +0000 (0:04:50.018)       0:22:23.351 ********* 
skipping: [noc01.iad2.fedoraproject.org] => (item=iad2-external.cfg) 
skipping: [noc01.iad2.fedoraproject.org] => (item=ibiblio-hosts-ipv6.cfg) 
ok: [noc02.fedoraproject.org] => (item=iad2-external.cfg)
An exception occurred during task execution. To see the full traceback, use -vvv. The error was: ansible.errors.AnsibleUndefinedVariable: 'ansible.vars.hostvars.HostVarsVars object' has no attribute 'ansible_hostname'
failed: [noc02.fedoraproject.org] (item=ibiblio-hosts-ipv6.cfg) => {"ansible_loop_var": "item", "changed": false, "item": "ibiblio-hosts-ipv6.cfg", "msg": "AnsibleUndefinedVariable: 'ansible.vars.hostvars.HostVarsVars object' has no attribute 'ansible_hostname'"}

Metadata Update from @praiskup:
- Issue status updated to: Open (was: Closed)

a year ago

I think it was that you ran it while I was still trying to fix things. Each of the hosts has that variable defined:

download-ib01.fedoraproject.org:    "ansible_hostname": "download-ib01",
ibiblio01.fedoraproject.org:    "ansible_hostname": "ibiblio01",
ibiblio05.fedoraproject.org:    "ansible_hostname": "ibiblio05",
noc02.fedoraproject.org:    "ansible_hostname": "noc02",
ns02.fedoraproject.org:    "ansible_hostname": "ns02",
people02.fedoraproject.org:    "ansible_hostname": "people02",
proxy04.fedoraproject.org:    "ansible_hostname": "proxy04",
proxy12.fedoraproject.org:    "ansible_hostname": "proxy12",
smtp-mm-ib01.fedoraproject.org:    "ansible_hostname": "smtp-mm-ib01",
torrent02.fedoraproject.org:    "ansible_hostname": "torrent02",
unbound-ib01.fedoraproject.org:    "ansible_hostname": "unbound-ib01",

I will run by hand and see if I get it. Please do not run anything until I give the all-clear in the ticket.

ACK, thanks!

The torrent playbook I was trying to run worked now, fwiw.

I experimented with the PR 1394, and later reverted. The problematic hostname is smtp-mm-ib01.fedoraproject.org

define host {
   use                     mincheck6
   host_name               smtp-mm-ib01.fedoraproject.org
   address                 smtp-mm-ib01.fedoraproject.org
   parents                 ibiblio05.fedoraproject.org
}

That's the only one where we don't have the ansible_hostname defined for some reason.

I don't have root access on Batcave to do:

hostname=smtp-mm-ib01.fedoraproject.org
rm "/root/.ansible_facts_cache/$hostname"
ansible "$hostname" -m ansible.builtin.setup

But this seems like a brittle concept. Don't we want to always gather the facts instead (for every single playbook run)?

The problem with PR 1394 is that it created duplicate sets of host specifications, so
we could also rework the ipv6 template entirely (or even drop it).

OK, that helped, but I still had to fix at least one host_name reference. So there's a new locality to observe its status. Hopefully, there are no more problems like that :-( Let me know if I caused more harm than good.

I was able to run noc.yml successfully now.

Metadata Update from @praiskup:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

a year ago

The reason we don't collect facts every time is that it was slowing down a lot of runs of large playbooks. When you run a set of playbooks over a hundred hosts or more, some of those hosts take multiple minutes to report back their facts if they are busy with other tasks. That was making some playbook runs take hours when they should have been 30 minutes or more.

The issue is that the variable ansible_hostname is used in pretty much every template, but only that one and the torrent template seem to be affected (currently). We need to find a better way to deal with this across the other templates to avoid future problems.

I tried to check ansible_hostname in the nagios templates, and this was the remaining place where we used it without checking if it is defined first (if I read the templates correctly). Anyway, hope the fix is OK / helpful - some further review would make sense I think. :-) Cheers!
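
To make the "check if it is defined first" idea concrete, here is a minimal sketch of the lookup logic in Python. The actual fix lives in the nagios Jinja templates and is not shown in this ticket; in a template the same effect is usually achieved with an `is defined` test or the `default` filter. The function name and the fallback to the first DNS label are assumptions for illustration only.

```python
def short_hostname(hostvars, fqdn):
    """Return the cached ansible_hostname, or fall back to the first DNS label.

    `hostvars` is assumed to be a plain mapping of FQDN -> facts dict.
    The fallback avoids the undefined-variable failure when a host has
    an empty or minimal facts cache.
    """
    facts = hostvars.get(fqdn) or {}
    return facts.get("ansible_hostname") or fqdn.split(".", 1)[0]


hostvars = {
    "noc02.fedoraproject.org": {"ansible_hostname": "noc02"},
    "smtp-mm-ib01.fedoraproject.org": {},  # facts missing, as in this ticket
}
for host in hostvars:
    print(host, "->", short_hostname(hostvars, host))
```

A fallback like this hides a stale cache rather than fixing it, so whether to fail loudly or degrade quietly is a judgment call per template.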

Yeah, I am puzzled why just those hosts... we do have an Ansible plugin we made a long time ago that forces caching of ansible_python, from back when it was broken in facts...
