#9695 Ansible Cleanup: Move systems to linux-system-roles networking
Opened 10 months ago by smooge. Modified 2 days ago

Describe what you would like us to do:

Many older systems are configured using various templates or were hand-configured when brought up. The 'new' Ansible linux-system-roles for networking allow us to get rid of these templates and use a standard method per OS to get things configured. The task would be to collect the MAC addresses for each system and then make a pull request for the system host_vars data.

Old method:

datacenter: iad2

br0_dev: eth1


New method:

# This virthost only has stg instances, so it doesn't freeze
freezes: false
nested: true

has_ipv4: yes
br0_ipv4_nm: 24

mgmt_mac: 2c:ea:7f:f3:6c:be
mac1: E4:43:4B:F7:B7:B8
mac2: E4:43:4B:F7:B7:BA
mac3: E4:43:4B:F7:B7:D8
mac4: E4:43:4B:F7:B7:D9

br0_port0_mac: "{{ mac1 }}"

network_connections:
  - name: br0
    state: up
    type: bridge
    autoconnect: yes
    ip:
      address:
        - "{{ br0_ipv4 }}/{{ br0_ipv4_nm }}"
      gateway4: "{{ br0_ipv4_gw }}"
      dns:
        - "{{ dns }}"
      dns_search:
        - stg.iad2.fedoraproject.org
        - iad2.fedoraproject.org
        - fedoraproject.org
      dhcp4: no
      auto6: no
  - name: br0-port0
    state: up
    type: ethernet
    master: br0
    mac: "{{ br0_port0_mac }}"

It is OK to ask for help with any data that isn't known.

When do you need this to be done by? (YYYY/MM/DD)

Metadata Update from @mohanboddu:
- Issue priority set to: Waiting on Assignee (was: Needs Review)
- Issue tagged with: easyfix, low-gain, low-trouble, ops

10 months ago

I can take a look at this, but need some info to get started:
- Do we have a map of these "many older systems"?
- How do I access these systems?
- Where should the pull request for the system host_vars data go?

The ansible repository is at https://pagure.io/fedora-infra/ansible and I will create a list of systems and information later today.

Looks like uploading tarballs does not work. I am putting the tarball at https://smooge.fedorapeople.org/fedora-infra/ansible-ip-info.tgz

If @copperi doesn't have the time for this I can take it and start working on it.

@bodanel I have started, but sure you can work on this as well, there are about 500 machines and I have access to 50.

So far I have done:
pr_submitted proxy05.fedoraproject.org ... ... proxy05.fedoraproject.org moved to linux-system-roles networking
no response ... proxy06.fedoraproject.org
no response ... proxy09.fedoraproject.org
pr_submitted proxy10.iad2.fedoraproject.org ... proxy10.iad2.fedoraproject.org moved to linux-system-roles networking
pr_submitted proxy101.iad2.fedoraproject.org ... proxy101.iad2.fedoraproject.org moved to linux-system-roles networking
pr_submitted proxy11.fedoraproject.org ... proxy11.fedoraproject.org moved to linux-system-roles networking
pr_submitted proxy110.iad2.fedoraproject.org ... proxy110.iad2.fedoraproject.org moved to linux-system-roles networking
pr_submitted proxy12.fedoraproject.org ... proxy12.fedoraproject.org moved to linux-system-roles networking
no response ... proxy13.fedoraproject.org
pr_submitted proxy14.fedoraproject.org ... proxy14.fedoraproject.org moved to linux-system-roles networking
n/a proxy30.fedoraproject.org
n/a proxy31.fedoraproject.org
n/a proxy32.fedoraproject.org
n/a proxy33.fedoraproject.org
n/a proxy34.fedoraproject.org
n/a proxy35.fedoraproject.org
n/a proxy36.fedoraproject.org
n/a proxy37.fedoraproject.org
n/a proxy38.fedoraproject.org
n/a proxy39.fedoraproject.org
n/a proxy40.fedoraproject.org

Yes, please do this in multiple small pull requests. We will want to merge and test them in blocks, and small PRs work better for that.

When you can please have a look at https://pagure.io/fedora-infra/ansible/pull-request/481 and let me know if it is ok. If yes, I will start modifying blocks of servers.

done. I'll keep updating the tickets as I push my modifications

The following servers are done also


Can you please assign the ticket to me, at least so I can find it more easily?

The servers below are committed

The following servers are done

Assigning to @bodanel :) Thanks for working on it.

Metadata Update from @kevin:
- Issue assigned to bodanel

9 months ago

buildvm-ppc64le-[11-40].iad2.fedoraproject.org done.

Metadata Update from @bodanel:
- Assignee reset

8 months ago

Metadata Update from @smooge:
- Issue assigned to bodanel

8 months ago

Hi. Just a quick suggestion: can we open a tracker for this somewhere? It would make it easy to see what's done and what's left, and it would save time. It would work if we can have a markdown-rendered file, something like this:

  • [ x ] file1
  • [ ] file2
    (Don't know why it is not rendered properly, just an idea)

so, one thing I have discovered/realized: we do not want to do this for VMs... only bare metal machines.

I think that should cut things down a lot. Basically only hosts that do not have vmhost: set on them.
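A rough sketch of that filter, assuming each host's vars have already been loaded into a plain dict (the hostnames and the `bare_metal_hosts` helper below are made up for illustration, not something in the ansible repo):

```python
def bare_metal_hosts(host_vars):
    """Return the sorted hostnames whose vars do not define a 'vmhost' key."""
    return sorted(name for name, hv in host_vars.items() if "vmhost" not in hv)


# Hypothetical host_vars data: a guest points at its hypervisor via 'vmhost',
# a bare metal machine has no such key.
example = {
    "bvmhost-x86-01.example.org": {"datacenter": "iad2"},
    "buildvm-x86-01.example.org": {"vmhost": "bvmhost-x86-01.example.org"},
}

print(bare_metal_hosts(example))
```

In practice `vmhost` may also come from group vars, so a real check would probably need the merged inventory rather than host_vars files alone.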

Do you also want to clean up the existing VMs and leave them on DHCP?

Not DHCP. VMs are set up with static IPs from our virt-install command and a few variables, so they always have the network setup we expect.
If we tried to use linux-system-roles networking on them, we would have to install them, let the playbook fail, update the new MAC address, and re-run the playbook.
This is not ideal; instead we should just assume they got the correct static networking when they were installed.
Does that make sense?

I'm a new contributor to the Fedora Infrastructure team and I'd like to help out with this issue.

If a task list would still be useful, I generated one using the filenames that were uploaded in the tarball. I can open a pull request if someone directs me to a path to place this list.

I can then mark the completed systems as done and remove the virtual machines from the list. I can also begin editing the host_vars files as well with some direction.
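For reference, a minimal sketch of how such a task list could be generated from filenames. It assumes each file is named after its host; the `checklist` helper and the example paths are hypothetical.

```python
from pathlib import PurePosixPath


def checklist(filenames, done=()):
    """Build a markdown task list with one line per host, sorted by name."""
    done = set(done)
    hosts = sorted({PurePosixPath(f).name for f in filenames})
    return "\n".join(
        f"- [{'x' if host in done else ' '}] {host}" for host in hosts
    )


# Hypothetical file names standing in for the tarball contents.
print(checklist(
    ["host_vars/b.example.org", "host_vars/a.example.org"],
    done=["a.example.org"],
))
```

Completed hosts can then be passed in via `done` as PRs get merged.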

Can you just attach the list here?

Thanks for working on this!

Full list found here: https://pagure.io/9695_ansible_cleanup/blob/master/f/systems.md

I'll work on removing the vmhost items from the task list today.

it's not on the list, but just in case - please do not do this on openQA worker host boxes (openqa-*-worker*). They have very specific network config requirements and you need to know how openQA works in order to change the networking config and make sure the changes work OK.

Full list found here: https://pagure.io/9695_ansible_cleanup/blob/master/f/systems.md

I'll work on removing the vmhost items from the task list today.

vmhost are fine to stay. They are bare metal. :)
it's 'buildvm*' that are all VMs...

Okay all 'buildvm*' hosts removed from the tasklist, reducing the number of hosts from 482 to 310.
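A sketch of that kind of name-based pruning with shell-style patterns. The `remaining` helper and the hostnames are made up for illustration. Note the pattern here is `buildvm-*` rather than a bare `buildvm*`; that is an assumption on my part so that bare metal names such as `buildvmhost-*` are not swept up too. The openQA worker pattern comes from the request above.

```python
from fnmatch import fnmatch

# Patterns for hosts that should not be converted.
EXCLUDE_PATTERNS = ("buildvm-*", "openqa-*-worker*")


def remaining(hosts, patterns=EXCLUDE_PATTERNS):
    """Return the hosts that match none of the exclusion patterns."""
    return [h for h in hosts if not any(fnmatch(h, p) for p in patterns)]


# Hypothetical host list.
hosts = [
    "buildvm-x86-01.example.org",
    "buildvmhost-s390x-01.example.org",
    "openqa-x86-worker01.example.org",
    "vmhost-x86-01.example.org",
]
print(remaining(hosts))
```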

Also all items mentioned as completed in the comments have been marked as completed in the tasklist.

I'm going to start working on a block of bvmhost-a64 machines. Can someone check over this first one and see if I'm configuring the bridge correctly? I also wasn't sure what to do with br0_dev.

So, we also need the MAC address in a - name: br0-port0 section... I am not sure how best to get the list of MAC addresses to you. Could I put them in a file on batcave01?

Basically the idea here is that when we specify mac address, linux system roles/networking will know exactly what interface we mean, without us having to call it eth0 or enasaskjdfjdkshgdysrjsdhfjs1 or whatever, and it will know to add that to a bridge, etc.
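To make the shape concrete, here is a small sketch that builds the bridge entry and its MAC-matched port entry as plain dicts. The keys mirror the "New method" example at the top of the ticket; the `bridge_connections` helper itself is hypothetical, and the ip settings are left out to keep it short.

```python
def bridge_connections(bridge, port_mac):
    """Return a bridge connection plus one ethernet port selected by MAC."""
    return [
        {"name": bridge, "state": "up", "type": "bridge", "autoconnect": True},
        {
            "name": f"{bridge}-port0",
            "state": "up",
            "type": "ethernet",
            "master": bridge,  # attach this port to the bridge
            "mac": port_mac,   # picks the interface regardless of its kernel name
        },
    ]


conns = bridge_connections("br0", "E4:43:4B:F7:B7:B8")
print(conns[1]["name"], conns[1]["mac"])
```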

Right that makes sense about the br0-port0 section. Sure, just let me know where the file is located on batcave01.

Got it! Also should the dns and dns_search sections just match the original "New method:" example?

ok, I just copied the ansible facts cache for bvmhost* to /var/tmp/bvmhost-facts-cache/ so you should be able to get macaddress: info from there.

Yes, dns* should also be set. :)
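A sketch of pulling the MAC addresses out of one cached facts file. It assumes the cache is JSON with the usual `ansible_interfaces` list and one `ansible_<iface>` entry per interface (with `-` in the interface name turned into `_`); the sample data and the `ethernet_macs` helper are made up for illustration.

```python
import json


def ethernet_macs(facts):
    """Map each physical ethernet interface name to its upper-cased MAC."""
    macs = {}
    for iface in facts.get("ansible_interfaces", []):
        info = facts.get("ansible_" + iface.replace("-", "_"), {})
        # Skip loopback, bridges and anything without a MAC address.
        if info.get("type") == "ether" and "macaddress" in info:
            macs[iface] = info["macaddress"].upper()
    return macs


# Hypothetical excerpt of a facts cache file.
sample = json.loads("""
{
  "ansible_interfaces": ["lo", "eth1", "br0"],
  "ansible_lo": {"type": "loopback"},
  "ansible_eth1": {"type": "ether", "macaddress": "e4:43:4b:f7:b7:b8"},
  "ansible_br0": {"type": "bridge", "macaddress": "e4:43:4b:f7:b7:b8"}
}
""")
print(ethernet_macs(sample))
```

The bridge usually reports the same MAC as its first port, which is why it is filtered out by type here rather than by address.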

Alright I updated the host_vars file for the first bvmhost-a64 server seen here at https://pagure.io/fedora-infra/ansible/pull-request/663. After a successful build I will have time to work on additional bvmhost-a64 servers this week.

I just made a pull request for the majority of the bvmhost servers https://pagure.io/fedora-infra/ansible/pull-request/663.

Checklist updated.

Three servers were missing from the files dumped in the batcave01 /var/tmp/bvmhost-facts-cache directory:
- bvmhost-a64-10.iad2.fedoraproject.org
- bvmhost-p08-03.iad2.fedoraproject.org
- bvmhost-p08-04.iad2.fedoraproject.org

Ready for additional mac addresses. Perhaps buildvm* servers?

I'm ready for another mac address dump.
- copr
- vmhost

- bvmhost-a64-10.iad2.fedoraproject.org
- bvmhost-p08-03.iad2.fedoraproject.org
- bvmhost-p08-04.iad2.fedoraproject.org

copr* are in aws, can be excluded.

These 3 are all down. The first has a bad disk issue, the other 2 are being decommissioned.
- bvmhost-a64-10.iad2.fedoraproject.org
- bvmhost-p08-03.iad2.fedoraproject.org
- bvmhost-p08-04.iad2.fedoraproject.org

vmhost* is in /var/tmp/vmhosts/ on batcave01. :)

Tasklist has been updated. I also went ahead and removed servers on AWS. Not that many remaining!

I put the remaining server names in a text file if that helps. I'm ready for the last mac address dump whenever.

So that text file now includes these:


as I mentioned above, please DO NOT just convert these over. They, especially the ones in the openqa_tap_workers group, have very specific networking requirements that are likely not covered by system roles yet. I'm happy to talk with you regarding those requirements any time, just ping me in IRC/Matrix or something.

Metadata Update from @bodanel:
- Assignee reset

4 months ago

The following can be dropped / removed:

sign-vault01 (it doesn't get managed by ansible)
kernel01 (it's not managed by us)
bvmhost-a64-10 (it's down, we will have to add it later when it's fixed)

These should possibly not be in ansible/inventory anymore?

virthost-cloud01 (this doesn't exist anymore)
buildhw-a64-07.iad2.fedoraproject.org (dead hw)
buildhw-a64-09.iad2.fedoraproject.org (dead hw)
buildhw-a64-10.iad2.fedoraproject.org (dead hw)
host1plus (no longer exists)
osuosl03 (no longer exists)
proxy07.fedoraproject.org (no longer exists)
virthost-rdu02 (no longer exists)

The rest of their ansible facts should be in /tmp/9695 on batcave. ;)

so from a quick look at the system roles stuff, I don't think it supports most of the advanced network stuff we need on the openQA worker hosts. So I think the strategy will be just to switch the configuration of the main physical interface(s) and possibly the bridge interface to use system roles, and continue setting the rest up the way we currently do in the plays. A couple of initial questions to figure out:

  1. is system-roles OK with having both network.service and NetworkManager.service enabled, or will it always want to disable one of them?
  2. will system-roles be confused by, or try to mess with, configuration files created outside of its purview?

I'm gonna poke around a bit more tomorrow and maybe try to sketch out the changes for one host to see how it'd look.

I quickly looked at the systems roles documentation too and reached the same conclusion. I think it should be pretty straightforward to move the physical interface(s) over to use system-roles, assuming the two questions you brought up aren't an issue.

I'll be at work tomorrow, but I might have a few minutes free if you want to discuss changing over the first host.

Sorry, wasn't able to get back to this today, had some other things to work on. Will come back to it next week.

FYI, this is going to need to wait until after freeze at this point.

yeah, sorry, I didn't get around to it for the openqa workers in the end. will try to get to it after freeze.

Merged that last PR.

What's left here? openQA and that's it?

I did the openQA workers yesterday:

I think they're okay, nothing obviously blew up anyway. It's only possible to use system-roles to configure the regular physical interfaces, setup of the bridge and tap devices on the tap worker hosts is still in the openqa/worker role, I added a comment explaining this.

Thanks for the work on the openQA workers Adam!

Looks like there are a few stragglers left:
* virthost-rdu01.fedoraproject.org

The ansible facts cache for those is in /tmp/9695 on batcave01.

Is there any way I can help or is it mostly done? Was just looking at open issues.

Mostly done. Last bit waiting for final freeze to be lifted.

Looks like I'll need the macs dumped for these again as they're gone from /tmp on batcave01:

Looks like there are a few stragglers left:
* virthost-rdu01.fedoraproject.org

In /tmp/ticket-9695/ on batcave01.

Metadata Update from @kevin:
- Issue tagged with: unfreeze

a month ago

I messed up your PR by sorting all the hosts and group vars files. ;(

Can you rework the PR for that?

Ya no worries, I'll sort that out soon.

I believe I was able to successfully merge the changes from sorting the yaml host vars. See new pull request https://pagure.io/fedora-infra/ansible/pull-request/898.

The PR seems to revert all the vars changes or something? in any case it's got like 300+ files changed. ;(

Should just be a few...

Alright that's no good. See new pull request here (crosses fingers) https://pagure.io/fedora-infra/ansible/pull-request/899.

ok. Got those to mostly work. A few issues I hit:

  • had to remove auto6: false, because setting it causes an error: the dns servers apply to both ipv4 and ipv6, and with ipv6 disabled it can't set them there.
  • buildvmhost-s390x-01 is a weird host, it has a different mac address on its bridge from its network
  • ibiblio01/05 are weird because they have bond devices. I disabled them for now from using this.
  • internetx had an incorrect gateway and was missing br0_ipv6_gw (fixed it)

So, looking at all hosts, dropping those that are cloud hosts, I see the following that still need fixing:


And we should sort the bond devices on ibiblio01/05, but that's going to be a bit complex. ;(

Yeah I had some issues figuring out the last round of hosts because they were a bit different.

If you dump those remaining hosts on over to batcave01 I can take a look.

What do we need to do to sort out ibiblio01/05?

Finally what did you use to sort yaml? I hacked together something quick with Python, but just curious.

ok, in /tmp/9695 on batcave01 is the facts cache for those hosts.

remember to drop the auto6: false. :)
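If it helps, a quick sketch of stripping that setting from already-loaded connection data. It assumes the `network_connections` list has been parsed into Python dicts, so `auto6: no` / `auto6: false` shows up as `False`; the `drop_auto6_false` helper is hypothetical.

```python
import copy


def drop_auto6_false(connections):
    """Return a copy of the connections with any 'auto6: false' removed."""
    cleaned = copy.deepcopy(connections)
    for conn in cleaned:
        ip = conn.get("ip")
        if isinstance(ip, dict) and ip.get("auto6") is False:
            del ip["auto6"]
    return cleaned


conns = [
    {"name": "br0", "type": "bridge", "ip": {"dhcp4": False, "auto6": False}},
    {"name": "br0-port0", "type": "ethernet", "master": "br0"},
]
print(drop_auto6_false(conns))
```

The input list is left untouched, so the before and after can be diffed.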

For sorting, I used 'yq'. https://github.com/mikefarah/yq/releases

yq eval 'sortKeys(..)' filename
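For comparison, a rough Python take on the same idea: sort the keys of every mapping recursively and leave list order alone. This is only a sketch on already-parsed data (loading and dumping the YAML is left out), not a claim about how yq does it internally.

```python
def sort_keys(node):
    """Recursively sort mapping keys; lists keep their original order."""
    if isinstance(node, dict):
        return {key: sort_keys(node[key]) for key in sorted(node)}
    if isinstance(node, list):
        return [sort_keys(item) for item in node]
    return node


# Hypothetical host_vars content after parsing.
data = {"nested": True, "freezes": False, "br0": {"ipv4_nm": 24, "dev": "eth1"}}
print(sort_keys(data))
```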


Boards 1
ops Status: Backlog