Many older systems are configured using various templates or are hand-configured when brought up. The 'new' ansible linux-system-roles for networking allow us to get rid of these templates and use a standard method per OS to get things configured. The task would be to collect the mac addresses for each system and then make a pull request for the system host_vars data.
Old method:
```yaml
---
datacenter: iad2
br0_ip: 10.3.170.11
br0_nm: 255.255.255.0
br0_gw: 10.3.170.254
br0_dev: eth1
dns: 10.3.163.33
```
New method:
```yaml
---
# This virthost only has stg instances, so it doesn't freeze
freezes: false
nested: true
dns: 10.3.163.33
has_ipv4: yes
br0_ipv4: 10.3.166.28
br0_ipv4_nm: 24
br0_ipv4_gw: 10.3.166.254
mgmt_mac: 2c:ea:7f:f3:6c:be
mgmt_ipv4: 10.3.160.46
mac1: E4:43:4B:F7:B7:B8
mac2: E4:43:4B:F7:B7:BA
mac3: E4:43:4B:F7:B7:D8
mac4: E4:43:4B:F7:B7:D9
br0_port0_mac: "{{ mac1 }}"
network_connections:
  - name: br0
    state: up
    type: bridge
    autoconnect: yes
    ip:
      address:
        - "{{ br0_ipv4 }}/{{ br0_ipv4_nm }}"
      gateway4: "{{ br0_ipv4_gw }}"
      dns:
        - "{{ dns }}"
      dns_search:
        - stg.iad2.fedoraproject.org
        - iad2.fedoraproject.org
        - fedoraproject.org
      dhcp4: no
      auto6: no
  - name: br0-port0
    state: up
    type: ethernet
    master: br0
    mac: "{{ br0_port0_mac }}"
```
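For context, host_vars like the above get consumed by the networking role roughly as follows. This is only a sketch: the group name is made up and the actual playbook wiring in the ansible repo may differ.

```yaml
# Sketch only: the role reads the network_connections variable set in host_vars.
# "virthost_example" is a made-up group name; the fedora-infra playbooks may
# include the role differently than this.
- hosts: virthost_example
  roles:
    - linux-system-roles.network
```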
It's ok to ask for more help on any data that isn't known.
Metadata Update from @mohanboddu: - Issue priority set to: Waiting on Assignee (was: Needs Review) - Issue tagged with: easyfix, low-gain, low-trouble, ops
I can take a look at this, but need some info to get started:
- Do we have a map of these "many older systems"?
- How do we access these systems?
- What is the location for the pull request for the system host_vars data?
The ansible repository is at https://pagure.io/fedora-infra/ansible and I will create a list of systems and information later today.
Looks like uploading tarballs does not work. I am putting the tarball at https://smooge.fedorapeople.org/fedora-infra/ansible-ip-info.tgz
If @copperi doesn't have the time for this I can take it and start working on it.
@bodanel I have started, but sure, you can work on this as well; there are about 500 machines and I have access to 50.
So far I have done:
```
pr_submitted  proxy05.fedoraproject.org        (moved to linux-system-roles networking)
no response   proxy06.fedoraproject.org
no response   proxy09.fedoraproject.org
pr_submitted  proxy10.iad2.fedoraproject.org   (moved to linux-system-roles networking)
pr_submitted  proxy101.iad2.fedoraproject.org  (moved to linux-system-roles networking)
pr_submitted  proxy11.fedoraproject.org        (moved to linux-system-roles networking)
pr_submitted  proxy110.iad2.fedoraproject.org  (moved to linux-system-roles networking)
pr_submitted  proxy12.fedoraproject.org        (moved to linux-system-roles networking)
no response   proxy13.fedoraproject.org
pr_submitted  proxy14.fedoraproject.org        (moved to linux-system-roles networking)
n/a           proxy30.fedoraproject.org
n/a           proxy31.fedoraproject.org
n/a           proxy32.fedoraproject.org
n/a           proxy33.fedoraproject.org
n/a           proxy34.fedoraproject.org
n/a           proxy35.fedoraproject.org
n/a           proxy36.fedoraproject.org
n/a           proxy37.fedoraproject.org
n/a           proxy38.fedoraproject.org
n/a           proxy39.fedoraproject.org
n/a           proxy40.fedoraproject.org
```
Yes, please do this in multiple small PRs. We will want to be able to merge them and test them in blocks, which small PRs work better for.
@smooge When you can please have a look at https://pagure.io/fedora-infra/ansible/pull-request/481 and let me know if it is ok. If yes, I will start modifying blocks of servers.
buildhw-a64-0[1-6].iad2.fedoraproject.org
buildhw-a64-11.iad2.fedoraproject.org
buildhw-a64-19.iad2.fedoraproject.org
buildhw-a64-20.iad2.fedoraproject.org
buildhw-x86-02.iad2.fedoraproject.org
buildhw-x86-03.iad2.fedoraproject.org
buildhw-x86-04.iad2.fedoraproject.org
buildhw-x86-0[6-16].iad2.fedoraproject.org
done. I'll keep updating the tickets as I push my modifications.
The following servers are also done:
buildvm-a32-0[1-27].iad2.fedoraproject.org
buildvm-a32-0[1-2].stg.iad2.fedoraproject.org
buildvm-a32-3[1-3].iad2.fedoraproject.org
@smooge Can you please assign the ticket to me, at least so I can find it more easily?
The servers below are committed:
buildvm-a64-01.iad2.fedoraproject.org
buildvm-a64-01.stg.iad2.fedoraproject.org
buildvm-a64-02.iad2.fedoraproject.org
buildvm-a64-02.stg.iad2.fedoraproject.org
The following servers are done: buildvm-a64-[03-23].iad2.fedoraproject.org
Assigning to @bodanel :) Thanks for working on it.
Metadata Update from @kevin: - Issue assigned to bodanel
buildvm-ppc64le-[11-40].iad2.fedoraproject.org done.
Metadata Update from @bodanel: - Assignee reset
Metadata Update from @smooge: - Issue assigned to bodanel
Hi, just a quick suggestion: can we open a tracker for this somewhere? It would make it easy to see what's done and what's left, and it would save time if we had a markdown-rendered file, something like this:
So, one thing I have discovered/realized: we do not want to do this for vm's... only bare metal machines.
I think that should cut things down a lot. Basically only hosts that do not have vmhost: set on them.
Do you also want to clean up the existing VMs and leave them on DHCP?
No on dhcp: basically vm's are set up with static ips from our virt-install command and a few variables, so they always have the network setup we expect. If we try to use linux-system-roles/networking on them, we have to install them, have the playbook fail, update the new mac address, and re-run the playbook. That's not ideal; instead we should just assume they have the correct static networking set when they were installed. Does that make sense?
I'm a new contributor to the Fedora Infrastructure team and I'd like to help out with this issue.
If a task list would still be useful, I generated one using the filenames that were uploaded in the tarball. I can open a pull request if someone directs me to a path to place this list.
I can then mark the completed systems as done and remove the virtual machines from the list. I can also begin editing the host_vars files with some direction.
Can you just attach the list here?
Thanks for working on this!
Full list found here: https://pagure.io/9695_ansible_cleanup/blob/master/f/systems.md
I'll work on removing the vmhost items from the task list today.
it's not on the list, but just in case - please do not do this on openQA worker host boxes (openqa-*-worker*). They have very specific network config requirements and you need to know how openQA works in order to change the networking config and make sure the changes work OK.
Full list found here: https://pagure.io/9695_ansible_cleanup/blob/master/f/systems.md I'll work on removing the vmhost items from the task list today.
vmhost are fine to stay. They are bare metal. :) it's 'buildvm*' that are all vm's...
Okay all 'buildvm*' hosts removed from the tasklist, reducing the number of hosts from 482 to 310.
Also all items mentioned as completed in the comments have been marked as completed in the tasklist.
I'm going to start working on a block of bvmhost-a64 machines. Can someone check over this first one and see if I'm configuring the bridge correctly? I also wasn't sure what to do with br0_dev.
So, we also need a mac address in a '- name: br0-port0' section... I am not sure how best to get the list of mac addresses to you. I could put them in a file on batcave01?
Basically the idea here is that when we specify mac address, linux system roles/networking will know exactly what interface we mean, without us having to call it eth0 or enasaskjdfjdkshgdysrjsdhfjs1 or whatever, and it will know to add that to a bridge, etc.
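To make that concrete, the essential bit is just two connection entries; the addresses and mac below are copied from the example at the top of the ticket, purely as an illustration and not from a real host:

```yaml
network_connections:
  # The bridge carries the IP configuration.
  - name: br0
    type: bridge
    state: up
    ip:
      dhcp4: no
      address:
        - 10.3.166.28/24
      gateway4: 10.3.166.254
  # The physical port is matched by MAC, so the kernel interface name never matters.
  - name: br0-port0
    type: ethernet
    state: up
    master: br0
    mac: e4:43:4b:f7:b7:b8
```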
Right that makes sense about the br0-port0 section. Sure, just let me know where the file is located on batcave01.
Got it! Also should the dns and dns_search sections just match the original "New method:" example?
ok, I just copied the ansible facts cache for bvmhost* to /var/tmp/bvmhost-facts-cache/ so you should be able to get macaddress: info from there.
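If it's easier than picking through the cache files by hand, a throwaway play like this would print the same info straight from gathered facts (untested sketch, and the group name is made up):

```yaml
# Untested sketch: show each host's interface list and the MAC of its default
# IPv4 interface. "bvmhost_all" is not a real inventory group.
- hosts: bvmhost_all
  gather_facts: yes
  tasks:
    - name: Show interfaces and the MAC of the default IPv4 interface
      debug:
        msg: >-
          {{ inventory_hostname }}:
          interfaces={{ ansible_interfaces | join(',') }}
          default_mac={{ ansible_default_ipv4.macaddress | default('n/a') }}
```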
Yes, dns* should also be set. :)
Alright I updated the host_vars file for the first bvmhost-a64 server seen here at https://pagure.io/fedora-infra/ansible/pull-request/663. After a successful build I will have time to work on additional bvmhost-a64 servers this week.
I just made a pull request for the majority of the bvmhost servers https://pagure.io/fedora-infra/ansible/pull-request/663.
Checklist updated.
Three servers were missing from the files dumped in the batcave01 /var/tmp/bvmhost-facts-cache directory:
- bvmhost-a64-10.iad2.fedoraproject.org
- bvmhost-p08-03.iad2.fedoraproject.org
- bvmhost-p08-04.iad2.fedoraproject.org
Ready for additional mac addresses. Perhaps buildvm* servers?
I'm ready for another mac address dump.
- copr
- vmhost
- bvmhost-a64-10.iad2.fedoraproject.org
- bvmhost-p08-03.iad2.fedoraproject.org
- bvmhost-p08-04.iad2.fedoraproject.org
copr* are in aws, can be excluded.
These 3 are all down. The first has a bad disk issue, the other 2 are being decommissioned.
- bvmhost-a64-10.iad2.fedoraproject.org
- bvmhost-p08-03.iad2.fedoraproject.org
- bvmhost-p08-04.iad2.fedoraproject.org
vmhost* is in /var/tmp/vmhosts/ on batcave01. :)
Pull request made for vmhost* servers https://pagure.io/fedora-infra/ansible/pull-request/740.
Tasklist has been updated. I also went ahead and removed servers on AWS. Not that many remaining!
I put the remaining server names in a text file if that helps. I'm ready for the last mac address dump whenever.
So that text file now includes these:
openqa-a64-worker01.iad2.fedoraproject.org
openqa-a64-worker02.iad2.fedoraproject.org
openqa-a64-worker03.iad2.fedoraproject.org
openqa-p09-worker01.iad2.fedoraproject.org
openqa-x86-worker01.iad2.fedoraproject.org
openqa-x86-worker02.iad2.fedoraproject.org
openqa-x86-worker04.iad2.fedoraproject.org
as I mentioned above, please DO NOT just convert these over. They, especially the ones in the openqa_tap_workers group, have very specific networking requirements that are likely not covered by system roles yet. I'm happy to talk with you regarding those requirements any time, just ping me in IRC/Matrix or something.
The following can be dropped / removed:
sign-vault01 (it doesn't get managed by ansible)
kernel01 (it's not managed by us)
bvmhost-a64-10 (it's down, we will have to add it later when it's fixed)
These should possibly not be in ansible/inventory anymore?
virthost-cloud01 (this doesn't exist anymore)
buildhw-a64-07.iad2.fedoraproject.org (dead hw)
buildhw-a64-09.iad2.fedoraproject.org (dead hw)
buildhw-a64-10.iad2.fedoraproject.org (dead hw)
host1plus (no longer exists)
osuosl03 (no longer exists)
proxy07.fedoraproject.org (no longer exists)
virthost-rdu02 (no longer exists)
The rest of their ansible facts should be in /tmp9695 on batcave. ;)
so from a quick look at the system roles stuff, I don't think it supports most of the advanced network stuff we need on the openQA worker hosts. So I think the strategy will be just to switch the configuration of the main physical interface(s) and possibly the bridge interface to use system roles, and continue setting the rest up the way we currently do in the plays. A couple of initial questions to figure out:
I'm gonna poke around a bit more tomorrow and maybe try to sketch out the changes for one host to see how it'd look.
I quickly looked at the system roles documentation too and reached the same conclusion. I think it should be pretty straightforward to move the physical interface(s) over to use system-roles, assuming the two questions you brought up aren't an issue.
I'll be at work tomorrow, but I might have a few minutes free if you want to discuss changing over the first host.
Sorry, wasn't able to get back to this today, had some other things to work on. Will come back to it next week.
Pull request made for buildvm-s90x* servers https://pagure.io/fedora-infra/ansible/pull-request/757.
FYI, this is going to need to wait until after freeze at this point.
yeah, sorry, I didn't get around to it for the openqa workers in the end. will try to get to it after freeze.
Merged that last PR.
What's left here? openqa and that's it?
I did the openQA workers yesterday:
I think they're okay, nothing obviously blew up anyway. It's only possible to use system-roles to configure the regular physical interfaces; setup of the bridge and tap devices on the tap worker hosts is still in the openqa/worker role. I added a comment explaining this.
Thanks for the work on the openQA workers Adam!
Looks like there are a few stragglers left:
buildvmhost-s390x-01.s390.fedoraproject.org
ibiblio01.fedoraproject.org
ibiblio05.fedoraproject.org
internetx01.fedoraproject.org
osuosl01.fedoraproject.org
osuosl02.fedoraproject.org
qvmhost-x86-02.iad2.fedoraproject.org
retrace03.rdu-cc.fedoraproject.org
storinator01.rdu-cc.fedoraproject.org
virthost-cc-rdu01.fedoraproject.org
virthost-cc-rdu02.fedoraproject.org
virthost-cc-rdu03.fedoraproject.org
virthost-rdu01.fedoraproject.org
The ansible facts cache for those is in /tmp/9695 on batcave01.
Is there any way I can help or is it mostly done? Was just looking at open issues.
Mostly done. Last bit waiting for final freeze to be lifted.
Looks like I'll need the macs dumped for these again as they're gone from /tmp on batcave01:
In /tmp/ticket-9695/ on batcave01.
Metadata Update from @kevin: - Issue tagged with: unfreeze
Pull request made for last batch of servers https://pagure.io/fedora-infra/ansible/pull-request/872.
I messed up your PR by sorting all the hosts and group vars files. ;(
Can you rework the PR for that?
Ya no worries, I'll sort that out soon.
I believe I was able to successfully merge the changes from sorting the yaml host vars. See new pull request https://pagure.io/fedora-infra/ansible/pull-request/898.
The PR seems to revert all the vars changes or something? In any case it's got like 300+ files changed. ;(
Should just be a few...
Alright that's no good. See new pull request here (crosses fingers) https://pagure.io/fedora-infra/ansible/pull-request/899.
ok. Got those to mostly work. A few issues I hit:
So, looking at all hosts, dropping those that are cloud hosts, I see the following that still need fixing:
aarch64-test01.fedorainfracloud.org
armv7-test01.fedorainfracloud.org
bastion13.fedoraproject.org
buildvm-s390x-01.stg.s390.fedoraproject.org
download-cc-rdu01.fedoraproject.org
ns05.fedoraproject.org
pagure-stg01.fedoraproject.org
proxy13.fedoraproject.org
smtp-mm-cc-rdu01.fedoraproject.org
smtp-mm-osuosl01.fedoraproject.org
unbound-cc-rdu01.fedoraproject.org
unbound-osuosl01.fedoraproject.org
And we should sort the bond devices on ibiblio01/05, but that's going to be a bit complex. ;(
Yeah I had some issues figuring out the last round of hosts because they were a bit different.
If you dump those remaining hosts on over to batcave01 I can take a look.
What do we need to do to sort out ibiblio01/05?
Finally, what did you use to sort the yaml? I hacked together something quick with Python, but I'm just curious.
ok, in /tmp/9695 on batcave01 is the facts cache for those hosts.
remember to drop the auto6: false. :)
For sorting, I used 'yq'. https://github.com/mikefarah/yq/releases
yq eval 'sortKeys(..)' filename
The first host, aarch64-test01.fedorainfracloud.org, is on AWS.
The remaining hosts all have the vmhost key set. This means they are VMs, correct? And we want to update only bare metal machines?
So, when we first started out I wanted to do everything. Then, I decided that vm's were hard because we didn't want to have to try and figure out mac addresses on them, since they change every time the vm is installed. Then, I figured out that we can get around that by never hardcoding the mac for vm's.
So, now, I think I want to do every host except for aws/cloud hosts. The reason they are excluded is that they often have internal addresses and just map external in via the aws infra, so we can't set the network and if we did it would break things.
For hardware hosts, we want to specify the mac addresses of the hardware and which one(s) are used in the bridges. For vm's, we want to specify mac address as:
mac0: "{{ ansible_default_ipv4.macaddress }}"
This works because we install the vm and pass it the ip and such, so when we connect to that ip via ansible and gather facts, we have the mac address.
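So a vm's host_vars end up looking roughly like this (sketch only; the eth0_* variable names just follow the same pattern as the br0 example at the top of the ticket):

```yaml
# Sketch only: the vm case. No hardcoded MAC anywhere in host_vars; ansible
# reuses whatever MAC the vm came up with, as seen in the gathered facts.
mac0: "{{ ansible_default_ipv4.macaddress }}"
network_connections:
  - name: eth0
    state: up
    type: ethernet
    autoconnect: yes
    mac: "{{ mac0 }}"
    ip:
      dhcp4: no
      address:
        - "{{ eth0_ipv4 }}/{{ eth0_ipv4_nm }}"
      gateway4: "{{ eth0_ipv4_gw }}"
```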
So, sorry for changing the scope here a bunch of times. ;(
Does that make sense?
No worries, let's do all the hosts then (except cloud of course)!
I think this makes sense, so for VMs I'll include the default interface name, eth0 etc., and for the mac address just specify the "{{ ansible_default_ipv4.macaddress }}" variable? Which means I may or may not need dumps for the remaining hosts.
I think it would help if we could chat just for a bit to strategize the rest of this project.
I'm on vacation until jan 3rd and on-line somewhere irregularly. You're welcome to ping me or chime in if you see me active tho, happy to chat on it more.
I'm going to add the following to group_vars/all:
```yaml
# Default Network Connection (linux-system-roles networking)
network_connections:
  - autoconnect: yes
    ip:
      address:
        - "{{ eth0_ipv4 }}/{{ eth0_ipv4_nm }}"
      dhcp4: no
      dns:
        - "{{ dns1 }}"
        - "{{ dns2 }}"
      dns_search:
        - "{{ dns_search1 }}"
        - "{{ dns_search2 }}"
      gateway4: "{{ eth0_ipv4_gw }}"
    mac: "{{ ansible_default_ipv4.macaddress }}"
    name: eth0
    type: ethernet
```
I'm noticing that in group_vars/all, dns1 and dns2 are defined, but dns_search1 and dns_search2 are not. Should I add these variables to group_vars/all? Additionally, eth0_nm is defined; should it be changed to eth0_ipv4_nm?
On the same note, many vms define eth0_ip, gw, and nm variables. Do these need to be changed to eth0_ipv4, eth0_ipv4_gw, and eth0_ipv4_nm to match the variables in group_vars/all?
yes to all that. ;)
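So, as a sketch, the per-host rename for a typical vm is just this (placeholder values, and assuming the old file uses a dotted netmask like the old br0 example at the top of the ticket; the exact old variable names vary per host):

```yaml
# Before (old per-host variable names, roughly):
#   eth0_ip: 10.3.163.10
#   eth0_nm: 255.255.255.0
#   eth0_gw: 10.3.163.254
# After, matching the group_vars/all default (prefix length instead of a dotted mask):
eth0_ipv4: 10.3.163.10
eth0_ipv4_nm: 24
eth0_ipv4_gw: 10.3.163.254
```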
Pull request for default network connection added to group_vars/all https://pagure.io/fedora-infra/ansible/pull-request/924.
I'm making progress with writing a script to change all the nm host_vars at once. I've noticed some vms define just dns, while others define both dns1 and dns2.
If a vm host_var file defines dns1 and dns2 as the default values (10.3.163.33 and 10.3.163.34) should these variables just be deleted as the vm already has the default dns variables defined?
Also some vms define dns (8.8.8.8). In this case dns, dns1, and dns2 will all be defined. Is there a way around this?
I believe the dns variables are the last hurdle.
I'm making progress with writing a script to change all the nm host_vars at once. I've noticed some vms define just dns, while others define both dns1 and dns2. If a vm host_var file defines dns1 and dns2 as the default values (10.3.163.33 and 10.3.163.34) should these variables just be deleted as the vm already has the default dns variables defined?
yes.
Change dns to dns1 and add 8.8.4.4 as dns2?
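i.e. a host that currently just has dns: 8.8.8.8 would end up with this (a sketch of what I have in mind):

```yaml
dns1: 8.8.8.8
dns2: 8.8.4.4
```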
Awesome.
Alright new pull request. https://pagure.io/fedora-infra/ansible/pull-request/931.
Should possibly be the last one!
A new default network connection is defined in group_vars/all. All non-cloud vms in host_vars were then edited to conform to this new default. Exceptions (either an ipv6 interface or multiple interfaces) were edited as well.
So, a bunch of testing and tweaking and a few more PR's and... we are finally done!
notifs-backend01 has issues, but that's not surprising. :(
Everything else is done as far as I can tell.
Many thanks to @petebuffon who worked so hard on this... kudos!
Metadata Update from @kevin: - Issue close_status updated to: Fixed - Issue status updated to: Closed (was: Open)