#7184 RFR : aarch64 openshift cluster for OSBS
Closed: Fixed 5 years ago Opened 5 years ago by cverna.

  • Describe what you need us to do:
    I would like to deploy a new openshift cluster to add aarch64 to OSBS. For this I will require at least:
    • 2 aarch64 machines in production (1 master and 1 node)
    • 2 aarch64 machines in staging (1 master and 1 node)

If resources allow, it would be nice to scale up to 3 machines per environment (1 master and 2 nodes).

Currently on our x86_64 cluster the machines have the following specs (a rough aarch64 equivalent is sketched just below):
lvm_size: 120g
mem_size: 16384
max_mem_size: 16384
num_cpus: 4
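
For reference, a minimal sketch of how the same specs could be expressed as host_vars for one of the requested aarch64 guests; the hostname and file path here are assumptions for illustration, not the actual infra layout:

    # hypothetical ansible/host_vars/osbs-aarch64-node01.stg.arm.fedoraproject.org
    # values simply mirror the current x86_64 builders listed above
    lvm_size: 120g
    mem_size: 16384
    max_mem_size: 16384
    num_cpus: 4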

  • When do you need this? (YYYY/MM/DD)
    Ideally I would like this cluster to be ready and deployed for F29. If I could get the machines available before the end of August, that would be great.

  • When is this no longer needed or useful? (YYYY/MM/DD)

  • If we cannot complete your request, what is the impact?


I am not sure if we have any readily available aarch64 systems we could turn over to this. I think we would need to turn off 4-6 current builders and dedicate them to this. The disk space is the major reason: most of the systems have guests which are currently using most of the disk space on the boxes.

It isn't impossible; it just means it will require turning off some things somewhere.

Is it a requirement that the hardware for this be in a private data center that is within the control of the Fedora infra team? The reason I ask is that we might be able to get some resources donated from a cloud vendor for this, if we would allow something like that.

I also cc @pbrobinson who might know how to get some hardware too.

Any hardware that needs to be acquired usually takes no less than 3 months to get in place. Peter works wonders, but a lot of the hardware he will have available is going to be prerelease hardware, which then needs to wait for firmware updates to work with Fedora. It is then usually a couple of months of an 'ok, that fixed this problem, but now we have this' cycle of firmware updates. I don't know how much OSBS needs from the regular PHX2 facility, so it may be better to look at dusty's resources.

If it is tied heavily to NFS/koji/Fedora DBs/etc., which would make it impossible to host outside, then we will just need to figure out what gets 'shelved' for this to be implemented.

The only thing I can see that we can do here is to reduce the number of builders and repurpose some of those for this. For staging, could you make do with 1 (that runs both master and node)?

I have no idea how OSBS would behave with such a setup, but if we can get one machine in stg relatively easily then I'm all for trying it.

@smooge
I don't know how much OSBS needs from the regular PHX2 facility, so it may be better to look at dusty's resources.
If it is tied heavily to NFS/koji/Fedora DBs/etc., which would make it impossible to host outside, then we will just need to figure out what gets 'shelved' for this to be implemented.

@cverna ^^ do you know?

Honestly I can't see a reason why it would need to be in the PHX2 network, but I might be missing something.
Currently we have the x86_64 cluster in PHX2; if we have the aarch64 cluster outside this network, would that impact the ability of the two clusters to talk to each other?
For multi-arch, the OSBS orchestrator node would be on an x86_64 machine (PHX2) and would delegate aarch64 builds to the aarch64 OpenShift cluster (outside the network).
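
As a rough illustration of that split (the exact schema depends on the osbs/atomic-reactor version in use, and the cluster names here are placeholders), the orchestrator's config would map each platform to the worker cluster that handles it, roughly along these lines:

    # sketch of an orchestrator "clusters" section; names and limits are placeholders
    clusters:
      x86_64:
        - name: osbs-x86-64        # existing cluster in PHX2
          max_concurrent_builds: 6
          enabled: true
      aarch64:
        - name: osbs-aarch64       # the new cluster requested in this ticket
          max_concurrent_builds: 2
          enabled: true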

There are two Cavium ThunderX hosts in PHX2 awaiting the new OpenStack (or whatever it becomes) infra which are intended for this exact purpose. This and copr were their intended use.

They will definitely need the latest firmware, but I believe they're racked/powered/networked, awaiting provisioning.

The idea was to run them as hypervisors and split both the OpenShift and copr workloads across the two for resilience.

OK, currently they are on the network with copr/cloud, so we will need to figure out what network allowances are required to make it work.

@pbrobinson, can you help on the firmware items and how long they might take? Also, any gotchas that you might know about? [AKA walk 3 times widdershins around the rack before deploying, etc.]
@cverna, can you help on what ports/protocols are needed for coordination between the systems, so we can see what is needed on our part?

[I am just wanting to get a feel for whether we can make next Thursday for his desired date, or F29 for his required date, or later.]

The clusters communicate with each other via HTTP APIs, so we should not need anything fancy.
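
To sketch what that might mean in practice (purely an assumption about the eventual setup; the variable name and exact ports below are illustrative, with 8443 being the usual OpenShift 3.x master API port):

    # illustrative only; not an actual infra variable
    osbs_aarch64_network_allowances:
      openshift_api:     # orchestrator (x86_64, PHX2) -> aarch64 master API over HTTPS
        proto: tcp
        port: 8443
      registry_push:     # aarch64 nodes -> candidate registry over HTTPS
        proto: tcp
        port: 443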

I don't really like the idea of doing builds in a separate cloud network that has a bunch of untrusted stuff running in it. We currently have 24 Moonshot cartridges that are running armv7/aarch64 builders. If we can pull 3 of them (1 for stg and 2 for prod), that still leaves us 21, which I think would be ok...

That would work for me. Do you think it would be possible to have the stg box set up this week? I would like to try to deploy the stg cluster during the beta freeze.

So I pushed a new playbook in ansible to deploy the Origin cluster. In this playbook I assumed the aarch64 hosts will be named as follows (see the inventory sketch after this comment):

  • osbs-aarch64-nodes-stg and osbs-aarch64-masters-stg

I only did stg since we are in freeze.
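
For anyone following along, a minimal sketch of the inventory those names imply, treating them as group names (which is an assumption); the master hostname matches the one that shows up later in this ticket, the node hostname is a guess:

    all:
      children:
        osbs-aarch64-masters-stg:
          hosts:
            osbs-aarch64-master01.stg.arm.fedoraproject.org:
        osbs-aarch64-nodes-stg:
          hosts:
            osbs-aarch64-node01.stg.arm.fedoraproject.org: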

We ran into some issues deploying this VM. It seems to hang on boot... so we are trying various things to get it installed now. ;(

Metadata Update from @kevin:
- Issue priority set to: Waiting on Assignee (was: Needs Review)

5 years ago

Metadata Update from @smooge:
- Issue assigned to smooge

5 years ago

The staging host is now installable. The ansible playbook ran to completion, but I don't know what that means :smile:.

When trying to deploy the Origin cluster on the staging host, I ran into the following error.

Origin tries to create the docker-registry pod but fails.

Events:
  Type     Reason                 Age   From                                                      Message
  ----     ------                 ----  ----                                                      -------
  Normal   Scheduled              8m    default-scheduler                                         Successfully assigned docker-registry-1-lzjm7 to osbs-aarch64-master01.stg.arm.fedoraproject.org
  Normal   SuccessfulMountVolume  8m    kubelet, osbs-aarch64-master01.stg.arm.fedoraproject.org  MountVolume.SetUp succeeded for volume "registry-certificates"
  Normal   SuccessfulMountVolume  8m    kubelet, osbs-aarch64-master01.stg.arm.fedoraproject.org  MountVolume.SetUp succeeded for volume "registry-token-6f8wr"
  Warning  FailedMount            8m    kubelet, osbs-aarch64-master01.stg.arm.fedoraproject.org  MountVolume.SetUp failed for volume "openshift-stg-registry-volume" : mount failed: exit status 32
Mounting command: systemd-run
Mounting arguments: --description=Kubernetes transient mount for /var/lib/origin/openshift.local.volumes/pods/7bcd92ab-d142-11e8-8b83-525400e0ac42/volumes/kubernetes.io~nfs/openshift-stg-registry-volume --scope -- mount -t nfs ntap-phx2-c01-fedora01-nfs.storage.phx2.redhat.com://openshift-stg-registry /var/lib/origin/openshift.local.volumes/pods/7bcd92ab-d142-11e8-8b83-525400e0ac42/volumes/kubernetes.io~nfs/openshift-stg-registry-volume
Output: Running scope as unit: run-r190fc1d93ec140f8ba7266c2c9818946.scope
mount.nfs: access denied by server while mounting ntap-phx2-c01-fedora01-nfs.storage.phx2.redhat.com://openshift-stg-registry
  Warning  FailedMount  8m  kubelet, osbs-aarch64-master01.stg.arm.fedoraproject.org  MountVolume.SetUp failed for volume "openshift-stg-registry-volume" : mount failed: exit status 32

I am not sure what to do here to fix this issue; any help is welcome :smiley:
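
For context, the failing mount corresponds to an NFS-backed PersistentVolume roughly like the sketch below; the object name, server and path are taken from the error output, everything else (capacity, access mode, reclaim policy) is assumed. An "access denied by server" from mount.nfs usually means the NFS export does not allow the client host that is trying to mount it.

    # rough sketch of the PersistentVolume behind "openshift-stg-registry-volume"
    apiVersion: v1
    kind: PersistentVolume
    metadata:
      name: openshift-stg-registry-volume
    spec:
      capacity:
        storage: 10Gi              # assumed size
      accessModes:
        - ReadWriteMany
      nfs:
        server: ntap-phx2-c01-fedora01-nfs.storage.phx2.redhat.com
        path: /openshift-stg-registry
      persistentVolumeReclaimPolicy: Retain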

This was caused by some config from our openshift instances getting into the osbs template.

Commented that out and the playbook completes...

Staging is all done; we will have you open a new ticket when you are ready for prod nodes.

:cherry_blossom:

Metadata Update from @kevin:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

5 years ago
