Issue #8407: aws permissions for starting spot instances for `aws-copr` - fedora-infrastructure

Can you try again now?

Metadata Update from @kevin:
- Issue priority set to: Waiting on Reporter (was: Needs Review)

4 years ago

Still the same issue:

Spot Request Failed
You are not authorized to perform this operation.

Creating security groups                Successful (sg-0d1643ae167188f78)
Authorizing inbound rules               Successful
Requesting Spot Instances               Failure

ok. Made another change... try again?

Also, can you try with the CLI and see if it provides more information on what failed?

fatal: [127.0.0.1]: FAILED! => {"changed": false, "msg": "Instance creation
failed => UnauthorizedOperation: You are not authorized to perform this
operation."}

The CLI prints the same useless message. I tried through an Ansible playbook,
though, because I don't know how to use the CLI properly yet. Judging from the
AMI permission problem in #8421, I don't think it would be more descriptive.

Edited 4 years ago by praiskup

Metadata Update from @smooge:
- Issue priority set to: Waiting on Assignee (was: Waiting on Reporter)

4 years ago

cverna commented 4 years ago

So I think this was sorted out by @kevin. Can we close this ticket?

Last time (last week) when I talked to @msuchy this did not work yet.

It does not; I just hit the same thing with our account as well.

Not sure if this is still an issue.
When a user needs to create a spot fleet, they need the following IAM permissions to be able to pass a role. They are required to specify an IAM fleet role of some sort:

"iam:ListRoles",
"iam:PassRole",
"iam:ListInstanceProfiles"

AWS offers a default IAM fleet role, aws-ec2-spot-fleet-tagging-role, which can be passed to the instances and allows them to be requested, launched, terminated, and tagged automatically.
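As a rough illustration of the permissions listed above, here is a minimal Python sketch that assembles them into an IAM policy document. The function name `build_pass_role_policy` and the default `Resource` of `"*"` are assumptions made for the example, not something taken from the Fedora account; a real policy would more likely restrict `iam:PassRole` to the ARN of the fleet role.

```python
import json

# The three IAM actions named in the comment above.
PASS_ROLE_ACTIONS = ["iam:ListRoles", "iam:PassRole", "iam:ListInstanceProfiles"]


def build_pass_role_policy(resource="*"):
    """Return an IAM policy document (as a dict) allowing the listed actions.

    resource="*" is an assumption for illustration only; narrowing it to the
    fleet role's ARN is the more cautious choice.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": list(PASS_ROLE_ACTIONS),
                "Resource": resource,
            }
        ],
    }


if __name__ == "__main__":
    # Print the JSON that could be pasted into the IAM policy editor.
    print(json.dumps(build_pass_role_policy(), indent=2))
```

The printed JSON is only a starting point for whoever administers the account; it does not change anything in AWS by itself.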

It is a problem. I get this when trying to launch an instance with a spot request:

You are not authorized to perform this operation. Encoded authorization failure message: 4Mi3M1NKwlumeltBTaHYtYz7EKgpS_f9O1BygIZhMfVP3woxQdpeou_j5woo3A1FBB4g57Y0314Y9Kt8R4zlhFfxMqIunIU2FKghF9eglNzTQ7Ihq-J28ACkpr-_vaJEv04EJwm-9gd824W-7PNooUWfIJpXQCKKBugl1xEf4XUaXuhKGPeKehxdlgvasFnunWy9lbxn9ZcdTrIMiz-9pDTGWUMBdeqbkoaRaW4ib3-ZJf3zPdW_pLGVV6bLS1zNnhI4ysYJtFPYw9l0L9-gQ4VZaV6LGLTwWfzmss-FI3huf5rfUD8v8o9NeoWy9Du0MjoRsYTTzDHW3otXJKx73Km1Adqnf-9NUpasuss8obAH8M_Rh87vqd5odjrQn5yMzb3Vx3gGck14ABa6AU_pP-mDPKQtR5cIEH5qLLNwtsyarOlmZ_PZCYohTpEHPQU3R5v5BcZWdnUQ5nt9yTgsMbIm14eAg5a7zCsVsbeCX6HNSdEeotlzD1KWlrtc1m2zPevcvJnZ5OHk1ZZF6Lr-nWDG5nuH1638lg

Edited 4 years ago by mvadkert

@mvadkert could you please run `aws sts decode-authorization-message --encoded-message <encoded_string>`, using the encoded string from your comment, to decode it?

@mobrien thanks for the command :)

An error occurred (AccessDenied) when calling the DecodeAuthorizationMessage operation: User: arn:aws:iam::125523088429:user/fedora-ci-testing-farm is not authorized to perform: sts:DecodeAuthorizationMessage

quite a funny message :D

because it works from cmdline :)

But there I use the automation user we have:

    "Arn": "arn:aws:iam::125523088429:user/fedora-ci-testing-farm"

OK, it looks like we will need an administrator to run the sts command to decode the message. The reason it has been encoded is that it could potentially contain privileged information.
It should hopefully contain the permissions that are causing the issue, though.
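For whoever ends up decoding the message, a small Python sketch may help. It assumes boto3 is installed and that the caller is allowed to perform `sts:DecodeAuthorizationMessage`. The helper names are hypothetical, and the `allowed` / `context.action` / `context.resource` fields reflect the layout the decoded JSON usually has, which should be treated as an assumption rather than a guarantee.

```python
import json


def decode_authorization_message(encoded_message):
    """Call STS DecodeAuthorizationMessage and return the decoded dict.

    Assumes boto3 is available and the credentials in use are permitted to
    perform sts:DecodeAuthorizationMessage (the automation user above is not).
    """
    import boto3  # imported lazily so summarize_denial() works without it

    sts = boto3.client("sts")
    response = sts.decode_authorization_message(EncodedMessage=encoded_message)
    return json.loads(response["DecodedMessage"])


def summarize_denial(decoded):
    """Pick the denied action and resource out of a decoded message.

    The field names are an assumption about the decoded JSON layout;
    missing fields simply come back as None.
    """
    context = decoded.get("context", {})
    return {
        "allowed": decoded.get("allowed"),
        "action": context.get("action"),
        "resource": context.get("resource"),
    }
```

Calling `summarize_denial(decode_authorization_message(encoded))` with administrator credentials should then show which action was refused, without having to read the whole decoded document.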

Ah right, could be :) The message just confused me.

@kevin could we maybe resolve this one as well sometime, please?

So from the output of the decoded message that Kevin provided (thanks Kevin), it appears that the assumed role aws-fedora-ci/mvadkert does not have the ec2:RequestSpotInstances permission.

I have added ec2:RequestSpotInstances to both fedora-ci-ec2 and copr-ec2 policies. Can you both try again and see what happens?

Spot Request Failed
The provided credentials do not have permission to create the service-linked role for EC2 Spot Instances.

Creating security groups                Successful (sg-0ac77ca30c28155e5)
Authorizing inbound rules               Successful
Requesting Spot Instances               Failure

https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/spot-requests.html#service-linked-roles-spot-instance-requests

Looks like there needs to be a service-linked role for requesting spot instances. These allow an AWS service to make requests on your behalf (EC2 requests in this case). There is a default role for this scenario, AWSServiceRoleForEC2Spot.
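One way to let a user create that role on first use is an IAM statement along the following lines. This is a sketch modelled on the pattern in the AWS documentation linked above, not a copy of what was actually added to the Fedora policies. The function name is hypothetical, and the wildcard account ID in the ARN is an assumption.

```python
import json

SPOT_SERVICE_NAME = "spot.amazonaws.com"
SPOT_ROLE_NAME = "AWSServiceRoleForEC2Spot"


def build_spot_service_linked_role_statement():
    """Return an IAM statement allowing creation of the Spot service-linked role.

    The "*" account ID in the ARN is an assumption for illustration; an
    administrator may prefer to pin it to the real account ID.
    """
    return {
        "Effect": "Allow",
        "Action": "iam:CreateServiceLinkedRole",
        "Resource": (
            "arn:aws:iam::*:role/aws-service-role/"
            f"{SPOT_SERVICE_NAME}/{SPOT_ROLE_NAME}"
        ),
        # Only allow the role to be created for the Spot service.
        "Condition": {"StringLike": {"iam:AWSServiceName": SPOT_SERVICE_NAME}},
    }


if __name__ == "__main__":
    print(json.dumps(build_spot_service_linked_role_statement(), indent=2))
```

Alternatively, an administrator can create the role once for the whole account with `aws iam create-service-linked-role --aws-service-name spot.amazonaws.com`, after which individual users no longer need the create permission.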

Ah ha.

Added.

@praiskup try now?

Seems like some max limit for spot instances:

Spot Request Failed
Max spot instance count exceeded

Creating security groups Successful (sg-016538d7de562c291)
Authorizing inbound rules Successful
Requesting Spot Instances Failure

Looks like the permissions are good. AWS places an initial limit on spot instances per account, as they do with most resources, to help them not get overloaded unexpectedly. There will usually be an overall limit and an instance type limit.

These limits can be seen in the limits section of the EC2 dashboard. We have the default limit, which is 20; this limit is also per region.

This can be raised with a request to AWS support, and they will almost always accept the request if it is reasonable.

Edited 4 years ago by mobrien

Well, our reason is that we would like to use spot instances for our CI workloads, where spot instances are the perfect fit :)

@mobrien thanks for the note about the default limit, I was not even aware there is such a thing.

@davdunc Can we get this limit raised on our account? Or should we use the normal process? or something else?

@kevin I asked to raise it on our internal account and it was not an issue, they raised it to 100 instances without any additional questions.

Are we supposed to request it ourselves (copr team)?

https://console.aws.amazon.com/support/home?#/case/create?issueType=service-limit-increase&limitType=service-code-ec2-spot-instances

@praiskup nope, it is a per account setting afaik. So the Fedora admins owning the AWS account need to do it:

We do not have permissions for that anyway

To clarify here, @davdunc is our community contact at Amazon. Since our account is a community account, I wanted to ask him about it before blindly following the normal process. It may be that he can just raise it for us, or that there's some other process for community accounts, or indeed we should just follow the normal process. But I wanted to actually find out.

I'll try and catch him Tuesday and see if I can find out more.

David was nice enough to increase the limit to 100 (in us-east-1, please let me know if other regions are needed)

I think we are all done here? Please reopen if I missed something we still need to do.

Metadata Update from @kevin:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

4 years ago

@kevin, @davdunc thank you! I suppose the limit of 100 is shared across the
whole Fedora community account (not only for aws-copr); since spot
instances are an almost perfect fit for Copr (we plan to run most of the
builders as spot instances, as we can afford restarting them) I'd expect
that we'll be able to utilize ~150 spot instances in peak times (the next
Copr release will be much, much more flexible from this POV).

Could the limit be raised to something around 150 (for Copr) + others, so
say 250?

Metadata Update from @praiskup:
- Issue status updated to: Open (was: Closed)

4 years ago

I tried to launch a single instance as aws-copr, and it still doesn't work:

Spot Request Failed
Max spot instance count exceeded

Requesting Spot Instances               Failure

Can we set the limit per sub-account within our community account?

What instance type are you planning on using?

davdunc commented 4 years ago

updated request to max 150.

davdunc commented 4 years ago

The increase is under evaluation. This is expected and can take some time to move to approved. I'll update with the approval.

What instance type are you planning on using?

Currently we use i3.large for x86_64, and a1.xlarge for aarch64.
But that's because those are the cheapest types with specs at least
as good as the builders we previously had in OpenStack.

Our plan was to have some set of normal instances to guarantee some
throughput, and complement it by spot instances.

I want to underline again that the new Copr release will be much more
flexible, and we'd like to scale up only when there's high load (Copr would
do this automatically). But in normal situations we plan to start an even
smaller set of workers than we have now (now we have 50 x86 and 10 arm, I
think we could go with "15+5" to "150+50").

We basically didn't want to have different instance types across our
builders at the beginning, but indeed only a minimal percentage of builds in
Copr needs to go with i3.large; there's an open possibility to use a smaller
instance type in general, and the large variants on an explicit Copr
user's request. (@msuchy, fyi, as we didn't want to concentrate on this
topic ATM).

updated request to max 150.

Thank you.

The increase is under evaluation. This is expected and can take some
time to move to approved. I'll update with the approval.

That's probably why I got "Max spot instance count exceeded". What
is the current (previous) limit? Am I able to see (with aws-copr credentials)
which instances ate the quota, given it is counted for the whole Fedora account?

Thank you for looking at this!

davdunc commented 4 years ago

Still in review. The original request was rejected, and the reason given was "account compromised". I am chasing down why customer service identified that status. I have a feeling that it is related to having an internal account as the payer for an external account. It's non-standard.

@davdunc any news here?

pingou commented 3 years ago

@davdunc did you manage to get it through?

@davdunc This might go through now that we cleaned up that status issue? Can you try again now?

So, I currently see:

Service quota                                                    Applied quota value  AWS default quota value  Adjustable
All F Spot Instance Requests                                     128                  0                        Yes
All G Spot Instance Requests                                     128                  0                        Yes
All Inf Spot Instance Requests                                   128                  0                        Yes
All P Spot Instance Requests                                     128                  0                        Yes
All Standard (A, C, D, H, I, M, R, T, Z) Spot Instance Requests  1,152                0                        Yes
All X Spot Instance Requests                                     128                  0                        Yes

So, it seems to be 128 for most types. Is that enough? Can you try again now?
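To help answer the "is that enough?" question, here is a rough back-of-the-envelope check in Python. It rests on two assumptions that should be verified: that these Spot quotas are counted in vCPUs (as AWS documents for its current Spot limits) rather than in instances, and that i3.large has 2 vCPUs and a1.xlarge has 4. Both instance families (I and A) would then fall under the "All Standard" row. The fleet sizes are the "15+5" and "150+50" figures mentioned earlier in the thread; the names in the code are made up for the example.

```python
# vCPUs per instance type (assumption: taken from the published EC2 specs).
VCPUS = {"i3.large": 2, "a1.xlarge": 4}

# Applied value of "All Standard (A, C, D, H, I, M, R, T, Z) Spot Instance
# Requests" from the table above, assumed to be a vCPU count.
STANDARD_SPOT_QUOTA = 1152


def vcpus_needed(fleet):
    """Total vCPUs for a fleet given as {instance_type: instance_count}."""
    return sum(VCPUS[itype] * count for itype, count in fleet.items())


def fits_quota(fleet, quota=STANDARD_SPOT_QUOTA):
    """True if the fleet's vCPU total stays within the quota."""
    return vcpus_needed(fleet) <= quota


baseline = {"i3.large": 15, "a1.xlarge": 5}   # the "15+5" case
peak = {"i3.large": 150, "a1.xlarge": 50}     # the "150+50" case

if __name__ == "__main__":
    for name, fleet in (("baseline", baseline), ("peak", peak)):
        print(name, vcpus_needed(fleet), "vCPUs, fits:", fits_quota(fleet))
```

Under those assumptions, the peak case needs 500 vCPUs, which is below 1,152. Note that the quota is shared with every other user of the account in the same region, so the real headroom may be smaller.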

Metadata Update from @smooge:
- Issue assigned to kevin
- Issue tagged with: medium-gain, medium-trouble, ops

3 years ago