#716 Request for soft static uid/gid for slurm
Closed 6 years ago Opened 6 years ago by pkfed.

After several years of trying to get slurm into Fedora, I think we're finally
at the point where the new package request for slurm is close to being ready.

https://bugzilla.redhat.com/show_bug.cgi?id=1489668

https://slurm.schedmd.com/

Slurm is a resource manager typically used on large HPC systems such as Cray
and IBM Blue Gene. It consists of a number of interoperating daemons:

slurmd (compute node)
slurmctld (control node)
slurmdbd (accounting node)

and a suite of user-space programs to launch jobs and manage the slurm infrastructure.

Slurm uses munge for authentication and cryptographic services. Users who are able to
log in to a particular node in order to launch a job will have their uid/gid credentials
propagated to the compute nodes allocated to their jobs using munge.

Slurm/munge relies on uid/gid consistency across all nodes.

Typically, slurm deployments will use LDAP or some other centralized system to ease
the user management problem.

Slurm also requires that its inter-node RPC be configured to run either as root or as a
non-privileged user (for safety). That non-privileged user is called the SlurmUser.
The only requirement for it, as with all slurm user accounts, is that its uid be consistent
across all nodes. As the daemons talk to each other, communication is rejected if not done
either by root or by the uid that corresponds to the SlurmUser.
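
The acceptance rule described above can be sketched as a simple predicate. This only illustrates the rule as stated in this ticket and is not Slurm's actual implementation; the function name `rpc_sender_allowed` and the uid values used with it are hypothetical.

```python
ROOT_UID = 0  # uid of root on every POSIX system


def rpc_sender_allowed(sender_uid: int, slurm_user_uid: int) -> bool:
    """Return True if an inter-daemon RPC from sender_uid would be accepted.

    Per the rule described in this ticket: the sender must be root or
    have the uid that corresponds to the configured SlurmUser.
    """
    return sender_uid == ROOT_UID or sender_uid == slurm_user_uid
```

Because the comparison is made on the numeric uid, a SlurmUser whose uid differs from node to node would be rejected even though the account name matches.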

It would be safer to allocate a non-privileged slurm uid than to use root, so I structured
the rpm spec in that way. Note that dynamic allocation, where the slurm user is created
with no guarantee of cluster-wide uid consistency, is not an option.
That strategy absolutely will not work.

In the example below, the user test exists on all nodes, but with a different uid on each.
The local node performs the uid lookup and sends the uid_t into the infrastructure. The job
fails because the uids are not consistent.

[root@pf26vm2 ~]# srun -N3 --uid test whoami
srun: error: Task launch for 230.0 failed on node pf26vm3: User not found on host
srun: error: Task launch for 230.0 failed on node pf26vm4: User not found on host
srun: error: Application launch failed: User not found on host
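
A failure like the one above could be caught before launching jobs by comparing each node's name-to-uid mapping. The following is a minimal sketch, assuming the per-node mappings have already been collected somehow (for example from each node's passwd database); the function name `find_uid_mismatches` and the uid values in the example are hypothetical, and only the node names come from the output above.

```python
def find_uid_mismatches(node_passwd):
    """Report users whose uid is not identical on every node.

    node_passwd maps a node name to a {username: uid} dict.
    Returns {username: {node: uid_or_None}} for each inconsistent user;
    None means the user does not exist on that node.
    """
    users = set()
    for entries in node_passwd.values():
        users.update(entries)

    mismatches = {}
    for user in sorted(users):
        per_node = {node: entries.get(user)
                    for node, entries in node_passwd.items()}
        # More than one distinct value means the uid differs somewhere.
        if len(set(per_node.values())) > 1:
            mismatches[user] = per_node
    return mismatches


if __name__ == "__main__":
    # Hypothetical uids; "test" differs per node, "slurm" is consistent.
    cluster = {
        "pf26vm2": {"test": 1001, "slurm": 990},
        "pf26vm3": {"test": 1002, "slurm": 990},
        "pf26vm4": {"test": 1003, "slurm": 990},
    }
    for user, per_node in find_uid_mismatches(cluster).items():
        print(user, per_node)
```

An empty result for the SlurmUser account is the condition the daemons need in order to talk to each other.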

So we can either run as root everywhere or create a dedicated slurm user for the daemons
to run under and for the RPC to operate correctly. I think the latter would be safer.

Phil


After a discussion with upstream on this issue, there's a consensus that downstream packagers should make no assumptions about how users are managed on a slurm cluster. If LDAP-type tools are used, for example, then introducing a dedicated slurm user locally on each node is not the right thing to do, as it defeats the benefit of centralized management.

I am going to package slurm to run as root. Closing this issue.

Metadata Update from @pkfed:
- Issue status updated to: Closed (was: Open)

6 years ago
