After several years of trying to get slurm into Fedora, I think we're finally at the point where the new package request for slurm is close to being ready.
https://bugzilla.redhat.com/show_bug.cgi?id=1489668
https://slurm.schedmd.com/
Slurm is a resource manager used typically on large HPC systems such as Cray, IBM Blue Gene, etc. It consists of a number of interoperating daemons:
slurmd (compute node)
slurmctld (control node)
slurmdbd (accounting node)
and a suite of user-space programs to launch jobs and manage the slurm infrastructure.
Slurm uses munge for authentication and cryptographic services. Users who are able to log in to a particular node in order to launch a job will have their uid/gid credentials propagated, using munge, to the compute nodes allocated to their jobs.
Slurm/munge relies on uid/gid consistency across all nodes.
Typically, slurm deployments will use ldap or some other centralized system to ease the user management problem.
Slurm also requires that its inter-node RPC be configured to run either as root or as a non-privileged user (for safety). That non-privileged user is called the SlurmUser. The only requirement for it, as with all slurm user accounts, is that its uid be consistent across all nodes. As the daemons talk to each other, communication is rejected if not done either by root or by the uid that corresponds to the SlurmUser.
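The acceptance rule described above can be sketched in a few lines. This is only an illustration of the rule as stated here, not Slurm's actual code; the function name and the uid values are hypothetical.

```python
ROOT_UID = 0  # uid of root on any Unix-like node


def rpc_sender_allowed(sender_uid, slurm_user_uid):
    """Return True if a message from sender_uid would be accepted.

    slurm_user_uid is the uid that the *receiving* node associates with
    the SlurmUser (hypothetical parameter, for illustration only).
    """
    return sender_uid == ROOT_UID or sender_uid == slurm_user_uid


# Hypothetical uids: the sending node created the slurm user with uid 450,
# but the receiving node resolves the SlurmUser to uid 451.
print(rpc_sender_allowed(450, 450))  # consistent uids
print(rpc_sender_allowed(450, 451))  # mismatched uids
print(rpc_sender_allowed(0, 451))    # root
```

The second call shows why the uid has to match cluster-wide: the daemon on the receiving node compares numeric uids, so a slurm account that exists everywhere but with different uids is treated as an unauthorized sender.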
It would be safer to allocate a non-privileged slurm uid than to use root, so I structured the rpm spec that way. Note that dynamic allocation, where the slurm user is created with no guarantee of cluster-wide uid consistency, is not an option. That strategy absolutely will not work.
In the example below, the user test exists on all nodes but has a different uid on each. The local node performs the uid lookup and sends the resulting uid_t into the infrastructure. The job fails because the uids are not consistent.
[root@pf26vm2 ~]# srun -N3 --uid test whoami
srun: error: Task launch for 230.0 failed on node pf26vm3: User not found on host
srun: error: Task launch for 230.0 failed on node pf26vm4: User not found on host
srun: error: Application launch failed: User not found on host
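The failure above comes down to one user name mapping to different uids on different nodes. A minimal sketch of a consistency check follows, assuming the per-node name-to-uid data has already been collected into a dictionary. The node names are taken from the output above; the uid values and the function name are hypothetical.

```python
from collections import defaultdict


def find_inconsistent_users(node_passwd):
    """Return {user: {node: uid}} for every user whose uid differs
    between nodes or who is missing from at least one node.

    node_passwd maps node name -> {user name: uid}.
    """
    per_user = defaultdict(dict)
    for node, entries in node_passwd.items():
        for user, uid in entries.items():
            per_user[user][node] = uid

    inconsistent = {}
    for user, uids_by_node in per_user.items():
        differs = len(set(uids_by_node.values())) > 1
        missing = len(uids_by_node) < len(node_passwd)
        if differs or missing:
            inconsistent[user] = uids_by_node
    return inconsistent


# Hypothetical uid values: "test" exists everywhere but with a different
# uid on each node, while "slurm" is consistent.
cluster = {
    "pf26vm2": {"test": 1001, "slurm": 450},
    "pf26vm3": {"test": 1002, "slurm": 450},
    "pf26vm4": {"test": 1003, "slurm": 450},
}
print(find_inconsistent_users(cluster))
```

With this sample data only test is reported, which matches the srun failure: the uid looked up on pf26vm2 does not correspond to any user on pf26vm3 or pf26vm4.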
So we can either run as root everywhere or create a dedicated slurm user for the daemons to run under and for the RPC to operate correctly. I think the latter would be safer.
Phil
After a discussion with upstream on this issue, there's a consensus that downstream packagers should make no assumptions about how users are managed on a slurm cluster. If ldap-type tools are used, for example, then introducing a dedicated slurm user locally on each node is not the right thing to do as it defeats the benefit of centralized management.
I am going to package for use by root. Closing this issue.
Metadata Update from @pkfed: - Issue status updated to: Closed (was: Open)