#524 static UID for ceph
Closed: Fixed | Opened 4 years ago by ktdreyer.

Ceph is currently developing unprivileged user support, and the upstream developers are asking each of the distros to allocate a static numerical UID for the "ceph" user.

The reason for the static number is to allow Ceph users to hot-swap hard drives from one OSD server to another. Currently this practice works because Ceph writes its files as "uid 0", but when Ceph starts writing its files as an unprivileged user, it would be nice to allow administrators to unplug a drive, plug it into another computer, and have everything just work.


(Not sure what component to file this under, so I'm picking "Clarification").

This seems kind of bizarre. But also, the stated goal doesn't make much sense unless the distros all pick the same UID. Otherwise, if you want to swap drives between machines at an end-user site, that site could just fix a UID locally.

Upstream acknowledged that it's unlikely that all the distros will agree right off the bat, but it would be nice to make "all Fedora systems" agree, or "all RHEL systems" agree, etc.

Another way to look at it is that none of the other distros have selected a static UID yet for ceph. Fedora can set the standard.

We discussed this at today's meeting (https://lists.fedoraproject.org/pipermail/packaging/2015-April/010561.html):

  • 524 static UID for ceph (geppetto, 16:12:49)

  • LINK: https://fedorahosted.org/fpc/ticket/524 (geppetto, 16:12:55)
  • ACTION: A link/summary of the upstream discussion of how this will
    be used, might be nice. Currently split on how this is useful.
    (geppetto, 16:40:57)

Hi folks, thanks for taking the time to discuss this in the meeting today.

I read the meeting notes and I saw tibbs had a question about how this works. Quoting here:

{{{
<tibbs|w> If they're a filesystem, they could just specify the UID at mount time like FAT does.
<geppetto> I assume the ceph stuff needs write access to the files
<tibbs|w> Or is it some kind of layered thing where there's really ext4 underneath.
}}}

CephFS is a network filesystem, but that is only one part of Ceph. Ceph is fundamentally an object storage system (and CephFS is built on top of that). Ceph's "OSDs", Object Storage Daemons, are the piece that will be writing the object data to disks as an unprivileged user. And the OSDs write to a backing filesystem like btrfs, xfs, or ext4. I think geppetto pretty much covered this during the meeting, so thanks geppetto for doing that.

Regarding links to upstream discussion, some of the discussion has been on ceph-devel: http://permalink.gmane.org/gmane.comp.file-systems.ceph.devel/24294 (warning, the thread's a bit long)

Here is the static UID request from Debian: http://permalink.gmane.org/gmane.comp.file-systems.ceph.devel/24325 . The Debian maintainer (Gaudenz) mentioned off-list that he thinks Debian will probably allocate uid/gid 64045, and I've asked him for more details regarding this. I think it's because it's the next available one in https://anonscm.debian.org/cgit/users/cjwatson/base-passwd.git/tree/README

Replying to [comment:7 ktdreyer]:

The Debian maintainer (Gaudenz) mentioned off-list that he thinks Debian will probably allocate uid/gid 64045, and I've asked him for more details regarding this. I think it's because it's the next available one in https://anonscm.debian.org/cgit/users/cjwatson/base-passwd.git/tree/README

If that's the case then you're SOL for cross-distro compatibility, because all of our static UIDs will be in the very low range. (Static is under 200, isn't it?)

Did anyone think of having a shim to do a chown on the filesystem and get around this limitation entirely? Surely swapping disks between systems is uncommon enough that the cost of a recursive chown or setfacl is bearable?

Anyway, I'm +1 here so I'm not arguing against what Ceph is trying to do, but I wonder if someone who really cares about swapping between machines on a regular basis couldn't just solve this with ACLs (or fixing a UID locally). They'll have to do it anyway if they care about doing this cross-distro. As I said in the meeting, fixing a UID seems like the least creative solution here.

We will definitely need a shim that will chown -R when necessary (when the uid used on the disk does not match the 'ceph' user's uid on the host the disk was just plugged into). This will clearly trigger when you swap a disk between distros (e.g., when migrating a Debian ceph cluster to Fedora). The idea is we'll do that chown either from the udev trigger or a systemd prestart script, before we drop privileges and start the actual ceph-osd daemon as user 'ceph'.

However, the chown is obviously very slow and expensive. We want to avoid it in the common case where the disk is swapped between hosts using the same distro. Hence the static uids allocated for each distro. This will make disk swaps between hosts inexpensive for the vast majority of users.
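A minimal sketch of that fixup shim (illustrative only; the function name, paths, and wiring are assumptions, not the actual Ceph tooling):

```shell
#!/bin/sh
# Hypothetical fixup shim: chown -R an OSD data directory only when its
# on-disk owner differs from the uid we want. In the real service this
# would run from the udev trigger or a systemd ExecStartPre, before
# privileges are dropped, e.g.:
#   fix_owner /var/lib/ceph/osd/ceph-104 "$(id -u ceph)" "$(id -g ceph)"
fix_owner() {
    dir=$1 want_uid=$2 want_gid=$3
    have_uid=$(stat -c %u "$dir")
    if [ "$have_uid" != "$want_uid" ]; then
        # Expensive path: only taken when the disk last lived on a host
        # with a different 'ceph' uid (e.g. another distro).
        chown -R "$want_uid:$want_gid" "$dir"
    fi
}
```

The uid comparison makes the common case (a same-distro swap) a single stat instead of a full tree walk, which is exactly the property the static uid buys.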

Thanks!

Replying to [comment:11 lnussel]:

Maybe a good reason to bring this up again:
https://github.com/LinuxStandardBase/lsb/blob/master/documents/wip/userNaming.txt

So, Debian has a process for reserving UIDs in the range 60000-64999, Fedora is running out of reserved IDs < 200, and SUSE doesn't have a process for reserving specific UIDs. All distros seem to have the maximum UID for users as 59999 or 60000.

How about this then: Why don't we all (Fedora, SUSE) just use whatever ID comes out of the Debian process. If it's above 60000 it shouldn't conflict with anything else, and then we get the cross distro fixed UID we never thought we'd have ;-)

I don't agree with the statement that we are running out of UIDs/GIDs < 200 ... there are only ~120 reserved uids and a similar number of reserved gids ( cut -f2 /usr/share/doc/setup*/uidgid | grep '[0-9]' | sort -u | wc -l for uids). Yes, we should not assign them for every single system account, but there are still a lot of free uid/gid pairs.

You cannot assume to know what any particular site has assigned in the 60000+ range. We really shouldn't go changing our rules now. Maybe some of the RHEL people could tell us if they know of anything odd their customers might be doing in this area. (Not that it matters much to Fedora, but it might at least tell us some useful info.)

Note that Anaconda is (in F23) getting the functionality needed to place files in the filesystem before packages are installed, and pre-F23 there is a horrible but workable hack to do the same. My hope is that we can use this feature in combination with some modifications to the setup package to fix UIDs, but I need to talk to the setup maintainers about that.

Replying to [comment:14 tibbs]:

You cannot assume to know what any particular site has assigned in the 60000+ range. We really shouldn't go changing our rules now. Maybe some of the RHEL people could tell us if they know of anything odd their customers might be doing in this area. (Not that it matters much to Fedora, but it might at least tell us some useful info.)

With my Red Hat customer hat on, I can say that high UID range is not safe to claim in RHEL, so I'd advise against doing it in Fedora. Companies do use that range for their purposes.

We discussed this at this week's meeting (http://meetbot.fedoraproject.org/fedora-meeting-1/2015-04-23/fpc.2015-04-23-16.00.txt):

  • 524 static UID for ceph (geppetto, 16:26:41)

  • LINK: https://fedorahosted.org/fpc/ticket/524 (geppetto, 16:26:48)
  • ACTION: We don't see the point, given that Debian won't be
    compatible and you have fixup scripts. If the fixup scripts take an
    annoying amount of time to run, then we might reconsider.
    (geppetto, 16:47:13)

The fixup script will absolutely take an annoying amount of time to run... it is a chown -R on an entire disk (4T-6T today, and getting bigger)!! Just about the only way it could be any worse would be if we literally rewrote all the data too.

Matching uids with other distros is a "nice to have," but not a requirement. It should not be a consideration in this decision. Being able to migrate a disk from a failed server to another host (of the same distro) without waiting hours is a huge win for users and data availability. Please please please reconsider! I really don't want Fedora/RHEL to be the odd one out here.

(Saying almost the same as Sage, mid-air collision :)

Hi,

Debian compatibility is not the main point here (and because of the different ranges it could hardly be implemented anyway).

But Ceph needs a static UID in Fedora. Ceph is by definition a storage system which runs on multiple machines, and the OSDs (nodes which store data, using a dedicated fs on dedicated disks) must operate with the same UID (IOW, if you take one disk from one node and connect it to another, it must work without a massive chown -R over the filesystem). It is an uncommon operation, but it is used in reality. Relying on a fixup script is a very ugly workaround here and could even have some security implications.

Currently the Ceph OSD daemon runs as root, so it is not a problem yet. But upstream is already switching to running the daemons under a different non-root account, and thus it will need a static UID for accessing its own files.

Please reconsider your decision; to me this is a clear reason to assign a static UID (I mean from the Fedora range).

Thanks,
Milan

Ceph's OSDs store a lot of files. For example, on one of the drives on one of the OSD servers in our test lab:

{{{
burnupi62:~$ df -h /var/lib/ceph/osd/ceph-104/
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdb1       930G  802G  129G  87% /var/lib/ceph/osd/ceph-104

burnupi62:~$ time find /var/lib/ceph/osd/ceph-104/ | wc -l
1073463

real 26m7.581s
user 0m4.833s
sys 1m3.043s
}}}

Over a million files and directories on this small partition, and that 26 minutes is just reading, no chown writes yet.

I'm adding it back to the meeting, but why does it take that long? This is simple ext4 over external USB:

{{{
% sudo umount /mnt/backups
% sudo mount /mnt/backups
% time find /mnt/backups/backups/code-backup/Mail | wc -l
785425
find /mnt/backups/backups/code-backup/Mail 0.86s user 0.53s system 26% cpu 5.355 total
wc -l 0.02s user 0.06s system 1% cpu 5.354 total
% time find /mnt/backups/backups/code-backup | wc -l
2683280
[2] 14580 exit 1 find /mnt/backups/backups/code-backup |
14581 done wc -l
find /mnt/backups/backups/code-backup 2.70s user 7.20s system 8% cpu 1:57.35 total
wc -l 0.16s user 0.62s system 0% cpu 1:57.34 total

}}}

I can't explain why your find command was faster than mine, but I'm just trying to address the question in the meeting that came up last time regarding "how many files are on an OSD":

{{{
16:31:50 <tibbs|w> Now, how many backing files are typically on a ceph volume?
16:32:01 <tibbs|w> Here before it was mentioned that the files were huge.
16:32:14 <tibbs|w> Which means there probably aren't very many of them, and a chmod wouldn't take very long.
16:32:16 <geppetto> yeh, I thought they weren't small
16:32:24 <tibbs|w> If there are millions then I would be more concerned.
}}}

It's probable that the time of my "find" run above wasn't relevant so I apologize if that's the case. My point was that Ceph's OSDs are implemented using a lot of files across the drive.

Well it's up to you, but having correct data here would be a big win for you I think.

Like, if pulling a drive from one box and putting it in another is maybe going to eat 5 seconds, I don't see the need. If it would normally take a minute or two, I'm suspicious. But if it really is normally 20-30 minutes, then I would just approve it.

So:

Is a million just a random number you had on your dev. box, or is that realistic for a 1TB brick?

This scales linearly with brick size (10TB brick == 10 million files)?

Is that then just a generic mapping: Ceph files on a brick are 1MB, so divide size by that to get the number of files?

Is this configurable? (if so is 1MB the default?)

Does this change based on the usage of the Ceph brick, or is it preallocated?

Was your time an anomaly, and/or you were running something at the time?

Do you know if find/chown would normally scale linearly? (My timing didn't seem to, but I'm not sure if that was kernel, memory, vfs, or something else.)

The bricks are just normal files on ext4, right?

Jason mentioned that it should be possible to add an ACL to the top level mount point, have you looked at this?
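For what it's worth, the ACL idea might be sketched like this (a hypothetical helper; names and paths are illustrative, not Ceph's actual tooling). One caveat: a default ACL on the top-level directory only covers files created afterwards, so granting access to the existing files still needs a recursive walk over the whole tree, just like chown -R does:

```shell
#!/bin/sh
# Hypothetical sketch: grant a user access to an OSD tree via POSIX ACLs
# instead of changing ownership.
# Usage: grant_user_acl /var/lib/ceph/osd/ceph-104 ceph
grant_user_acl() {
    dir=$1 user=$2
    # Default ACL: files created under $dir later inherit the access.
    setfacl -m "d:u:$user:rwX" "$dir"
    # Existing files still require a full recursive pass.
    setfacl -R -m "u:$user:rwX" "$dir"
}
```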

The argument

"I don't see the need. If it would normally take a minute or two, I'm suspicious. But if it really is normally 20-30 minutes, then I would just approve it."

is something I really disagree with. Even if there are no files and this operation is a no-op, it makes sense to have a static ID here.

1) When you switch the disk, you have NO idea how long it will take to relabel. There can be one file or a million, there can be high utilization of the disk controller, etc. It brings additional non-deterministic behavior to the system.

2) If the files are owned by some other UID with a local account on the system, you are opening access to these files for some time. (Yes, there are probably other mechanisms which should prevent access, but even so it can be dangerous.)

3) If it is e.g. an iSCSI device, switching the disk between nodes is trivial; I really do not see a reason why it should repeatedly trigger an ownership change. This is a high-availability system and the switch should be fast, without any additional operations.
If it crashes in the middle of this operation, you have a very dangerous state which must be resolved manually. Why add this possibility when we have a simple solution? A static ID is nothing new; it is used for many other services.

I agree that if there is a problem with a slow chown -R, it should be solved, but this is not related to the reason Ceph needs a static ID; the speed of the ownership change is IMHO a red herring here.

Adding a mount option is filesystem-type dependent; it would need revisiting for every new filesystem Ceph will use (currently XFS, ext4, and btrfs, IIRC).

Please try to see it from the high-level design point of view: this is a distributed system used to build highly-available solutions, so the environment on all nodes should be the same and the processes to recover from a node failure should be very simple.

Thank you!

I agree with mbroz that the time isn't the real issue. But, to answer the question, a million files on a 1TB disk is totally representative. What is unrealistic is that most people deploy with 4TB disks today, not 1TB.

We discussed this at this week's meeting (http://meetbot.fedoraproject.org/fedora-meeting-1/2015-05-07/fpc.2015-05-07-16.01.txt):

Thanks all for accepting this!

I've filed https://bugzilla.redhat.com/1220846 today to track the UID/GID assignment in the setup package.
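For the record, a soft-static allocation like this is typically consumed in the package's spec with a scriptlet along these lines (a sketch of the usual Fedora pattern; the number 167 shown here is a placeholder for whatever uid/gid gets assigned in that bug, and the home directory/comment are illustrative):

```
%pre
# Create the ceph group and user with the reserved soft-static id;
# the getent checks keep this idempotent across upgrades.
getent group ceph >/dev/null || groupadd -r -g 167 ceph
getent passwd ceph >/dev/null || \
    useradd -r -u 167 -g ceph -d /var/lib/ceph -s /sbin/nologin \
            -c "Ceph daemons" ceph
exit 0
```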

Metadata Update from @rathann:
- Issue assigned to james

2 years ago
