#8964 docs too big to fit on proxies anymore
Closed: Fixed 3 years ago by kevin. Opened 3 years ago by kevin.

I'm guessing it's the addition of translations, but the docs have grown vastly in the last few days.

On May 23rd it was building on a ~10GB volume; I bumped it to 20GB.
On May 28th it filled the 20GB and I bumped it to 30GB.
It filled that and I bumped it to 50GB.

It just filled that.

Also, that's just the building side. After it's built, it syncs to all our proxies, some of which have a ~25GB / partition.

I have disabled the docs sync until we can sort this out.

Some options:

  • Find a way to reduce space somehow?

  • Stop syncing to proxies and serve docs from OpenShift, perhaps with CloudFront in front of them?

  • Serve docs from only some subset of proxies that we can grow space on (I don't like that though, as it makes some of them 'special')

  • Some other brilliant idea.

cc @bcotton @asamalik @pbokoc


cc @jibecfed too

Do we know what's taking up all of that space? I wouldn't expect the docs themselves to take up a lot of space. Are we duplicating images across all of the translated languages maybe?

So each translation tree after it is built is 245 MB in size and there are 39 trees. That is 10GB.
The problem seems to be when the [xx].building tree is created in the netapp share. Each one of these is itself over 10 GB in size. It looks like the contents of, say, te.building/ are every tree in /srv/docs all over again, so there is /te.building/ar and so on until it gets done. Then it deletes that /te.building and does the same thing in /tr.building.

Those deleted directories are saved in snapshots, which takes up disk space. The second problem is that the rsync to all the proxies copies those .building trees to each proxy, so they run out of space depending on when they did an rsync and what was in the tree at the time.

I am not sure why every tree is getting rebuilt 40 times but that seems to be a bug :smile:

Here is what I know:

My technical knowledge of this doesn't allow me to understand smooge's comment. I don't really see the point of syncing the ".building" folders. It should probably be a temporary folder in memory.
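For what it's worth, keeping those in-progress trees off the proxies could be as simple as an rsync exclude. A minimal sketch, assuming the directories are always named [xx].building and the sync uses the same module and paths as the command further down in this ticket:

/usr/bin/rsync -aSHPv --delete --delete-excluded --exclude='*.building/' sundries01::docs/ /srv/web/docs.fedoraproject.org/    # skip any in-progress [xx].building directory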

Metadata Update from @smooge:
- Issue priority set to: Waiting on Assignee (was: Needs Review)
- Issue tagged with: groomed, high-gain, high-trouble, websites-general

3 years ago

We have added translations which I guess caused it. It worked perfectly fine in staging, so I wasn't expecting something like this. Sorry for causing this.

A very quick workaround, so docs can be building again, is to disable the translated part. That could be done by removing this cron-job (infra ansible link).

Then we can work on how to fix this properly and get the translated part back again.

So each translation tree after it is built is 245 MB in size and there are 39 trees. That is 10GB.
The problem seems to be when the [xx].building tree is created in the netapp share. Each one of these is itself over 10 GB in size. It looks like the contents of, say, te.building/ are every tree in /srv/docs all over again, so there is /te.building/ar and so on until it gets done. Then it deletes that /te.building and does the same thing in /tr.building.
Those deleted directories are saved in snapshots, which takes up disk space. The second problem is that the rsync to all the proxies copies those .building trees to each proxy, so they run out of space depending on when they did an rsync and what was in the tree at the time.
I am not sure why every tree is getting rebuilt 40 times but that seems to be a bug 😄

That would definitely be a bug, yes, let me look at that!

I have fixed the issue @smooge has described. Thanks for noticing!

Each language got built once as it should, but the copying went all wrong.

It builds onto a local partition. The script then copies it out to the mounted partition from which it's deployed — those are the [xx].building directories. And finally, it deletes the old trees and moves the new ones in their place. Well, at least that was the idea.

But it's fixed now, it'll be doing what it's supposed to be doing, and the final size will be much smaller.

(Right now there is still an old pod running with the old script. If someone could please kill it (I don't have perms in prod), or once it's done, the next one will get it right.)

Also, there will be a temporary copying.tmp directory created when copying the results. Can we configure the sync to exclude it? That would save space on the mirrors in cases when the times of the sync and the copying overlap.

If the directory name is stable, it is easy to exclude it. I was trying this:

/usr/bin/rsync -aSHPv --delete --delete-excluded --exclude=.git/objects/ --exclude='*-building/' sundries01::docs/ /srv/web/docs.fedoraproject.org/

which was not working as well as I hoped.

If the directory name is stable, it is easy to exclude it. I was trying this:
/usr/bin/rsync -aSHPv --delete --delete-excluded --exclude=.git/objects/ --exclude='*-building/' sundries01::docs/ /srv/web/docs.fedoraproject.org/

which was not working as well as I hoped.

Yes, it is, and there's only one now: copying.tmp.

...well, will be, once the pod I mentioned above terminates.

Cron job should exclude copying.tmp now.

So, one proxy (proxy05) has a lot less disk space than the others, and we have been considering dropping it for a while now; if we do that, we have more space to work with.

I also note that on every build all the *.html files change (I guess because they list the build time in them?). Would it be possible to just put the site build time in a single file and include it (to avoid syncing everything every time)?
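A rough sketch of that idea, assuming the only per-build difference in otherwise unchanged pages is the embedded timestamp (the file name below is made up for illustration):

date -u '+%Y-%m-%d %H:%M:%S UTC' > public/last-build.txt    # write the build time once, to a single small file

The generated pages could then reference that one file instead of embedding the timestamp, so unchanged .html files keep identical content between builds and rsync skips them.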

Some of the ways we were doing the sync didn't preserve hardlinks correctly.

The translation build seems to update timestamps on the index files (see above), all the directories, and also all the images... so they get transferred every time. Can we perhaps have it preserve timestamps if nothing has changed?

Looking at the script at https://pagure.io/fedora-docs/docs-fp-o/blob/prod/f/build-scripts/rebuild-site.py can we change any 'cp -r' there to 'cp -al' ? That should copy with hard links... and preserve timestamps and such.
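For illustration, the difference between the two (paths here are made up; note that -l only works when source and destination are on the same filesystem, since hard links can't cross devices):

cp -r  output/te /srv/docs/te     # copies the data: every language tree gets its own copy of each image and PDF
cp -al output/te /srv/docs/te     # -a preserves attributes and timestamps, -l creates hard links instead of copying the data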

I now have our sync script using checksums instead of timestamps, ignoring directory timestamps (only looking at files), and trying to preserve hardlinks.

However, it still seems to break hardlinks and/or transfer more than it should. :( I've committed the changes I have now, and we can revisit or others can investigate from there...
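The exact invocation lives in ansible, but the combination described above would look roughly like this (-c compares checksums instead of timestamps, -O skips directory mtimes, -H preserves hard links); a sketch, not the actual script:

rsync -aHPc -O --delete sundries01::docs/ /srv/web/docs.fedoraproject.org/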

Code was fixed with cp -al instead of -r.
Execution time was reduced again from 2h30 to 35 minutes. Thank you :)
Please re-enable the cron job in production to see the impact. I hope it will be enough.

So, I was too fast.
The full size currently is 9.7 GB for all languages, and about 248 MB per language:

[Screenshot attachment: Capture_decran_de_2020-06-15_18-34-41.png]

Looking at it, we can see that some folders are all the same, namely those in the Fedora administrator guide:

[Screenshot attachment: Capture_decran_de_2020-06-15_18-34-55.png]

With the Fedora administrator guide pictures not being optimized, we have 3.6 GB from this alone. We could easily reduce it to 2 GB with optipng.

We also lose 1.1 GB because of a 27.4 MB PDF, https://docs.fedoraproject.org/en-US/fedora-silverblue/_attachments/container-commandos.pdf, that we duplicate 40 times.

Is it worth the effort to save this space?
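If it does turn out to be worth it, a rough sketch of batch-optimizing the PNGs with optipng (lossless recompression; the optimization level is illustrative, and $SOURCE_DIR is a placeholder for wherever the English source images live, so they get optimized before the per-language copies are made):

find "$SOURCE_DIR" -name '*.png' -exec optipng -o5 {} +    # recompress every PNG in place, losslessly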

Code was fixed with cp -al instead of -r

-l applies to symlinks, -H to preserve hardlinks

Sorry I was thinking of rsync.

I doubt anyone's going to be translating PDFs in general anytime soon, so how about updating the script to exclude anything that ends with .pdf? We'd have to make sure any links in the translated versions point to the English PDF, otherwise it's going to give 404s.

Actually. How about we move the PDFs into the repo root, instead of their current location in modules/ROOT/attachments, and update the docs to link to the repo (in this case, to github)? That should exclude them from the script while ensuring they're still accessible, and the link will stay the same in every localized version since it's pointing outside the site, right?

Also, I'll go check the sysadmin guide pics and see if I can cut them down in size.

I doubt anyone's going to be translating PDFs in general anytime soon, so how about updating the script to exclude anything that ends with .pdf? We'd have to make sure any links in the translated versions point to the English PDF, otherwise it's going to give 404s.

We copy assets (images, PDFs, etc.) because we don't translate any of these.
Copying the content is the easiest way we found to avoid complexity, because Antora doesn't provide internationalization features.

Actually. How about we move the PDFs into the repo root, instead of their current location in modules/ROOT/attachments, and update the docs to link to the repo (in this case, to github)? That should exclude them from the script while ensuring they're still accessible, and the link will stay the same in every localized version since it's pointing outside the site, right?

Yep, it may be useful but it adds complexity and makes our system less autonomous.
Maybe it is easier to just wait for Antora upstream support of internationalization.

Also, I'll go check the sysadmin guide pics and see if I can cut them down in size.

I can do that too, but @kevin has to tell us whether it is useful, because he says the default configuration was 20 GB and we are at less than 50% of that.
If it takes a few more minutes to sync, that is fine.

So right now, something is still messing up the hardlinks... I need to sit down and figure out what that is and if we can solve it in the sync script.

With hardlinks I think things are manageable now, but without them they grow pretty big. I'll try and debug this and see how things look in the next few days.

Ok, I understand there is no need to optimize images for now, and that we need to wait for you to dig into it.

@pbokoc: I have other tiny patches to apply to the whole documentation; I can include image optimization too, but I'll ask the Antora team first if there is a way to reduce duplication and optimize files at publication time (css, js, png, jpg, svg, pdf, etc.).

So, I rolled out the new image with the cp -al... On the plus side it saves a bunch of space.
On the downside, it spews a ton of errors:

cp: cannot create hard link '/antora/output/copying.tmp/public/zh_TW/sitemap-java-packaging-howto.xml' to '/tmp/tmpdmnxkx5c/docs_repo/public/zh_TW/sitemap-java-packaging-howto.xml': Invalid cross-device link
cp: cannot create hard link '/antora/output/copying.tmp/public/zh_TW/sitemap-localization.xml' to '/tmp/tmpdmnxkx5c/docs_repo/public/zh_TW/sitemap-localization.xml': Invalid cross-device link
cp: cannot create hard link '/antora/output/copying.tmp/public/zh_TW/sitemap-mentored-projects.xml' to '/tmp/tmpdmnxkx5c/docs_repo/public/zh_TW/sitemap-mentored-projects.xml': Invalid cross-device link
cp: cannot create hard link '/antora/output/copying.tmp/public/zh_TW/sitemap-mindshare.xml' to '/tmp/tmpdmnxkx5c/docs_repo/public/zh_TW/sitemap-mindshare.xml': Invalid cross-device link
cp: cannot create hard link '/antora/output/copying.tmp/public/zh_TW/sitemap-mindshare-committee.xml' to '/tmp/tmpdmnxkx5c/docs_repo/public/zh_TW/sitemap-mindshare-committee.xml': Invalid cross-device link
cp: cannot create hard link '/antora/output/copying.tmp/public/zh_TW/sitemap-minimization.xml' to '/tmp/tmpdmnxkx5c/docs_repo/public/zh_TW/sitemap-minimization.xml': Invalid cross-device link
cp: cannot create hard link '/antora/output/copying.tmp/public/zh_TW/sitemap-modularity.xml' to '/tmp/tmpdmnxkx5c/docs_repo/public/zh_TW/sitemap-modularity.xml': Invalid cross-device link

and the site at the end seems to go to the old docs site?

Perhaps it would be easier to move the cp back to 'cp -a', and then at the very end, when all sites are in place on the final disk, run 'hardlink -v /path' on them?

So I'm a little confused about using hardlinks here...

The script uses cp to get the content from a local partition where it's built to a mounted partition from which it is published. How would hardlinks help here? It's two different devices.

Also, because the copying takes some time, it goes into a temporary directory so the sync won't happen mid-build. When it's all copied over, it uses mv to get the files to the right place.

When the website is in place, there are no two similar files we could use hardlinks for. Even in cases where one page is basically the same for multiple languages, because it hasn't been translated yet, it has the language selector set to its own language. So they're never the same. Also, the menus (built out of other pages) can be translated.

Am I missing something here?

Images, PDFs, and other files can all be hardlinked after the move. This brings down the disk space greatly.

Yeah, images, svg's?

Right now the output is... a single index.html file and a bunch of directories. ;(

So, can you remove the 'cp -al' (move back to just 'cp -a')?

Then, at the very end, run a 'hardlink -v' only on the final output dir?
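A minimal sketch of that two-step approach ($BUILD_DIR stands in for the local build directory seen in the errors above; the output path is taken from them as well):

cp -a "$BUILD_DIR"/public/. /antora/output/copying.tmp/public/    # plain copy across filesystems, no hard-link attempt
hardlink -v /antora/output/copying.tmp/public                     # then collapse identical files (images, PDFs) into hard links in place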

For reference, at least the CoreOS docs pages are now returning 403: https://docs.fedoraproject.org/en-US/fedora-coreos/.

For reference, at least the CoreOS docs pages are now returning 403: https://docs.fedoraproject.org/en-US/fedora-coreos/.

That was related to the datacenter move, not this issue.

Also, looks like we're rebuilding again:

Last build: 2020-06-22 03:42:10 UTC

Wooo!

Metadata Update from @kevin:
- Issue assigned to kevin

3 years ago

ok, we are getting close to having this fixed. ;)

The hardlink works great and saves a ton of space for the localization things.

My next question is about timing.

In ansible we have the default en 'cron' running every hour at :00, the translation version cron also running every hour at :00, and the rsync from the docs volume to the proxies at :10 after the hour.
In real life for some reason we have:

NAME              SCHEDULE     SUSPEND   ACTIVE    LAST SCHEDULE   AGE
cron              42 * * * *   False     0         20m             10d
cron-translated   49 * * * *   False     1         13m             10d

And the rsync cron job off (due to this issue).

Cron runs are taking about 3min
cron-translated runs are taking about 56m
the rsync varies, but 3-5min

So, what should we have here? Should the cron run and finish, then cron-translated run, and then the rsync? Can the rsync run at any time, or must it run when a cron is not in progress to get a stable copy?

Well, splitting English and translated builds makes sense for testing, but in production it only makes sense if we consider there is a priority between languages.
Once in production, if either English or translated content is broken, we should fix it with the same level of priority.

One single cron with BUILD_LANGS=all would make more sense to me.

I'm unsure about the rsync question, as I don't really know what the impact of running the rsync too early would be.

And here is a question about the schedule: do we really need hourly builds? This consumes a lot of resources. Our fedora websites are published once a day, and it has been working fine for years.

Well, splitting English and translated builds makes sense for testing, but in production it only makes sense if we consider there is a priority between languages.

Agreed.

And here is a question about the schedule: do we really need hourly builds? This consumes a lot of resources. Our fedora websites are published once a day, and it has been working fine for years.

Need? No. But the more frequently we can build docs, the better. Daily would be too infrequent unless there's a compelling constraint. I'm open to something like every other hour, but I'd rather see it hourly unless that's causing problems. Since each build takes nearly an hour, going shorter doesn't make sense until we can shorten the build time.

Well, the translation script (which converts the English adoc to pot files, and the po files to translated adoc) is run once per day.

This translation script takes about 40 minutes to execute.

Whatever we decide, we need to synchronize the jobs.

ok. I re-enabled the job at :55 after the hour. That seems to be working.

If we need to adjust the times on the jobs, just file a new ticket or re-open this one.

We were even able to re-enable our low disk space instance. :)

Thanks for all the work here folks...

Metadata Update from @kevin:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

3 years ago

I'm reading conflicting details between this ticket (where @pbokoc says the 403 is expected due to datacenter move) and a mail thread on Fedora devel@ (where @kevin says it isn't the datacenter move but a docs pipeline issue, presumably this one?).

This ticket has been closed, but the docs pages are still returning 403. Is there another ticket tracking this that I'm missing?

I'm reading conflicting details between this ticket (where @pbokoc says the 403 is expected due to datacenter move) and a mail thread on Fedora devel@ (where @kevin says it isn't the datacenter move but a docs pipeline issue, presumably this one?).
This ticket has been closed, but the docs pages are still returning 403. Is there another ticket tracking this that I'm missing?

The docs were working for most of yesterday; now they're partially broken (right now I see page content, but CSS and images don't load). I haven't seen a thread about this on devel@ (could you post the subject line so I can search for it?), but it really shouldn't have anything to do with this issue as far as I'm aware. https://status.fedoraproject.org/ shows that docs are indeed being moved right now, so things breaking in various ways is expected.

You can use https://status.fedoraproject.org/ to monitor the status of various Fedora services, including docs.

Oh sorry, the actual message was this one which states "it should be back up for everyone".

Myself and other people are still seeing 403 on docs today though: https://github.com/coreos/fedora-coreos-tracker/issues/550.

I received the same feedback from translators and can experience it myself. Let's hope it is indeed related to the data center move.

I wouldn't think the DC move would cause a 403. Are the ownership and permissions getting set correctly on the output post-copy?
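One quick thing to check on an affected proxy, assuming the docroot path used earlier in this ticket, would be whether the synced content ended up readable by the web server user:

ls -ld /srv/web/docs.fedoraproject.org/ /srv/web/docs.fedoraproject.org/en-US/fedora-coreos/    # look for unexpected owner or mode after the rsync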

Is there any way we could make the translations just get generated for things that changed since the last run? If I run make on a huge software development project it has some intelligence to only build what changed since last time. I imagine 99% of things don't change every hour, but when you do want a change, a lot of times you want it now. For example, we were running a test day a few weeks ago and there were some docs changes we realized we needed on that day so each person didn't hit the same problem.

It's not the datacenter move.

It's likely some timing issue here. ;( Will look.

Metadata Update from @kevin:
- Issue status updated to: Open (was: Closed)

3 years ago

Yes @dustymabe, we could use the git log https://pagure.io/fedora-docs/translated-sources/commits/
but we need to figure out how to do it in a reliable manner, which will again take a lot of time. I would like to publish this...
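As a rough illustration of the git-log idea (assuming a local clone of the translated-sources repository linked above; the one-day window is arbitrary), something like this would list which files changed since the previous run:

git -C translated-sources log --since='1 day ago' --name-only --pretty=format: | sort -u    # unique paths touched in the last day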

@kevin: would you please keep one build per hour for English and one build per day for the translated version?
The translated content publication should happen after this cron: Cron _update_docs_trans@sundries01 /usr/local/bin/lock-wrapper cron-docs-translation-update "/usr/local/bin/docs-translation-update"

I think docs-translation-update is launched at 22:00 UTC (I receive the email at 00:00 in the Europe/Paris time zone).
Launching the translated content publication at 23:00 UTC would make sure it's fine (docs-translation-update takes 40 minutes, without any optimization).

ok. I've done the following (see the schedule sketch after this list):

  • made the docs-rsync script idempotent (ie, if nothing changes on sundries, it doesn't change anything on proxies either).
  • made the docs-rsync script run at :55 after the hour.
  • made the en docs build run at :50 after the hour (it finishes usually in about 3min or so)
  • made the translation build run at 23:00 UTC once per day.
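Put together as a crontab-style sketch (times taken from the list above; the real definitions live in ansible and OpenShift and may be expressed differently):

50 * * * *    en docs build              # ~3 minutes
55 * * * *    docs rsync to the proxies  # ~3-5 minutes, a no-op when nothing changed on sundries
0 23 * * *    translated docs build      # once per day, after the 22:00 UTC docs-translation-update run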

I have confirmed that all proxies are updated and appear to work now.

Everything should be working now... if it is not, please re-open this ticket... if you see a 403 or something, please try to look at the headers and see what server IP you hit, so I can look at that specific proxy for issues.

Metadata Update from @kevin:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

3 years ago

thanks a lot!

Jean-Baptiste
