#1061 [spike] : investigating export of moin wiki pages to static content
Closed: Fixed with Explanation 8 months ago by arrfab. Opened a year ago by arrfab.

Based on the CentOS Docs SIG meeting day in Brussels: we can try to run a parallel wiki instance, extract static HTML pages from it, and then decide whether to keep them as wiki archives.
Some links :

https://git.autistici.org/ale/crawl/
https://github.com/iipc/warc2html


Metadata Update from @zlopez:
- Issue priority set to: Waiting on Assignee (was: Needs Review)
- Issue tagged with: investigation

a year ago

Metadata Update from @arrfab:
- Issue assigned to arrfab

a year ago

It's not possible to assign multiple "assignees" on a pagure ticket but, for awareness, I'll work on this with @dcavalca to produce a PoC and then propose it as an archive solution for the Docs SIG (based on a discussion with @shaunm).

Deployed two ec2 instances for this, and @dcavalca will be able to test things.
The moin role will be applied on the first one (c7 ec2 instance) and actual anonymized data imported.
The second ec2 host (c9s) is there to test the export to static html files.

I have an export solution that seems to work well and is currently chewing through the existing content. Assuming this goes well, I should have something ready for polishing sometime tomorrow.

Export was successful and I have a preliminary archive up at https://dcavalca.gitlab.io/wiki-archive (sources: https://gitlab.com/dcavalca/wiki-archive).

Had a quick look and it seems good enough (for a PoC).

@dcavalca , @shaunm ^ is it worth sending a mail to the centos-docs list now to discuss the plan and point to the PoC? (Ideally we can use the other wiki-archives node for this instead of a personal page on gitlab.io, but that's fine for a PoC.)

hey @dcavalca and @shaunm : I'm tempted to close this ticket and reclaim the deployed ec2 instances that were only used as a PoC for this.
I guess the plan is now to start a thread on the dedicated centos-docs list and list the actions there?

I talked to @shaunm about this yesterday, he's going to post to centos-docs@ soon. The next steps here would be to disable edits on the wiki, do another dump and put it on the dev instance, and do another crawl to catch any edits that happened since then. Once that's done we can start productionizing the static archive.

Well, I'd say this ticket was only about the spike/PoC. What remains is explaining the plan and when you want to see it go live: the sooner the wiki is offline, the better, but ideally let's reach a consensus through a thread on the centos-docs list :-)

@dcavalca , @shaunm : any feedback on the spike? Can I shut down the nodes that were used for the test? Is there a documented process for archiving the wiki, and when can we proceed?

Yes, this was discussed in the latest board meeting and I forgot to post an update here. Here's the game plan:

  • lock edits on the wiki
  • take another snapshot and update the copy on wiki.dev
  • re-run the scraper so we have an updated archive
  • update https://gitlab.com/dcavalca/wiki-archive
  • repoint wiki.centos.org to the static archive
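
Since the plan re-runs the scraper after edits are locked, it may help to see what actually changed between the PoC archive and the refreshed one before updating the repository. The following is only a minimal sketch, assuming both crawls are available as plain directories of static files on disk; the function names `snapshot` and `diff_archives` are hypothetical and not part of the wiki-archive sources.

```python
import hashlib
from pathlib import Path


def snapshot(root):
    """Map each file path (relative to root) to the SHA-256 of its bytes."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in root.rglob("*")
        if p.is_file()
    }


def diff_archives(old_root, new_root):
    """Return (added, removed, changed) lists of relative paths, each sorted."""
    old, new = snapshot(old_root), snapshot(new_root)
    added = sorted(new.keys() - old.keys())
    removed = sorted(old.keys() - new.keys())
    changed = sorted(k for k in old.keys() & new.keys() if old[k] != new[k])
    return added, removed, changed
```

Calling `diff_archives("old-crawl", "new-crawl")` (both directory names are placeholders) gives three lists that can be reviewed before committing. Pages that embed a generation timestamp would always show up as changed, so the result is a starting point for review rather than an exact list of wiki edits.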

@dcavalca , @shaunm : just revisiting open tickets and wondering if we can schedule this at some point and announce the migration.
Another thought: why not import that content into a git repository (on git.centos.org)? Even when it is no longer "moin" powered, we could still update content through PRs or plain git commits. I'm thinking of SIGs that haven't opted in to the other doc system and would need to reflect a simple change in a hurry.

As discussed in today's board meeting (https://git.centos.org/centos/board/issue/91), let's move forward with this. We need infra to lock edits on the wiki and take another snapshot and update the copy on wiki.dev.

Then I'll re-run the scraper so we have an updated archive and update https://gitlab.com/dcavalca/wiki-archive with it. We can either serve that from gitlab, or host it on a CentOS instance (your call); once that's settled infra will need to repoint wiki.centos.org to wherever the static archive is hosted.
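
Before wiki.centos.org is repointed, one way to reduce the risk of broken links is to check that a list of known wiki URL paths would still be served by the static archive. This is a rough sketch under stated assumptions: the on-disk layout of the archive is not described in this ticket, so the lookup order below (exact file, then `<path>.html`, then `<path>/index.html`) is a guess at common static-hosting behaviour, and `resolve_static` and `missing_paths` are hypothetical names.

```python
from pathlib import Path


def resolve_static(archive_root, url_path):
    """Return the archive file that would serve url_path, or None if none exists.

    Assumed lookup order: exact file, <path>.html, <path>/index.html.
    """
    root = Path(archive_root)
    rel = url_path.strip("/")
    if ".." in rel.split("/"):
        return None  # never look outside the archive root
    candidates = [root / rel, root / (rel + ".html"), root / rel / "index.html"]
    for candidate in candidates:
        if candidate.is_file():
            return candidate
    return None


def missing_paths(archive_root, url_paths):
    """Return the URL paths that have no matching file in the archive."""
    return [p for p in url_paths if resolve_static(archive_root, p) is None]
```

The list of URL paths could come from the old wiki's page index or web server logs. Which source is practical depends on what infra can export, which is not settled here.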

We should also preserve the moinmoin backup somewhere before sunsetting the old instance, so that we can potentially improve the static archive in the future if we find a better way to convert it.
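
If the moinmoin backup is kept for a long time, recording a checksum next to it makes it possible to confirm later that the file is still intact. A minimal sketch, assuming the backup is a single file such as a tarball; the helper names `sha256_file` and `verify` are hypothetical.

```python
import hashlib


def sha256_file(path, chunk_size=1024 * 1024):
    """Return the hex SHA-256 of a file, reading it in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected_hex):
    """Return True if the file's SHA-256 matches a previously recorded value."""
    return sha256_file(path) == expected_hex.strip().lower()
```

The recorded value can be stored alongside the backup or in the ticket that tracks the migration, and `verify` can be re-run whenever the backup is copied or restored.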

@dcavalca : thanks for the update. I'm diving into afk/pto mode myself, so can we revisit this in August? I'll have to redeploy tmp ec2 instances, as the previous ones were automatically discarded by duffy ci (they were there for tmp tests and not supposed to remain online for a long time, but they are easy/fast to redeploy).

Sure, we can do this when you get back. Thanks!

As the investigation is done and the PoC was a success, closing this one; we'll track the prod migration in #1245.

Metadata Update from @arrfab:
- Issue close_status updated to: Fixed with Explanation
- Issue status updated to: Closed (was: Open)

8 months ago
