#1055 Fedora Search Engine

Created 7 years ago by mmcgrath

So Fedora needs a search engine. Here are the requirements as I see them:

  • Crawl the websites
  • Search the websites

Preferences:

  • Python based
  • Allows programmable keywords [1]
  • Has some sort of XML or library interface so other applications can use it

[1] Allow us to have control over what pages get displayed for certain keywords
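
To make footnote [1] concrete, here is a minimal sketch of one way "programmable keywords" could work: a hand-maintained table of pinned pages that are placed ahead of whatever the search engine returns. This is only an illustration, not a feature of any candidate engine; the `PINNED` table, the `apply_pinned` function, and the example URL are all hypothetical.

```python
from typing import Dict, List

# Hypothetical, hand-maintained table: keyword -> pages to show first.
# The URL below is a placeholder for illustration only.
PINNED: Dict[str, List[str]] = {
    "download": ["https://fedoraproject.org/example-download-page"],
}


def apply_pinned(query: str, organic: List[str],
                 pinned: Dict[str, List[str]] = PINNED) -> List[str]:
    """Return results with pinned pages for any query keyword placed first."""
    promoted: List[str] = []
    for word in query.lower().split():
        for url in pinned.get(word, []):
            if url not in promoted:
                promoted.append(url)
    # Keep the engine's own ordering for the rest, skipping pages
    # that were already promoted so nothing is listed twice.
    return promoted + [url for url in organic if url not in promoted]
```

Keeping the table outside the engine would mean the pinned pages survive a switch to a different backend, at the cost of maintaining the list by hand.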

Something I'd like to see for each appropriate candidate is how much storage it takes up. Also, there is no need to code this ourselves.

We need something that can search more than the wiki. It needs to index fedorahosted.org, fedorapeople.org, and fedoraproject.org.
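
As a rough sketch of what that crawl scope means in practice, a crawler could check each URL's host against an allow-list of the three domains and their subdomains. This is an assumption about how scoping might be done, not how any of the candidate engines is configured; `ALLOWED_DOMAINS` and `in_scope` are hypothetical names.

```python
from urllib.parse import urlsplit

# The three domains named in this ticket; subdomains are treated as in scope.
ALLOWED_DOMAINS = ("fedorahosted.org", "fedorapeople.org", "fedoraproject.org")


def in_scope(url: str, allowed=ALLOWED_DOMAINS) -> bool:
    """Return True if the URL's host is an allowed domain or a subdomain of one."""
    host = (urlsplit(url).hostname or "").lower()
    # Match on a dot boundary so "notfedoraproject.org" is not accepted.
    return any(host == domain or host.endswith("." + domain) for domain in allowed)
```

Matching on subdomains means a host such as docs.fedoraproject.org falls inside the scope without a separate entry.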

There are http://www.mnogosearch.org/, http://www.dataparksearch.org/, and http://crawler.archive.org/

Sadly, none are Python; they are either Java or C. I've not found a Python one yet.

There is also Perl, which is neither Java nor C. mnoGoSearch and DataparkSearch were already on the wiki status page in Comment 6. We can add Heritrix and note that it's written in Java.

KinoSearch, Namazu, OpenFTS, and Plucene are Perl. KinoSearch and Namazu appear to be actively maintained. OpenFTS has a Python interface.

In the meantime, reassigning this ticket to me.

Replying to [comment:7 ausil]:

We need something that can search more than the wiki. It needs to index fedorahosted.org, fedorapeople.org, and fedoraproject.org.

It also needs to index docs.fedoraproject.org.

Publican, which generates the structure of the documentation site, can incorporate a search form into the navigation menus that it maintains for each language.

FWIW, over on sourceware.org / sources.redhat.com / gcc.gnu.org, we run mnogosearch against the local web sites. It works okay. I believe these servers are in the same colocation facility as fedora*, so we could do a trial run without too much fuss.

What is the status with this project?

We now have a dev instance of dpsearch set up at: https://search-dev.fedoraproject.org/search.cgi

It's crawling docs now. Feedback welcome.

Has there been any progress on this since?

We had a dev instance, but it got extremely slow, so we reaped it.

I'd really like to see us try again and see if we can figure out what went wrong.

I think this really has to be revisited. I will take it up and reintroduce it on the mailing list. Too many factors have changed since then. We need to remain relevant. We will need to come up with both short term and long term goals.

Closing as not fixed. Please file a new ticket or reopen this one if someone wants to move this forward again.
