#11 Fuzzy search functionality in locate
Opened 14 years ago by salviati. Modified 6 years ago

I have modified locate.c so that locate can now do fuzzy (substring) searches (both case sensitive and insensitive) via parameter -F (or --fuzzy) d, where d is the Levenshtein distance.


Thanks for the patch.

I'm not sure many users will be able to use the feature (try explaining Levenshtein distance to the average user :) ), but I'd be happy to apply a clean patch.

Could you do at least the following changes, please?

  • Variables with values related to string lengths (including the computed distance and the configuration variable) should have type size_t.
  • Replace MIN3 by open-coding the two necessary comparisons - MIN3 as written does more comparisons than necessary.
  • (haystack, needle, n, m) would be better (s1, s1, s1_len, s2_len) I think (*_len is easily matched to the correct string, and naming them haystack/needle does not make sense if you interchange them.
  • Can you easily make the code generic enough to avoid interchanging needle and haystack? If not, please exchange them manually instead of doing a recursive call.
  • Avoid variable-length arrays - best would probably be to use a "static struct obstack" that is reset after each use.
  • The way "p" is set to 0 is obfuscated - and "p" and "j" can be combined.
  • Can the two levenshtein_distance_* functions be defined by one macro, to make sure any algorithm changes are done on both versions?
  • If conf_match_fuzzy makes conf_patterns_simple irrelevant, this should be documented with the variables and conf_patterns_simple should not even be computed. This also implies that conf_uppercase_patterns needs to be computed in all necessary cases.
  • Related to that - locate(1) should reject regexps or glob patterns together with --fuzzy. Define the rules for determining the matching algorithm once, and apply them consistently throughout the code.
  • The argument to -F needs to be rigorously verified - see how the -l argument is handled.
  • Can you prepare a man page patch as well?
  • Please put all declarations in a block before statements.
  • Please adhere to GNU coding standards: newline between if() and body, {} braces around a body of if/while that contains an if/while, space between "(", etc.

Replying to [comment:1 mitr]:

Thanks for the patch.

I'm not sure many users will be able to use the feature (try explaining Levenshtein distance to the average user :) ), but I'd be happy to apply a clean patch.

Could you do at least the following changes, please?

  • Variables with values related to string lengths (including the computed distance and the configuration variable) should have type size_t.
  • Replace MIN3 by open-coding the two necessary comparisons - MIN3 as written does more comparisons than necessary.
  • (haystack, needle, n, m) would be better (s1, s1, s1_len, s2_len) I think (*_len is easily matched to the correct string, and naming them haystack/needle does not make sense if you interchange them.
  • Can you easily make the code generic enough to avoid interchanging needle and haystack? If not, please exchange them manually instead of doing a recursive call.
  • Avoid variable-length arrays - best would probably be to use a "static struct obstack" that is reset after each use.
  • The way "p" is set to 0 is obfuscated - and "p" and "j" can be combined.
  • Can the two levenshtein_distance_* functions be defined by one macro, to make sure any algorithm changes are done on both versions?
  • If conf_match_fuzzy makes conf_patterns_simple irrelevant, this should be documented with the variables and conf_patterns_simple should not even be computed. This also implies that conf_uppercase_patterns needs to be computed in all necessary cases.
  • Related to that - locate(1) should reject regexps or glob patterns together with --fuzzy. Define the rules for determining the matching algorithm once, and apply them consistently throughout the code.
  • The argument to -F needs to be rigorously verified - see how the -l argument is handled.
  • Can you prepare a man page patch as well?
  • Please put all declarations in a block before statements.
  • Please adhere to GNU coding standards: newline between if() and body, {} braces around a body of if/while that contains an if/while, space between "(", etc.

Hi,

Thanks for welcoming the patch :) I tried to handle the issues, but I kinda left some stuff out:
I left MIN3 as it is, 'cause I couldn't think of writing a similar macro without introducing a temporary variable. I don't think it's a disaster to have one extra cmp+jmp in 2/3 cases either.
Substring Levenshtein wasn't supposed to be symmetric anyway --I don't remember what I was think back then. I simply removed the recursive call and made sure the ordering is correct.
I tried to combing wchar & char versions, though it now looks kinda ugly
Why not VLA? I replaced them with xmalloc calls anyway
That pesky p is gone :)
I haven't really used GNU coding style at all, so I simply filtered it through indent with -gnu switch.

All other stuff is done.
BTW, -R option wasn't documented in man page or help(), so I added that in as well.

BTW, fuzzy search and regex search are mutually exclusive for now, but I'm considering to add approximate regex as well.

What do you think about introducing some more low-level controls, such as fine-tuning a weights of insert, delete, substitude for calculating Levensthein distance, maybe via long-only-options?

Free (under BSD license) fuzzy regex library tre let's such parameters to be tuned already, and we could easily add these to --no-tre version as well.

Thanks for the update, I'll try to clean it up a bit and apply it in a few days.

As for more low-level controls - I don't think that fits the Unix way of doing things. You can combine separate tools, e.g. (locate / | fuzzy_grep --any-detailed-option), easily enough. (Even the simple --fuzzy option is a bit borderline.)

I don't think ''that'' is the UNIX way.
Following that logic, you be removing the regex and globbing as well since we have grep, and add a frontend script that filters the whole database via grep. They are, you see, identical to fuzzy option in that sense.

Here's a way to look at this. We have a library, a re-usable code which is in a sense another "module", just like the external agrep is. And since agrep offers these options, why should we?

Ahem, why should'''n't''' we :)

Metadata Update from @salviati:
- Issue assigned to mitr

7 years ago

Metadata Update from @mitr:
- Assignee reset

6 years ago

Login to comment on this ticket.

Metadata