readme.txt in Fuzzy Search 6
Fuzzy Search Module Read Me
Project Home: http://drupal.org/project/fuzzysearch
=== Installation ===
To install this module simply:
1. Install as usual, see http://drupal.org/node/70151 for further information.
2. Configure permissions for which roles can: Search content, Administer the
modules, View scoring information, and View debugging information.
3. At admin/build/block put the "Fuzzy search form" into a region in your theme.
4. Configure the module at /admin/settings/fuzzysearch.
5. Run cron until your site is 100% indexed.
6. Consider using a stopwords file to keep common words from bloating your
index. See fuzzysearch/stopwords/README.txt
7. Set up a regular cron job to keep your site fully indexed.
=== What is indexed? ===
Currently this module indexes all filtered node content, taxonomy terms
associated with the nodes, CCK text fields associated with a node, comments left
on the node, and any text returned by implementations of hook_nodeapi() when
$op is 'update index'.
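
As a rough illustration of the last item, a module can hand extra text to the
indexer through Drupal 6's hook_nodeapi(). The module name "mymodule" and the
$node->mymodule_keywords property are made up for this sketch; only the hook
signature and the 'update index' operation come from Drupal 6 itself.

```php
<?php
/**
 * Implementation of hook_nodeapi() for a hypothetical module "mymodule".
 */
function mymodule_nodeapi(&$node, $op, $a3 = NULL, $a4 = NULL) {
  if ($op == 'update index') {
    // Hypothetical property: replace with whatever extra text your module
    // stores for the node. The returned string is indexed with the node.
    return isset($node->mymodule_keywords) ? $node->mymodule_keywords : '';
  }
}
```

For every other operation the function returns nothing, so it does not
interfere with normal node handling.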
=== Fuzzysearch Index Settings ===
--Any time you change an index setting you must reindex your site for
the change to take effect.--
* You can choose the ngram length. This is the size of the chunks words are
broken into on indexing and searching. The default value is 3. The lower the
value, the more results (and more noise) you will get. Also, a lower value will
increase the size of your fuzzysearch_index table.
* Nodes to index per cron run: Adjust this lower if PHP is timing out during
cron and Fuzzysearch is the culprit.
* HTML tag scoring settings: A score is assigned to each of the tags listed. You
can adjust this to your preference. If, for example, links are especially
important, raise the "a" tag score. If you don't want any extra importance for
titles, set the "h1" tag to 0.
* Rebuild index: Check this to requeue all nodes for indexing without clearing
the index.
* Rebuild and clear index: Check this to requeue all nodes for indexing and
clear the index. Searching will be incomplete until all nodes are once again
indexed.
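
To picture how the ngram length setting changes what gets stored, the sketch
below splits one word at several lengths. example_ngrams() is an illustrative
helper, not a Fuzzysearch function, and it assumes single-byte text; the
module's own splitting code may differ.

```php
<?php
// Illustrative helper: split a word into overlapping chunks of $length.
function example_ngrams($word, $length = 3) {
  $ngrams = array();
  for ($i = 0; $i + $length <= strlen($word); $i++) {
    $ngrams[] = substr($word, $i, $length);
  }
  return $ngrams;
}

// Shorter ngrams produce more rows per word, hence a larger index table.
foreach (array(2, 3, 4) as $length) {
  print $length . ': ' . implode(', ', example_ngrams('apple', $length)) . "\n";
}
```

With length 2 'apple' yields four chunks, with length 3 three, and with
length 4 two, which is why a lower value grows the index.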
=== Fuzzysearch Search Settings ===
Assume missing letters in search terms:
A search term as entered by a user may be missing letters. If you want to search
for longer words than the user has entered, you can increase this number. In
English, for example, you need at least 1 if you want a singular search term to
also return its plural ending in "s". 0 means the term will not return longer
words in the results. 1 means a 4-letter search term will also check 5-letter
words in the index.
Assume extra letters in search terms:
A search term as entered by a user may have extra letters. If you want to search
for shorter words than the user has entered, you can increase this number. In
English, for example, you need at least 1 if you want a plural search term ending
in "s" to also return the singular word. 0 means the term will not return shorter
words in the results. 1 means a 5-letter search term will also check 4-letter
words in the index.
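
Taken together, the two settings above can be read as a window of word lengths
around the search term. The helper below is a hypothetical sketch of that
reading, not the module's query logic.

```php
<?php
// Hypothetical helper: the shortest and longest indexed word lengths that a
// search term would be compared against, given the two settings.
function example_length_window($term, $missing = 0, $extra = 0) {
  $length = strlen($term);
  return array(
    // "Assume extra letters" lets shorter indexed words match.
    'min' => max(1, $length - $extra),
    // "Assume missing letters" lets longer indexed words match.
    'max' => $length + $missing,
  );
}

print_r(example_length_window('bone', 1, 0));
print_r(example_length_window('bones', 0, 1));
```

Both calls give a window of 4 to 5 letters, matching the singular and plural
examples in the settings descriptions.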
Minimum completeness:
When indexed, each ngram is saved with the percentage of the word it belongs to.
"app" is 33.33% of "apple" because it is one ngram out of three (app, ppl, ple).
When searching, Fuzzy search lets you specify a minimum sum of percentages of
the ngrams it finds in each node. A lower number lets in more noise. Enter a
value between 0 and 100 to set the completeness required in the returned results.
It is best to set this value 10 points below your ideal minimum percentage. So
if you wanted to match results with at least 50% of the word matching, set this
value to 40. The match is calculated per indexed word and not by the search
phrase, ensuring that matches are relevant to the words in the phrase and not
just all the letter combinations in the phrase.
Also note that when a phrase matches more than a single word, the completeness
can be higher than 100%. This is because the completeness of each word is summed,
and results are then sorted by that sum as a measure of accuracy.
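
The completeness arithmetic can be sketched for a single indexed word as
follows. Both functions are illustrative assumptions based on the description
above, not code from the module, and they assume single-byte text.

```php
<?php
// Illustrative helper: split a word into overlapping chunks of $length.
function example_split($word, $length = 3) {
  $parts = array();
  for ($i = 0; $i + $length <= strlen($word); $i++) {
    $parts[] = substr($word, $i, $length);
  }
  return $parts;
}

// Sum the percentage each matched ngram contributes to the indexed word.
function example_completeness($term, $indexed_word, $length = 3) {
  $word_ngrams = example_split($indexed_word, $length);
  if (empty($word_ngrams)) {
    return 0.0;
  }
  $term_ngrams = example_split($term, $length);
  // Each ngram is worth an equal share of the word, e.g. 33.33% of 'apple'.
  $share = 100 / count($word_ngrams);
  $total = 0.0;
  foreach ($word_ngrams as $ngram) {
    if (in_array($ngram, $term_ngrams)) {
      $total += $share;
    }
  }
  return $total;
}

// 'aple' shares only 'ple' with 'apple', so it reaches about one third.
print round(example_completeness('aple', 'apple'), 2) . "\n";
```

With a minimum completeness of 40 this misspelling would be rejected; with 30
it would be returned.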
* You can filter results output by node type. This does not affect search
indexing, so you don't have to reindex if you change this.
* Checking the "Display scoring" checkbox is helpful for debugging when you are
trying to fine tune score modifiers. It will output completeness and score
values under each of the returned results.
=== Fuzzysearch Display Settings ===
* Search results path:
Choose the search results path, for example: search/results. Do not use leading
or trailing slashes. The path must be unique in your site.
* Sort by score
If selected, the results will be sorted by score first and completeness second,
which can make tag scores even more important. The default is to sort by
completeness first. You may want to try this if you find high scoring nodes
being pushed down in the results below lower scoring nodes with higher
completeness.
* Checking the "Display debugging information" checkbox can also help you
understand how Fuzzysearch queries the index. You'll see the query and the
ngrams and also the regex used when highlighting misspelled or partial words.
* Result excerpt length:
Set the length of the displayed text excerpt surrounding a found search term.
Applies per found term.
* Maximum result length:
Set the maximum length of the displayed result. Set to 0 for unlimited length.
Applies per result.
* Minimum spelling score:
Fuzzysearch tries to highlight search terms that may be misspelled. You can
adjust the accuracy by setting a minimum threshold, which is calculated as a
ratio of ngram hits to misses in a term. 0 may cause a misspelling to highlight
everything, and 100 will only highlight exact terms, so no misspellings are
highlighted. Enter a value between 0 and 100.
This works by replacing bad (misspelled, missing letters, extra letters) ngrams
with a wildcard. It is possible to get false matches. For example, searching for
"rendition" will also highlight "condition" if your spelling score is low
enough. However, these kinds of matches are likely to have lower completeness
and to be sorted toward the bottom of your results if your search term exists in
your content.
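
The "Sort by score" option in this section can be pictured with a small
two-key sort. The result rows and field names below are invented for the
sketch, and using score as the tie-breaker in the default ordering is an
assumption; the module's own query may order results differently. The closure
syntax needs PHP 5.3 or later.

```php
<?php
// Made-up result rows; the field names are assumptions for illustration.
$results = array(
  array('nid' => 1, 'score' => 12, 'completeness' => 80),
  array('nid' => 2, 'score' => 3, 'completeness' => 100),
  array('nid' => 3, 'score' => 12, 'completeness' => 95),
);

// Sort descending by one key, breaking ties with the other.
function example_sort_results(array $results, $sort_by_score = FALSE) {
  $first = $sort_by_score ? 'score' : 'completeness';
  $second = $sort_by_score ? 'completeness' : 'score';
  usort($results, function ($a, $b) use ($first, $second) {
    foreach (array($first, $second) as $key) {
      if ($a[$key] != $b[$key]) {
        return $a[$key] < $b[$key] ? 1 : -1;
      }
    }
    return 0;
  });
  return $results;
}

foreach (example_sort_results($results, TRUE) as $row) {
  print 'node ' . $row['nid'] . "\n";
}
```

Sorted by completeness first, node 2 leads despite its low score. With "Sort
by score" enabled, nodes 3 and 1 move above it.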
=== Fuzzy Search Blocks ===
* Fuzzy search form: This block provides the form where users will enter the
search terms.
* Fuzzy search title query: The Drupal 6 version provides a block that performs a
fuzzysearch on a query in the path. This may be performance intensive, so use
with caution. If the fuzzysearch query is in the path, the block will return
search matches of node titles. It's up to you to put the query in your path like
this:
http://example.com/node/add/question?fuzzysearch=arthritis%20knees%20pain
This would be good for similar content blocks, or to suggest existing content
before letting the user create new content.
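
One way to build such a link in plain PHP is sketched below. The search terms
and the node/add/question path are just the example values from above;
PHP_QUERY_RFC3986 (PHP 5.4 and later) encodes spaces as %20. Inside Drupal 6
you would more likely pass the 'query' option to url() instead.

```php
<?php
// Example terms, e.g. taken from a title the user has started typing.
$terms = 'arthritis knees pain';

// Encode the terms into a "fuzzysearch" query string parameter.
$query = http_build_query(array('fuzzysearch' => $terms), '', '&', PHP_QUERY_RFC3986);
$url = 'http://example.com/node/add/question?' . $query;

print $url . "\n";
```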
=== Theming ===
The module provides a template, fuzzysearch-result.tpl.php, that
you can copy to your theme folder and modify. This affects the search results
page. There are some theme functions you can override to theme the fuzzysearch
block, and you can also override block.tpl.php.
=== About Fuzzy Search ===
This module provides a fuzzy matching search engine for nodes.
Nodes are indexed when the site's cron job is run. The module automatically
queues a node for indexing once it is submitted, updated or a comment has been
made on it. Nodes can also be queued for reindexing by other modules when the
function fuzzysearch_reindex($nid, $module) is called, where $nid is the nid of
the node to have reindexed and $module is a string containing an identifier of
the module calling for the node to be reindexed.
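
A minimal sketch of such a call from another module follows. The module name
"mymodule" and the wrapper function are hypothetical; only
fuzzysearch_reindex($nid, $module) comes from this document.

```php
<?php
// Hypothetical helper in a module named "mymodule": queue several nodes for
// reindexing and report how many were queued.
function mymodule_requeue_nodes(array $nids) {
  $queued = 0;
  // Guard the call so nothing breaks when Fuzzysearch is not enabled.
  if (function_exists('fuzzysearch_reindex')) {
    foreach ($nids as $nid) {
      fuzzysearch_reindex($nid, 'mymodule');
      $queued++;
    }
  }
  return $queued;
}
```

The nodes are only queued by this call; they are actually reindexed on the
next cron run.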
Fuzzy matching is implemented by using qgrams. Each word in a node is split
into strings of 3 letters (the default length), so 'apple' is indexed as the 3
smaller strings 'app', 'ppl', 'ple'. As a result, a node is returned as long as
your search matches a given percentage of the word (configurable in the admin
settings). One issue inherent in this method arises when a user searches for a
word like 'athens', which contains the word 'the'; that match has a
completeness of 100%. To account for this, the qgrams of longer words must
match qgrams from words of similar length.
This is an imperfect solution but it does a good job of returning the most
relevant results.
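
The 'athens' case can be reproduced in a few lines. The helper and the length
tolerance of 1 are assumptions made for this sketch; they are not taken from
the module's code.

```php
<?php
// Illustrative helper: split a word into overlapping 3-letter qgrams.
function example_qgrams($word, $length = 3) {
  $qgrams = array();
  for ($i = 0; $i + $length <= strlen($word); $i++) {
    $qgrams[] = substr($word, $i, $length);
  }
  return $qgrams;
}

$search = 'athens';
$indexed = 'the';

// 'the' is one of the qgrams of 'athens', so the indexed word 'the' is
// matched completely.
$shared = array_intersect(example_qgrams($indexed), example_qgrams($search));
$completeness = 100 * count($shared) / count(example_qgrams($indexed));

// Assumed tolerance: only compare words whose lengths differ by at most 1.
$max_difference = 1;
$similar_length = abs(strlen($search) - strlen($indexed)) <= $max_difference;

print "completeness: $completeness\n";
print 'similar length: ' . ($similar_length ? 'yes' : 'no') . "\n";
```

The completeness is 100, but the length check fails, so under this assumption
'the' would not count as a match for 'athens'.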
=== Fuzzysearch Submodules ===
Fuzzysearch comes with the following example submodules:
1. fuzzysearch_filter_example
When enabled, this module provides an example of how to use
hook_fuzzysearch_filter().
See the API section below for information about this hook.
=== Fuzzysearch API ===
=== hook_fuzzysearch_score($op, $node) ===
This hook allows other contributed modules to modify the score of any
node being indexed. This affects nodes, not words. Site administrators can then
set how important these modifiers are to their particular site's use. Changing
the modifier score to 0 means that the modification being returned by that
particular module will have no effect on the scoring of the nodes on the site.
Setting the modifier to 10 means it will have maximum effect.
This simple example code from a contributed module implementing the scoring hook
returns a score multiplier of 5 if the node author is user 1. Any time a node is
changed fuzzysearch will apply the modifiers on the next cron run. You must
reindex your site to affect existing nodes, or resave the nodes.
/**
 * Implementation of hook_fuzzysearch_score
 * @param $op 'settings' returns array with information about the module (seen
 * in the admin settings form) 'index' returns a score modifier to the node
 * being indexed.
 */
function custom_fuzzysearch_score($op, $node) {
  switch ($op) {
    case 'settings':
      $info[] = array(
        'id' => 'user_1',
        'title' => t('Author is User 1'),
        'description' => t('This multiplier lets you increase the score of nodes authored by user 1.'),
      );
      return $info;
      break;
    case 'index':
      $score = $node->uid == 1 ? 5 : 0;
      $scores[] = array(
        'id' => 'user_1',
        'score' => $score,
      );
      return $scores;
  }
}
=== hook_fuzzysearch_index($node) ===
Before fuzzysearch indexes a node, other modules have the chance to change the
node or prevent it from being indexed. If a module implements this hook and
returns FALSE the node will not be indexed. If it already existed in the index
it will be removed. Modules should check that they have a node object to work
with, as another module may have already returned FALSE.
Changes returned to the node object are only reflected in the fuzzysearch index,
not in the node as saved in the database.
Some uses for this hook include preventing a node type from being indexed,
boosting a node's search score for certain words, or adding additional text to
a node. This is slightly different than hook_nodeapi's update index operation in
that you can replace or change parts of the node rather than just adding text.
Example code of a contributed module implementing the indexing hook:
// Prevent private nodes from being indexed by fuzzy search.
function custom_fuzzysearch_index($node) {
  if (!is_object($node) || $node->type == 'private') {
    return FALSE;
  }
  else {
    return $node;
  }
}
=== hook_fuzzysearch_filter($op, $text) ===
hook_fuzzysearch_filter($op, $text) gives modules an opportunity to filter the
text before it is indexed and/or searched. The common use for this is
to do more complicated filtering than is allowed by the stop words text files.
$op == 'index' will filter words on indexing of content. $op == 'search' will
filter the search terms before the index is searched for results.
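
A possible implementation is sketched below. It assumes the hook returns the
filtered text, which this document does not state; check the
fuzzysearch_filter_example submodule for the actual contract. The filtering
rules themselves are invented for illustration.

```php
<?php
/**
 * Implementation of hook_fuzzysearch_filter() for a hypothetical module
 * named "custom". Assumes the filtered text is the return value.
 */
function custom_fuzzysearch_filter($op, $text) {
  switch ($op) {
    case 'index':
      // Keep purely numeric tokens, such as years or ids, out of the index.
      $text = preg_replace('/\b\d+\b/', ' ', $text);
      break;

    case 'search':
      // Drop site-specific noise words from what the user typed.
      $text = preg_replace('/\b(drupal|module)\b/i', ' ', $text);
      break;
  }
  // Collapse the whitespace left behind by the removals.
  return trim(preg_replace('/\s+/', ' ', $text));
}
```

A pattern-based rule like the numeric one is the kind of filtering a plain
stop words list cannot express.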
=== About The Author ===
Drupal 6 version maintained by awolfey.
This module was created for Drupal as part of Google Summer of Code 2007 by
Blake Lucchesi www.boldsource.com blake@boldsource.com