Fuzzy Search Module Read Me
Project Home: http://drupal.org/project/fuzzysearch

=== Installation ===

To install this module simply:
1. Install as usual, see http://drupal.org/node/70151 for further information.
2. Configure permissions for which roles can: Search content, Administer the
   module, View scoring information, and View debugging information.
3. At admin/build/block put the "Fuzzy search form" into a region in your theme.
4. Configure the module at /admin/settings/fuzzysearch.
5. Run cron until your site is 100% indexed.
6. Consider using a stopwords file to keep common words from bloating your
   index. See fuzzysearch/stopwords/README.txt
7. Set up a regular cron job to keep your site fully indexed.

=== What is indexed? ===

Currently this module indexes all filtered node content, taxonomy terms
associated with the nodes, CCK text fields associated with a node, comments left
on the node, and any text returned by implementations of hook_nodeapi() when
$op is 'update index'.

=== Fuzzysearch Index Settings ===

--Any time you change an index setting you must reindex your site for
the change to take effect.--

* You can choose the ngram length. This is the size of the chunks words are
broken into on indexing and searching. The default value is 3. The lower the
value, the more results (and more noise) you will get. Also, a lower value will
increase the size of your fuzzysearch_index table.

* Nodes to index per cron run: Lower this value if PHP is timing out during
cron and Fuzzysearch is the culprit.

* HTML tag scoring settings: A score is assigned to each of the tags listed. You
can adjust these to your preference. If, for example, links are especially
important, raise the score of the a tag. If you don't want any extra importance
for titles, set the h1 tag to 0.

* Rebuild index: Check this to requeue all nodes for indexing without clearing 
the index.

* Rebuild and clear index: Check this to requeue all nodes for indexing and
clear the index. Searching will be incomplete until all nodes are once again
indexed.

=== Fuzzysearch Search Settings ===

Assume missing letters in search terms:
A search term as entered by a user may be missing letters. If you want to search
for words longer than the one the user entered, increase this number. In
English, for example, you need at least 1 to match a plural word ending in "s"
from a singular search term. 0 means the term will not match longer words. 1
means a 4-letter search term will also be checked against 5-letter words in the
index.

Assume extra letters in search terms:
A search term as entered by a user may have extra letters. If you want to search
for words shorter than the one the user entered, increase this number. In
English, for example, you need at least 1 to match a singular word from a plural
search term ending in "s". 0 means the term will not match shorter words. 1
means a 5-letter search term will also be checked against 4-letter words in the
index.
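
The two settings above can be read as a window of word lengths around the
search term. The following is a minimal sketch of that reading, not the module's
own code: the function name example_length_window() and its return format are
made up for illustration, and the way the module combines the two settings
internally is an assumption.

```php
<?php
/**
 * Illustrative only: compute the range of indexed word lengths that a search
 * term would be compared against, given the two "assume letters" settings.
 * Uses strlen(), so it assumes single-byte text.
 */
function example_length_window($term, $missing = 1, $extra = 1) {
  $length = strlen($term);
  return array(
    // "Extra letters" lets the term match shorter indexed words.
    'min' => max(1, $length - $extra),
    // "Missing letters" lets the term match longer indexed words.
    'max' => $length + $missing,
  );
}
```

With $missing = 1 and $extra = 0, the 4-letter term "pain" would be compared
against indexed words of 4 or 5 letters.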

Minimum completeness:
When indexed, each ngram is saved with the percentage of the word it belongs to.
"app" is 33.33% of "apple" because it is one ngram out of three (app, ppl, ple).
When searching, Fuzzy search lets you specify a minimum sum of percentages of
the ngrams it finds in each node. A lower number lets in more noise. Enter a
value between 0 and 100 to set the completeness required in the returned results.

It is best to set this value 10 points below your ideal minimum percentage. So
if you wanted to match results with at least 50% of the word matching, set this
value to 40.  The match is calculated per indexed word and not by the search
phrase, ensuring that matches are relevant to the words in the phrase and not
just all the letter combinations in the phrase.

Also note that when a phrase matches more than a single word, the completeness
can be higher than 100%. This is because the completeness of each word is
summed, and results are then sorted by that sum as a measure of accuracy.
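
The completeness arithmetic described above can be sketched as follows. This is
a simplified illustration, not the module's indexing or query code: the helper
names example_ngrams() and example_completeness() are hypothetical, the sketch
ignores the word-length settings, and it uses strlen()/substr(), so it assumes
single-byte text.

```php
<?php
/**
 * Illustrative only: split a word into overlapping ngrams.
 */
function example_ngrams($word, $length = 3) {
  $count = strlen($word) - $length + 1;
  if ($count < 1) {
    // In this sketch, words shorter than the ngram length are kept whole.
    return array($word);
  }
  $ngrams = array();
  for ($i = 0; $i < $count; $i++) {
    $ngrams[] = substr($word, $i, $length);
  }
  return $ngrams;
}

/**
 * Illustrative only: sum the percentages of the indexed word's ngrams that
 * also occur in the search term.
 */
function example_completeness($term, $indexed_word, $length = 3) {
  $word_ngrams = example_ngrams($indexed_word, $length);
  $term_ngrams = example_ngrams($term, $length);
  // Each ngram carries an equal share of its word, e.g. 33.33% for "apple".
  $share = 100 / count($word_ngrams);
  $total = 0;
  foreach ($word_ngrams as $ngram) {
    if (in_array($ngram, $term_ngrams)) {
      $total += $share;
    }
  }
  return round($total, 2);
}
```

Here the misspelled term "aple" shares only the ngram "ple" with the indexed
word "apple", so it would pass a minimum completeness of 30 but not one of 40.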

* You can filter results output by node type. This does not affect search
indexing, so you don't have to reindex if you change this.

* Checking the "Display scoring" checkbox is helpful for debugging when you are
trying to fine-tune score modifiers.  It will output completeness and score
values under each of the returned results.

=== Fuzzysearch Display Settings ===

* Search results path:
Choose the search results path, for example: search/results. Do not use leading
or trailing slashes. The path must be unique in your site.

* Sort by score
If selected, the results will be sorted by score first and completeness second,
which can make tag scores even more important. The default is to sort by
completeness first. You may want to try this if you find high scoring nodes
being pushed down in the results below lower scoring nodes with higher
completeness.

* Checking the "Display debugging information" checkbox can also help you
understand how Fuzzysearch queries the index. You'll see the query and the
ngrams and also the regex used when highlighting misspelled or partial words.

* Result excerpt length:
Set the length of the displayed text excerpt surrounding a found search term.
Applies per found term.

* Maximum result length:
Set the maximum length of the displayed result. Set to 0 for unlimited length.
Applies per result.

* Minimum spelling score:
Fuzzysearch tries to highlight search terms that may be misspelled. You can
adjust the accuracy by setting a minimum threshold, which is calculated as a
ratio of ngram hits to misses in a term. 0 may cause a misspelling to highlight
everything, and 100 will highlight only exact terms, so no misspellings are
highlighted. Enter a value between 0 and 100.

This works by replacing bad (misspelled, missing letters, extra letters) ngrams
with a wildcard. It is possible to get false matches. For example, searching for
"rendition" will also highlight "condition" if your spelling score is low
enough. However, these kinds of matches are likely to have lower
completeness and be sorted to the bottom of your results if your search term
exists in your content.

=== Fuzzy Search Blocks ===

* Fuzzy search form: This block provides the form where users will enter their
search terms.

* Fuzzy search title query: The Drupal 6 version provides a block that performs a
fuzzysearch on a query in the path. This may be performance intensive, so use
with caution. If the fuzzysearch query is in the path, the block will return
search matches of node titles. It's up to you to put the query in your path like
this:

http://example.com/node/add/question?fuzzysearch=arthritis%20knees%20pain

This would be good for similar content blocks, or to suggest existing content
before letting the user create new content.
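
A query like the one in the URL above can be built by encoding the terms
with PHP's rawurlencode(), which turns spaces into %20. This is a minimal
sketch; the helper name example_title_query_path() is hypothetical and is not
part of the module.

```php
<?php
/**
 * Illustrative only: append a fuzzysearch query to a path so the
 * "Fuzzy search title query" block has something to search for.
 */
function example_title_query_path($path, $terms) {
  // rawurlencode() encodes spaces as %20, matching the example URL.
  return $path . '?fuzzysearch=' . rawurlencode($terms);
}
```

For example, example_title_query_path('node/add/question', 'arthritis knees pain')
produces the path and query used in the URL above.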

=== Theming ===

The module provides a template, fuzzysearch-result.tpl.php, that
you can copy to your theme folder and modify. This affects the search results
page. There are some theme functions you can override to theme the fuzzysearch
block, and you can also override block.tpl.php.

=== About Fuzzy Search ===

This module provides a fuzzy matching search engine for nodes.
Nodes are indexed when the site's cron job is run.  The module automatically
queues a node for indexing once it is submitted, updated, or commented on.
Other modules can also queue a node for reindexing by calling
fuzzysearch_reindex($nid, $module), where $nid is the nid of the node to
reindex and $module is a string identifying the calling module.
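
As a minimal sketch of calling this function from another module: the module
name "custom" and the wrapper custom_requeue_node() are hypothetical, and the
function_exists() guard is an added precaution for sites where Fuzzy Search may
not be enabled.

```php
<?php
/**
 * Illustrative only: queue a node for reindexing from a module named "custom".
 *
 * @return
 *   TRUE if the node was handed to Fuzzy Search, FALSE if the module's
 *   function is not available.
 */
function custom_requeue_node($nid) {
  // fuzzysearch_reindex() is only defined when Fuzzy Search is enabled.
  if (!function_exists('fuzzysearch_reindex')) {
    return FALSE;
  }
  fuzzysearch_reindex($nid, 'custom');
  return TRUE;
}
```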

Fuzzy matching is implemented using qgrams (also called ngrams).  Each word in a
node is split into overlapping strings of 3 letters (the default length), so
'apple' is indexed as 3 smaller strings: 'app', 'ppl', 'ple'.  As long as your
search matches a configurable percentage of the word (see Minimum completeness
in the admin settings), the node will be pulled up in the results.  One issue
inherent in this method arises when a user searches for a word like 'athens',
which contains the word 'the': the indexed word 'the' is matched with a
completeness of 100%.  To account for this, qgrams from a search term must match
qgrams from indexed words of a similar length.  This is an imperfect solution
but it does a good job of returning the most relevant results.

=== Fuzzysearch Submodules ===

Fuzzysearch comes with the following example submodules:

1. fuzzysearch_filter_example
   When enabled, this module provides an example of how to use
   hook_fuzzysearch_filter().

   See the API section below for information about this hook.

=== Fuzzysearch API ===

=== hook_fuzzysearch_score($op, $node) ===

This hook allows other contributed modules to modify the score of any
node being indexed. This affects nodes, not words. Site administrators can then
set how important these modifiers are to their particular site's use. Changing
the modifier score to 0 means that the modification being returned by that
particular module will have no effect on the scoring of the nodes on the site.
Setting the modifier to 10 means it will have maximum effect.

This simple example code from a contributed module implementing the scoring hook
returns a score multiplier of 5 if the node author is user 1. Any time a node is
changed fuzzysearch will apply the modifiers on the next cron run. You must
reindex your site to affect existing nodes, or resave the nodes.

/**
 * Implementation of hook_fuzzysearch_score().
 *
 * @param $op
 *   'settings' returns an array with information about the modifier (shown in
 *   the admin settings form); 'index' returns a score modifier for the node
 *   being indexed.
 * @param $node
 *   The node being indexed.
 */
function custom_fuzzysearch_score($op, $node) {
  switch ($op) {
    case 'settings':
      $info[] = array(
        'id' => 'user_1',
        'title' => t('Author is User 1'),
        'description' => t('This multiplier lets you increase the score of nodes authored by user 1.'),
      );
      return $info;

    case 'index':
      // Nodes authored by user 1 get a modifier of 5; all others get 0.
      $score = $node->uid == 1 ? 5 : 0;
      $scores[] = array(
        'id' => 'user_1',
        'score' => $score,
      );
      return $scores;
  }
}

=== hook_fuzzysearch_index($node) ===
Before fuzzysearch indexes a node, other modules have the chance to change the
node or prevent it from being indexed. If a module implements this hook and
returns FALSE the node will not be indexed. If it already existed in the index
it will be removed. Modules should check that they have a node object to work
with, as another module may have already returned FALSE.

Changes returned to the node object are only reflected in the fuzzysearch index,
not in the node as saved in the database.

Some uses for this hook include preventing a node type from being indexed,
boosting a node's search score for certain words, or adding additional text to
a node. This is slightly different than hook_nodeapi's update index operation in
that you can replace or change parts of the node rather than just adding text.

Example code of a contributed module implementing the indexing hook:

// Prevent private nodes from being indexed by fuzzy search.
function custom_fuzzysearch_index($node) {
  if (!is_object($node) || $node->type == 'private') {
    return FALSE;
  }
  else {
    return $node;
  }
}

=== hook_fuzzysearch_filter($op, $text) ===

hook_fuzzysearch_filter($op, $text) gives modules an opportunity to filter the
text before it is indexed and/or searched. The common use for this is
to do more complicated filtering than is allowed by the stop words text files.

$op == 'index' will filter words on indexing of content. $op == 'search' will
filter the search terms before the index is searched for results.
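
A minimal sketch of an implementation, for a hypothetical module named "custom",
might look like the following. It assumes the hook returns the filtered text,
which this README does not state; check the bundled fuzzysearch_filter_example
submodule for the actual contract.

```php
<?php
/**
 * Illustrative implementation of hook_fuzzysearch_filter().
 *
 * Assumption: the hook returns the filtered text.
 */
function custom_fuzzysearch_filter($op, $text) {
  switch ($op) {
    case 'index':
      // Drop purely numeric words so they never enter the index, then
      // collapse the whitespace left behind.
      $text = preg_replace('/\b\d+\b/', '', $text);
      return trim(preg_replace('/\s+/', ' ', $text));

    case 'search':
      // Leave the user's search terms unchanged in this sketch.
      return $text;
  }
  return $text;
}
```

Filtering only on 'index' keeps numbers out of the index, which also means a
search for a number will find nothing; whether to apply the same filter on
'search' depends on the site.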

=== About The Author ===

Drupal 6 version maintained by awolfey.

This module was created for Drupal as part of Google Summer of Code 2007 by
Blake Lucchesi www.boldsource.com blake@boldsource.com
