Search by Page module for Drupal

This module adds page-oriented searching to the core Drupal Search module. It
can be used as an additional tab on the core Search page, or you can display
Search by Page separately (though it still requires the core Search module to be
enabled, because it uses Search for its indexing).

Contents of this file:
- Introduction
- How it works
- Requirements and Setup
- Installation and Usage
- Configuration
- Theming
- Users and roles
- Other suggestions


-- How it works --

The core Search module works by indexing your content whenever cron is run, and
then looking in that index when someone requests a search on your site. The
Content tab of Search indexes all the content items on your site managed
by the core Node module, by loading the content item and indexing the resulting
content (including the body and other fields, as well as comments, following the
display settings on the content type). The User portion actually doesn't
index users -- at search time, it just looks for a user name that matches (User
Search doesn't look in any other profile fields, except that for an
administrative user doing a user search, it looks in the email field as well).
Other modules may add other tabs to Search as well.

Search by Page, in contrast, indexes the content of pages on your site, which
could be content pages, user profile pages, composites of content (such as
Views), or pages that are generated by other modules. The indexing in Search by
Page is done by first building the "content" region of each page to be indexed,
in the language(s) appropriate to the page and with the viewpoint of the role
you configure, and then adding that content to the Search index. Note that only
the "content" region of the page is indexed, not the sidebars, header, footer,
or other block regions in your theme. Also note that what is indexed is what is
output by your theme for each page, in contrast to core Search, which does not
depend on the theme's rendering of the page.

Search by Page also restricts search results to the currently active language.
The core Search module only does this for content search, and only if you have
the Internationalization module enabled.

One other difference between Search by Page and the usual content search of the
core Search module is in reindexing. Search by Page assumes that your page
content might change over time, so it periodically reindexes the pages on your
site, giving priority to pages that have been edited. In contrast, the core
Search module assumes that if a content item hasn't been edited, it doesn't need
to be reindexed. You have some additional control over this reindexing -- see
the configuration section below.

Your site may experience errors during content indexing, which you can see in
the Recent Log Entries report. The typical reason is that the item that is
being indexed cannot be viewed by the user role you chose for indexing; other
errors are also possible. If this happens, Search by Page will still mark the
page as "indexed" in the search index, so that the next cron run does not retry
the failing page and keep working items from being indexed. If you
ever want Search by Page to try indexing the failed pages again (after fixing
the cause of the error, presumably), there is a link to reset items with no
content in the index. This is located in the "Additional Actions" section of
the Search by Page configuration screen.
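
As a rough illustration of why failed pages are marked as indexed, here is a
small Python model of a cron queue. It is only a sketch: the names cron_run,
reset_empty, and render, and the status strings, are invented for this example
and are not part of Search by Page.

```python
def cron_run(status, render, limit, mark_failed=True):
    """Index up to `limit` pending paths and return the paths attempted.

    `status` maps path -> "pending", "indexed", or "indexed-empty".
    `render` is a hypothetical callback that returns page text or raises
    PermissionError when the indexing role cannot view the page.
    """
    attempted = [path for path, state in status.items() if state == "pending"]
    attempted = attempted[:limit]
    for path in attempted:
        try:
            render(path)
            status[path] = "indexed"
        except PermissionError:
            if mark_failed:
                # Recorded as indexed (with no content) so it is not retried.
                status[path] = "indexed-empty"
    return attempted


def reset_empty(status):
    """Model of the reset link: queue the failed pages for another attempt."""
    for path, state in status.items():
        if state == "indexed-empty":
            status[path] = "pending"


def render(path):
    # Hypothetical page renderer: one path is not viewable by the role.
    if path == "private/report":
        raise PermissionError(path)
    return "text of " + path


pages = {"private/report": "pending", "about": "pending"}
first = cron_run(pages, render, limit=1)
second = cron_run(pages, render, limit=1)
print(first, second, pages)
```

In this model, calling cron_run with mark_failed=False retries the failing page
on every run, and with a limit of 1 the page behind it is never reached.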


-- REQUIREMENTS and SETUP --

Search by Page does not know what the pages of your site are, so it doesn't
index anything by itself. You will need to enable and configure at least one
sub-module that lets you add paths to the search index, in order for this module
to do anything.

You will also need to set up one or more search "environments". Each environment
defines which paths are searchable, and has its own search URL and search block.

You will also need to make sure that Search by Page is enabled on the main
Search configuration page.

Four sub-modules are provided: "Paths" to index arbitrary paths to pages on your
site, "Nodes" to index content items of particular content types, "Users" to
index user profile pages for users of particular roles, and "Attachments" to
index files attached to content items.

The "Paths" sub-module is the most generic, but if you put a lot of paths in it,
your searches will run more slowly. (The technical reason is that each time
someone searches on your site, this module has to check whether that person has
permission to view each page in the list, so that pages the person may not view
are excluded from the search results. This check has to be done in a PHP loop
rather than in an SQL query, because of how Drupal permissions work.)
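
The cost of that per-path check can be pictured with a short Python sketch. The
function names and the access rule below are hypothetical, chosen only to show
that every candidate path needs its own permission check; this is not the
module's actual code.

```python
def filter_viewable(candidate_paths, account, can_view):
    """Keep only the paths the searching user may view, one check per path."""
    visible = []
    for path in candidate_paths:
        if can_view(account, path):
            visible.append(path)
    return visible


checks = []


def can_view(account, path):
    # Hypothetical access rule: only administrators may see admin/ paths.
    checks.append(path)
    return account == "admin" or not path.startswith("admin/")


paths = ["about", "admin/reports", "contact"]
print(filter_viewable(paths, "visitor", can_view))
print(len(checks))
```

Here the visitor gets "about" and "contact", and the access rule runs three
times, once per configured path, so the work grows with the length of the list.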

IMPORTANT NOTE: If you are using Search by Page Paths, your database must be
set up with permission to create temporary tables.

The "Attachments" sub-module indexes the text in certain types of files that are
attached to content items via the core File field. This requires "helper"
programs to extract the text from file attachments, and the helper programs are
configured using the separate Search Files module (which you can find at
http://drupal.org/project/search_files). It is recommended that you only enable
the Search Files API module (and not the other included modules). This will
enable just the helper program setup functionality, without enabling the other
functionality of Search Files.

If you want to write your own sub-modules, see the search_by_page.api.php file
included with this module (or use one of the included sub-modules as an
example).

Once you have enabled sub-module(s), visit the path
admin/config/search/search_by_page to set up search environments and define
pages to index for each environment. Then wait for cron to run (or visit the
status report page, admin/reports/status, and click on "run cron manually"). No
pages will be indexed until cron has run, and no search results will be returned
until pages have been indexed.

Other configuration options:
* You can change various labels and other text on the Search by Page
configuration pages.
* You can set the number of items Search by Page will index per cron run on
the Search by Page configuration pages. This is independent of the indexing
settings for the core Search module. If you are using Search by Page as an
independent search (rather than as a tab on the core Search page -- see the
"INSTALLATION and USAGE" section below), you might want to set the core Search
settings cron limit to zero, and turn off searching for the core Node and User
modules (on the core Search Settings page), so that only Search by Page items
are added to the search index.
* You can control the reindex cycling described in the "How it works" section
above by using the minimum/maximum reindexing time settings, which are on a
per-module, per-environment basis. Setting the minimum reindexing time forces
Search by Page to wait at least this amount of time before reindexing that type
of page. Setting the maximum reindexing time forces Search by Page to reindex
that type of page immediately when this amount of time has passed. WARNING: Do
not choose too small a maximum reindexing time! This setting works by marking
the pages for immediate reindexing when this time has passed, and it can
interfere with the indexing of new content.
* You can exclude the contents of specific HTML tags from indexing.
* You will also need to set permissions, which are separate from the core Search
permissions.
* You should also visit the main Search configuration screen, where you can set
options such as the number of items to index each cron run for core Search
modules, and the minimum word size for searching. You can also watch the
progress of indexing on that page (there is a detailed table near the bottom,
in the Search by Page section).
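
The minimum/maximum reindexing times in the list above can be read as a simple
rule on the age of a page's index entry. The Python sketch below shows one
possible reading; the function name, the three labels, and the use of None for
"no maximum" are assumptions made for the example, not the module's settings
format.

```python
HOUR = 3600


def reindex_status(last_indexed, now, min_time, max_time=None):
    """Classify a page as "wait", "eligible", or "immediate".

    All arguments are in seconds. `max_time=None` is this sketch's way of
    saying "no maximum"; the module's own settings may be stored differently.
    """
    age = now - last_indexed
    if age < min_time:
        return "wait"  # too soon since the last indexing
    if max_time is not None and age >= max_time:
        return "immediate"  # overdue: marked for reindexing right away
    return "eligible"  # may be reindexed in the normal cycle


for age_hours in (1, 12, 48):
    print(age_hours, reindex_status(0, age_hours * HOUR, 6 * HOUR, 24 * HOUR))
```

With a 6-hour minimum and a 24-hour maximum, the three ages come out as "wait",
"eligible", and "immediate". Under this reading, a very small maximum would put
nearly every page in the "immediate" group on each cron run.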


-- INSTALLATION and USAGE --

You have two choices for how to use the module, once you have it set up:

a) There will be a new tab (called "Pages" by default) included in Drupal's
built-in search. If a site visitor performs a search from that tab, they will
get the Search by Page results. This will use whichever search environment you
have set as the default, and may be useful if you just want to use Search by
Page to add a few pages or files to Drupal's existing search functionality.

b) You can also use Search by Page as a stand-alone search. You will need to set
up your search environment(s) so that all the content you want to search is
available. Then enable the Search by Page blocks for your search environments,
and/or add links to the paths you have defined for your search environments to
your navigation menu system. You will also need to make sure that Search by Page
is an enabled search module on the Search configuration page
(admin/config/search/settings).


-- CONFIGURATION --

-- Theming --

The search form that is used by Search by Page on search pages and search blocks
can be themed using the search-by-page-form.tpl.php file provided (copy that
file into your theme and modify it).

Search results are themed using the search-result.tpl.php (each result item) and
search-results.tpl.php (the list of results) theme files from the core Search
module (in directory modules/search in your Drupal installation). If you are
using Search by Page Attachments, an additional variable is available,
$result['related_node'], which gives you the node object that the attachment is
attached to.


-- Users and roles --

When you set up content items, attachments, etc. for searching within Search by
Page, you will need to choose a role to use for search indexing. This will make
Search by Page render your pages from the point of view of a user with that
role.

In order to do this, assuming you have used a non-anonymous role, Search by Page
will create its own user accounts for internal use, which you will see on your
Users management page. For instance, if you set up Search by Page Nodes to index
from the point of view of role "My role", Search by Page will set up a user
called "sbp indexing My role" with role "My role". The users that Search by Page
sets up will always have their status set to "blocked". During search indexing,
the account is set to "active" only temporarily, and only for the indexing
process, so no one should ever be able to see these users except site
administrators.
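
The "blocked except while indexing" behavior can be modeled in a few lines of
Python. This is a conceptual sketch only: the dictionary standing in for a user
account and the helper names are invented here, and the module's real
implementation may work quite differently.

```python
from contextlib import contextmanager


def indexing_username(role_name):
    """Follow the naming pattern given above: "sbp indexing <role>"."""
    return "sbp indexing " + role_name


@contextmanager
def temporarily_active(account):
    """Mark the account active inside the block, blocked again afterwards."""
    account["status"] = "active"
    try:
        yield account
    finally:
        # Runs even if indexing raises, so the account never stays active.
        account["status"] = "blocked"


account = {"name": indexing_username("My role"), "status": "blocked"}
with temporarily_active(account) as user:
    print(user["name"], "is", user["status"])
print(account["name"], "is", account["status"])
```

The try/finally is the point of the sketch: the account goes back to "blocked"
even when indexing a page fails partway through.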


-- Other suggestions --

1. Stemming and Exact Matches

The default behavior for Drupal's core Search module (which is the technology
used for indexing/searching in Search by Page) is that only exact matches are
returned (except for the User search portion of core Search, which matches
substrings of user names). For instance, if you search for "quake", a page that
contains only "quakes", "quaking", or "earthquake" will not be matched.

To get around this limitation, I suggest using a "stemmer" module, such as
http://drupal.org/project/porterstemmer. (You can search for "stemmer" on
drupal.org to find stemmers for other languages.) Stemmers enable matching on
inflected forms of words (verb forms, plurals, etc.), so they should give you
matches for "quaking" and "quakes" if you search for "quake". They wouldn't give
you a match for "earthquake", however.
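
To make the difference concrete, here is a toy Python sketch of exact matching
versus stemmed matching. The suffix list is deliberately crude and is NOT the
Porter algorithm used by the module linked above; the function names are
made up for this example.

```python
SUFFIXES = ("ing", "es", "s", "e")


def toy_stem(word):
    """Strip one common English suffix, keeping at least three letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def matches(query, page_words, stem=None):
    """Exact word match by default; compare stems when `stem` is given."""
    if stem is None:
        return query in page_words
    return stem(query) in {stem(word) for word in page_words}


for word in ("quakes", "quaking", "earthquake"):
    print(word, matches("quake", [word]), matches("quake", [word], toy_stem))
```

Without a stemmer none of the three words match "quake". With the toy stemmer,
"quakes" and "quaking" reduce to the same stem as "quake" and match, while
"earthquake" still does not.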

2. Context module

If you are using the Context module to place blocks in the content region of
pages, and these blocks are different for different pages on your site, then you
will likely have problems with Search by Page search indexing. Search by Page
indexes all of the content (including blocks) that is in the content region of
your page. This would normally be OK, but the problem is that the Context module
caches a lot of data in PHP static variables during each page request. During
normal site operation, this is fine, because the static variables are cleared
after each page request and recalculated for the next one. But during a cron run
for search indexing, the static variables are not cleared until the end of the
cron run, so the blocks that the Context module calculates for the first page
indexed are reused for all of the other pages instead of being recalculated.

The solution to this problem is to limit how many pages are indexed per cron
run, which both the core Search module and the Search by Page module allow you
to configure. If you set this value to 1, each cron run indexes a single page,
so there should be no chance that the block context from one page is reused for
the next.
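
The caching problem and the one-page-per-run workaround can be reproduced in a
small Python model. The module-level dictionary stands in for the statically
cached data, and all names are hypothetical; the Context module's real code is
not shown here.

```python
_static_cache = {}


def blocks_for(path):
    """Compute the blocks once, then reuse the cached value (the bug)."""
    if "blocks" not in _static_cache:
        _static_cache["blocks"] = "blocks for " + path
    return _static_cache["blocks"]


def index_in_one_run(paths):
    """Index `paths` in one process; the cache starts empty, then persists."""
    _static_cache.clear()
    return {path: blocks_for(path) for path in paths}


print(index_in_one_run(["news", "events"]))
print(index_in_one_run(["news"]), index_in_one_run(["events"]))
```

In the first call both pages end up with the blocks computed for "news". In the
second line each page is indexed in its own run and gets its own blocks.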
