README.txt in Search by Page 7

Search by Page module for Drupal

This module adds page-oriented searching to the core Drupal Search module. It
can be used as an additional tab on the core Search page, or you can display
Search by Page separately (though it still requires the core Search module to
be enabled, because it uses Search for its indexing).

Contents of this file:
- How it works
- Setup and Configuration
- Usage
- Theming
- Users and roles
- Other suggestions


-- How it works --

The core Search module works by indexing your content whenever cron is run, and
then looking in that index when someone requests a search on your site. The
Content tab of Search indexes all the content items on your site managed
by the core Node module, by loading the content item and indexing the resulting
content (including the body and other fields, as well as comments, following the
display settings on the content type). The User portion actually doesn't
index users -- at search time, it just looks for a user name that matches (User
Search doesn't look in any other profile fields, except that for an
administrative user doing a user search, it looks in the email field as well).
Other modules may add other tabs to Search as well.

Search by Page, in contrast, indexes the content of pages on your site, which
could be content pages, user profile pages, composites of content (such as
Views), or pages that are generated by other modules. The indexing in Search by
Page is done by first building the "content" region of each page to be indexed,
in the language(s) appropriate to the page and with the viewpoint of the role
you configure, and then adding that content to the Search index. Note that only
the "content" region of the page is indexed, not the sidebars, header, footer,
or other block regions in your theme. Also note that what is indexed is what is
output by your theme for each page, in contrast to core Search, which does not
depend on the theme's rendering of the page.

Search by Page also restricts search results to the currently-enabled language.
The core Search module only does this for content search, and only if you have
the Internationalization module enabled.

One other difference between Search by Page and the usual content search of the
core Search module is in reindexing. Search by Page assumes that your page
content might change over time, so it periodically reindexes the pages on your
site, giving priority to pages that have been edited. In contrast, the core
Search module assumes that if a content item hasn't been edited, it doesn't need
to be reindexed. You have some additional control over this reindexing -- see
the configuration section below.

Your site may experience errors during content indexing, which you can see in
the Recent Log Entries report. The typical reason is that the item being
indexed cannot be viewed by the user role you chose for indexing; other errors
are also possible. If this happens, Search by Page will still mark the page as
"indexed" in the search index, so that on the next cron run it does not retry
the failed page and hold up the indexing of working items. If you ever want
Search by Page to try indexing the failed pages again (after fixing the cause
of the error, presumably), there is a link to reset items with no content in
the index. This is located in the "Additional Actions" section of the Search by
Page configuration screen.


-- Setup and Configuration --

Search by Page does not know what the pages of your site are, so it doesn't
index anything by itself. You will need to enable and configure at least one
sub-module that lets you add paths to the search index, in order for this module
to do anything.

You will also need to set up one or more search "environments". Each environment
defines which paths are searchable, and has its own search URL and search block.

You will also need to make sure that Search by Page is enabled on the main
Search configuration page.

Four sub-modules are provided: "Paths" to index arbitrary paths to pages on your
site, "Nodes" to index content items of particular content types, "Users" to
index user profile pages for users of particular roles, and "Attachments" to
index files attached to content items.

The "Paths" sub-module is the most generic, but if you put a lot of paths in it,
your searches will run slower. (The technical reason is that each time someone
searches on your site, this module has to check whether that person has
permission to view each page in the list, to exclude pages the person doesn't
have permission to view from search results, and this has to be done via a PHP
loop rather than an SQL query because of how Drupal permissions work.)
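
The per-path permission check can be pictured with a short sketch. This is a
simplified Python model, not the module's actual PHP code; the names
filter_viewable and can_view, and the sample paths, are hypothetical. It only
illustrates why search time grows with the number of paths: every candidate
path gets its own access check on every search.

```python
from typing import Callable, Iterable, List


def filter_viewable(paths: Iterable[str],
                    can_view: Callable[[str], bool]) -> List[str]:
    """Return only the paths the current user may view.

    Each path needs its own access check, so the work grows linearly
    with the number of paths registered for indexing.
    """
    viewable = []
    for path in paths:
        if can_view(path):  # one access check per path, per search
            viewable.append(path)
    return viewable


# Hypothetical data: this visitor is not allowed to see admin pages.
all_paths = ["about", "admin/reports", "contact", "admin/people"]
visible = filter_viewable(all_paths, lambda p: not p.startswith("admin/"))
print(visible)
```

With a handful of paths the loop is cheap; with thousands, it runs thousands of
checks for each search request.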

IMPORTANT NOTE: If you are using Search by Page Paths, your database must be
set up with permission to create temporary tables.

The "Attachments" sub-module indexes the text in certain types of files that are
attached to content items via the core File field. This requires "helper"
programs to extract the text from file attachments, and the helper programs are
configured using the separate Search Files module (which you can find at
http://drupal.org/project/search_files). It is recommended that you only enable
the Search Files API module (and not the other included modules). This will
enable just the helper program setup functionality, without enabling the other
functionality of Search Files.

If you want to write your own sub-modules, see the search_by_page.api.php file
included with this module (or use one of the included sub-modules as an
example).

Once you have enabled sub-module(s), visit the path
admin/config/search/search_by_page to set up search environments and define
pages to index for each environment. Then wait for cron to run (or visit the
status report page, admin/reports/status, and click on "run cron manually"). No
pages will be indexed until cron has run, and no search results will be
returned until pages have been indexed.

Other configuration options:
* You can change various labels and other text on the Search by Page
configuration pages.
* You can set the number of items Search by Page will index per cron run on
the Search by Page configuration pages. This is independent of the indexing
settings for the core Search module. If you are using Search by Page as an
independent search (rather than as a tab on the core Search page -- see the
Usage section below), you might want to set the core Search settings cron limit
to zero, and turn off searching for the core Node and User modules (on the core
Search Settings page), so that only Search by Page items are added to the
search index.
* You can control the reindex cycling described in the "How it works" section
above by using the minimum/maximum reindexing time settings, which are on a
per-module, per-environment basis. Setting the minimum reindexing time forces
Search by Page to wait at least this amount of time before reindexing that type
of page. Setting the maximum reindexing time forces Search by Page to reindex
that type of page immediately when this amount of time has passed. WARNING: Do
not choose too small a maximum reindexing time globally! This setting works by
marking the pages for immediate reindexing when this time has passed, and it can
interfere with the reindexing of new content.
* You can exclude the contents of specific HTML tags from indexing.
* You will also need to set permissions, which are separate from the core Search
permissions.
* You should also visit the main Search configuration screen, where you can set
options such as the number of items to index each cron run for core Search
modules, and the minimum word size for searching. You can also watch the
progress of indexing on that page (there is a detailed table near the bottom,
in the Search by Page section).
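
The effect of the minimum/maximum reindexing times can be sketched as a small
decision function. This is one reading of the description above, written in
Python for illustration; it is not the module's code, and the function name
reindex_status and the status labels are hypothetical.

```python
def reindex_status(seconds_since_indexed: int,
                   min_seconds: int,
                   max_seconds: int) -> str:
    """Classify a page by how long ago it was last indexed."""
    if seconds_since_indexed < min_seconds:
        return "wait"        # too soon: skip this page for now
    if seconds_since_indexed >= max_seconds:
        return "immediate"   # overdue: mark for immediate reindexing
    return "eligible"        # may be reindexed in the normal cycle


DAY = 24 * 60 * 60
# Assumed settings for the example: minimum 1 day, maximum 7 days.
for age_days in (0, 3, 10):
    print(age_days, reindex_status(age_days * DAY, 1 * DAY, 7 * DAY))
```

In this model, a very small maximum puts almost every page in the "immediate"
group on every cron run, which is consistent with the warning above.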


-- Usage --

You have two choices for how to use the module, once you have it set up:

a) There will be a new tab (called "Pages" by default), included in Drupal's
built-in search. If a site visitor performs a search from that tab, they will
get the Search by Page results. This will use whichever search environment you
have set as the default, and may be useful if you just want to use Search by
Page to add a few pages or files to Drupal's existing search functionality.

b) You can also use Search by Page as its own entity. You will need to set up
your search environment(s) so that all the content you want to search is
available. To run Search by Page as its own entity, enable the Search by Page
blocks for your search environments, and/or add a link to the paths you have
defined for your search environments to your navigation menu system. You will
also need to make sure that Search by Page is an enabled search module on the
Search configuration page (admin/config/search/settings).


-- Theming --

The search form that is used by Search by Page on search pages and search blocks
can be themed using the search-by-page-form.tpl.php file provided (copy that
file into your theme and modify it).

Search results are themed using the search-result.tpl.php (each result item) and
search-results.tpl.php (the list of results) theme files from the core Search
module (in directory modules/search in your Drupal installation). If you are
using Search by Page Attachments, an additional variable is available,
$result['related_node'], which gives you the node object that the attachment is
attached to.


-- Users and roles --

When you set up content items, attachments, etc. for searching within Search by
Page, you will need to choose a role to use for search indexing. This will make
Search by Page render your pages from the point of view of a user with that
role.

In order to do this, assuming you have used a non-anonymous role, Search by Page
will create its own user accounts for internal use, which you will see on your
Users management page. For instance, if you set up Search by Page Nodes to index
from the point of view of role "My role", Search by Page will set up a user
called "sbp indexing My role" with role "My role". The users that Search by Page
sets up will always have their status set to "blocked". During search indexing,
the account is set to "active" only temporarily, and only for the indexing
process, so no one should ever be able to see these users except site
administrators.


-- Other suggestions --

1. Stemming and Exact Matches

The default behavior for Drupal's core Search module (which is the technology
used for indexing/searching in Search by Page) is that only exact matches are
returned (except for the User search portion of core Search, which matches
substrings of user names). For instance, this means that if you search for
"quake", and a page contains "quakes", "quaking", or "earthquake", it will not
be matched.

To get around this limitation, I suggest using a "stemmer" module, such as
http://drupal.org/project/porterstemmer. (You can search for "stemmer" on
drupal.org to find stemmers for other languages.) Stemmers enable matching on
inflected forms of words (verb forms, plurals, etc.), so they should give you
matches for "quaking" and "quakes" if you search for "quake". They wouldn't give
you a match for "earthquake", however.
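
The difference between exact matching and stemmed matching can be illustrated
with a toy example. The Python sketch below is not the Porter algorithm and is
not how Drupal's search is implemented; toy_stem is a hypothetical
suffix-stripping rule, made up only to show the idea.

```python
import re


def words(text: str) -> set:
    """Split text into a set of lowercase words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def toy_stem(word: str) -> str:
    """Strip a few common English suffixes (toy rule, not Porter)."""
    for suffix in ("ing", "es", "s", "e"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def exact_match(query: str, page: str) -> bool:
    """True only if the query appears as a whole word on the page."""
    return query.lower() in words(page)


def stem_match(query: str, page: str) -> bool:
    """True if the query and some word on the page share a stem."""
    return toy_stem(query.lower()) in {toy_stem(w) for w in words(page)}


page = "Quakes were felt; the ground kept quaking."
print(exact_match("quake", page))  # no whole-word "quake" on the page
print(stem_match("quake", page))   # "quakes" and "quaking" share its stem
print(stem_match("quake", "An earthquake struck."))  # compound word differs
```

Because stemming only strips endings, "earthquake" still reduces to a different
stem than "quake", so it is not matched.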

2. Context module

If you are using the Context module to place blocks in the content region of
pages, and these blocks are different for different pages on your site, then you
will likely have problems with Search by Page search indexing. Search by Page
indexes all of the content (including blocks) that is in the content region of
your page. This would normally be OK, but the problem is that the Context module
does a lot of caching in PHP static variables during each page request. During
normal site operation, this is fine, because the static variables are cleared
out after each page request and recalculated for the next one. But during a
cron run for search indexing, the static variables are not cleared out until the
end of the cron run, so the blocks that the Context module calculates for the
first page that is indexed will be used for all of the other pages indexed in
that run, instead of being recalculated for each page.

To work around this problem, note that both the core Search module and the
Search by Page module let you specify how many pages can be indexed per cron
run. If you set this value to 1, then there will be no chance that the wrong
block context is used for subsequent pages that are indexed.
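
The static caching problem can be modeled in a few lines. This Python sketch is
a deliberately simplified stand-in, not the Context module's code: the
module-level dictionary plays the role of a PHP static variable, and the names
blocks_for and cron_run are hypothetical.

```python
_static_cache = {}  # stands in for a PHP static variable


def blocks_for(path: str) -> list:
    """Return the blocks for a page, reusing the first cached answer."""
    if "blocks" not in _static_cache:
        _static_cache["blocks"] = ["block for " + path]
    return list(_static_cache["blocks"])


def cron_run(paths: list) -> dict:
    """Index the given pages within a single request."""
    _static_cache.clear()  # a new request starts with fresh statics
    return {path: blocks_for(path) for path in paths}


# Two pages in one cron run: the second page gets the first page's blocks.
shared = cron_run(["node/1", "node/2"])
print(shared)

# One page per cron run: each page gets its own blocks.
first = cron_run(["node/1"])
second = cron_run(["node/2"])
print(first, second)
```

Indexing one page per cron run corresponds to the second case: the cached value
never outlives the page it was computed for.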

  189. indexed.