Search by Page module for Drupal
This module adds page-oriented searching to the core Drupal Search module. It
can be used as an additional tab on the core Search page, or you can display
Search by Page separately (though it requires the core Search module to be
enabled, because it uses Search for its indexing).
Contents of this file:
- Introduction
- How it works
- Requirements and Setup
- Installation and Usage
- Configuration
- Theming
- Users and roles
- Other suggestions
-- How it works --
The core Search module works by indexing your content whenever cron is run, and
then looking in that index when someone requests a search on your site. The
Content tab of Search indexes all the content items on your site managed
by the core Node module, by loading the content item and indexing the resulting
content (including the body and other fields, as well as comments, following the
display settings on the content type). The User portion actually doesn't
index users -- at search time, it just looks for a user name that matches (User
Search doesn't look in any other profile fields, except that for an
administrative user doing a user search, it looks in the email field as well).
Other modules may add other tabs to Search as well.
Search by Page, in contrast, indexes the content of pages on your site, which
could be content pages, user profile pages, composites of content (such as
Views), or pages that are generated by other modules. The indexing in Search by
Page is done by first building the "content" region of each page to be indexed,
in the language(s) appropriate to the page and with the viewpoint of the role
you configure, and then adding that content to the Search index. Note that only
the "content" region of the page is indexed, not the sidebars, header, footer,
or other block regions in your theme. Also note that what is indexed is what is
output by your theme for each page, in contrast to core Search, which does not
depend on the theme's rendering of the page.
Search by Page also restricts search results to the currently-enabled language.
The core Search module only does this for content search, and only if you have
the Internationalization module enabled.
One other difference between Search by Page and the usual content search of the
core Search module is in reindexing. Search by Page assumes that your page
content might change over time, so it periodically reindexes the pages on your
site, giving priority to pages that have been edited. In contrast, the core
Search module assumes that if a content item hasn't been edited, it doesn't need
to be reindexed. You have some additional control over this reindexing -- see
"Other configuration options" below.
Your site may experience errors during content indexing, which you can see in
the Recent Log Entries report. The typical reason is that the item that is
being indexed cannot be viewed by the user role you chose for indexing; other
errors are also possible. If this happens, Search by Page will still mark the
page as "indexed" in the search index, so that on the next cron run it does
not retry the failed page and hold up the indexing of working items. If you
ever want Search by Page to try indexing the failed pages again (after fixing
the cause of the error, presumably), there is a link to reset items with no
content in the index. This is located in the "Additional Actions" section of
the Search by Page configuration screen.
-- Requirements and Setup --
Search by Page does not know what the pages of your site are, so it doesn't
index anything by itself. You will need to enable and configure at least one
sub-module that lets you add paths to the search index, in order for this module
to do anything.
You will also need to set up one or more search "environments". Each environment
defines which paths are searchable, and has its own search URL and search block.
You will also need to make sure that Search by Page is enabled on the main
Search configuration page.
Four sub-modules are provided: "Paths" to index arbitrary paths to pages on your
site, "Nodes" to index content items of particular content types, "Users" to
index user profile pages for users of particular roles, and "Attachments" to
index files attached to content items.
The "Paths" sub-module is the most generic, but if you put a lot of paths in it,
your searches will run slower. (The technical reason is that each time someone
searches on your site, this module has to check whether that person has
permission to view each page in the list, to exclude pages the person doesn't
have permission to view from search results, and this has to be done via a PHP
loop rather than an SQL query because of how Drupal permissions work.)
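As a rough illustration of why a long path list slows searches down, here is a
minimal sketch. It is written in Python rather than Drupal's PHP, and it is not
the module's actual code: the function name and the can_view callback are
hypothetical stand-ins for Drupal's access checks. The point is only that each
configured path needs its own check, so the work grows with the length of the
list.

```python
from typing import Callable, Iterable, List


def filter_viewable_paths(
    paths: Iterable[str],
    can_view: Callable[[str], bool],
) -> List[str]:
    """Return only the paths the current visitor may view.

    `can_view` is a hypothetical stand-in for a per-path access check.
    It is called once per path, so the work is linear in the path count.
    """
    viewable = []
    for path in paths:
        if can_view(path):
            viewable.append(path)
    return viewable


# Example: a visitor who may not see anything under "admin/".
configured_paths = ["about", "admin/reports", "contact"]
allowed = filter_viewable_paths(
    configured_paths, lambda p: not p.startswith("admin/")
)
print(allowed)
```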
IMPORTANT NOTE: If you are using Search by Page Paths, your database must be
set up with permission to create temporary tables.
The "Attachments" sub-module indexes the text in certain types of files that are
attached to content items via the core File field. This requires "helper"
programs to extract the text from file attachments, and the helper programs are
configured using the separate Search Files module (which you can find at
http://drupal.org/project/search_files). It is recommended that you only enable
the Search Files API module (and not the other included modules). This will
enable just the helper program setup functionality, without enabling the other
functionality of Search Files.
If you want to write your own sub-modules, see the search_by_page.api.php file
included with this module (or use one of the included sub-modules as an
example).
Once you have enabled sub-module(s), visit the path
admin/config/search/search_by_page to set up search environments and define
pages to index for each environment. Then wait for cron to run (or visit the
status report page, admin/reports/status, and click on "run cron manually"). No
pages will be indexed until cron has run, and no search results will come out
until pages have been indexed.
Other configuration options:
* You can change various labels and other text on the Search by Page
configuration pages.
* You can set the number of items Search by Page will index per cron run on
the Search by Page configuration pages. This is independent of the indexing
settings for the core Search module. If you are using Search by Page as an
independent search (rather than as a tab on the core Search page -- see the
Installation and Usage section below), you might want to set the core Search
settings cron limit to zero, and turn off searching for the core Node and User
modules (on the core Search Settings page), so that only Search by Page items
are added to the search index.
* You can control the reindex cycling described in the How it Works section
above by using the minimum/maximum reindexing time settings, which are on a
per-module, per-environment basis. Setting the minimum reindexing time forces
Search by Page to wait at least this amount of time before reindexing that type
of page. Setting the maximum reindex time forces Search by Page to reindex that
type of page immediately when this amount of time has passed. WARNING: Do not
choose too small a maximum reindex time! This setting works by marking the
pages for immediate reindexing when this time has passed, and it can interfere
with the reindexing of new content.
* You can exclude the contents of specific HTML tags from indexing.
* You will also need to set permissions, which are separate from the core Search
permissions.
* You should also visit the main Search configuration screen, where you can set
options such as the number of items to index each cron run for core Search
modules, and the minimum word size for searching. You can also watch the
progress of indexing on that page (there is a detailed table near the bottom,
in the Search by Page section).
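
The minimum/maximum reindexing times described in the list above can be
pictured with the following sketch. It is an illustrative model in Python, not
the module's actual code; the function name, the three status labels, the use
of seconds, and the treatment of a zero maximum as "no maximum" are all
assumptions made for the example.

```python
def reindex_status(seconds_since_indexed, min_time, max_time):
    """Classify a page for the reindexing cycle.

    Returns one of three illustrative labels:
      "wait"      - indexed too recently; skip it for now
      "eligible"  - may be reindexed when the cycle reaches it
      "immediate" - past the maximum time; reindex as soon as possible
    A max_time of 0 is treated here as "no maximum" (an assumption).
    """
    if seconds_since_indexed < min_time:
        return "wait"
    if max_time > 0 and seconds_since_indexed >= max_time:
        return "immediate"
    return "eligible"


DAY = 24 * 60 * 60
for age in (DAY // 2, 3 * DAY, 10 * DAY):
    status = reindex_status(age, min_time=DAY, max_time=7 * DAY)
    print(age // 3600, "hours since indexing:", status)
```

In this model, a very small maximum time puts nearly every page in the
"immediate" group on every cron run, which is the situation the warning above
describes.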
-- Installation and Usage --
You have two choices for how to use the module, once you have it set up:
a) There will be a new tab (called "Pages" by default), included in Drupal's
built-in search. If a site visitor performs a search from that tab, they will
get the Search by Page results. This will use whichever search environment you
have set as the default, and may be useful if you just want to use Search by
Page to add a few pages or files to Drupal's existing search functionality.
b) You can also use Search by Page as its own entity. You will need to set up
your search environment(s) so that all the content you want to search is
available. To run Search by Page as its own entity, enable the Search by Page
blocks for your search environments, and/or add a link to the paths you have
defined for your search environments to your navigation menu system. You will
also need to make sure that Search by Page is an enabled search module on the
Search configuration page (admin/config/search/settings).
-- Configuration --
-- Theming --
The search form that is used by Search by Page on search pages and search blocks
can be themed using the search-by-page-form.tpl.php file provided (copy that
file into your theme and modify it).
Search results are themed using the search-result.tpl.php (each result item) and
search-results.tpl.php (the list of results) theme files from the core Search
module (in directory modules/search in your Drupal installation). If you are
using Search by Page Attachments, there is an additional variable available,
$result['related_node'], which gives you the node object that the attachment is
attached to.
-- Users and roles --
When you set up content items, attachments, etc. for searching within Search by
Page, you will need to choose a role to use for search indexing. This will make
Search by Page render your pages from the point of view of a user with that
role.
In order to do this, assuming you have used a non-anonymous role, Search by Page
will create its own user accounts for internal use, which you will see on your
Users management page. For instance, if you set up Search by Page Nodes to index
from the point of view of role "My role", Search by Page will set up a user
called "sbp indexing My role" with role "My role". The users that Search by Page
sets up will always have their status set to "blocked". During search indexing,
the account is set to "active" only temporarily, and only for the indexing
process, so no one should ever be able to see these users except site
administrators.
-- Other suggestions --
1. Stemming and Exact Matches
The default behavior for Drupal's core Search module (which is the technology
used for indexing/searching in Search by Page) is that only exact matches are
returned (except for the User search portion of core Search, which matches
substrings of user names). For instance, this means that if you search for
"quake", and a page contains "quakes", "quaking", or "earthquake", it will not
be matched.
To get around this limitation, I suggest using a "stemmer" module, such as
http://drupal.org/project/porterstemmer. (You can search for "stemmer" on
drupal.org to find stemmers for other languages.) Stemmers enable matching on
inflected forms of words (verb forms, plurals, etc.), so they should give you
matches for "quaking" and "quakes" if you search for "quake". They wouldn't give
you a match for "earthquake", however.
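
To make the difference concrete, here is a small sketch in Python. The
toy_stem function is a deliberately crude suffix stripper invented for this
example; it is not the Porter algorithm and not what any Drupal stemmer module
does internally. It only shows why stemming both the query and the page words
lets inflected forms match while compound words still do not.

```python
def toy_stem(word):
    """Strip a few common English suffixes (a toy stand-in for a real stemmer)."""
    word = word.lower()
    for suffix in ("ing", "es", "s", "e"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def exact_match(query, page_words):
    """True only if the query appears verbatim among the page words."""
    return query.lower() in {w.lower() for w in page_words}


def stemmed_match(query, page_words):
    """True if the stemmed query equals the stem of any page word."""
    return toy_stem(query) in {toy_stem(w) for w in page_words}


for page_word in ("quakes", "quaking", "earthquake"):
    print(
        page_word,
        "exact:", exact_match("quake", [page_word]),
        "stemmed:", stemmed_match("quake", [page_word]),
    )
```

Under these toy rules, "quake", "quakes", and "quaking" all reduce to the same
stem, while "earthquake" reduces to a different one and so still does not
match.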
2. Context module
If you are using the Context module to place blocks in the content region of
pages, and these blocks are different for different pages on your site, then you
will likely have problems with Search by Page search indexing. Search by Page
indexes all of the content (including blocks) that is in the content region of
your page. This would normally be OK, but the problem is that the Context module
does a lot of static caching in PHP variables during each page request. During
normal site operation, this is fine because the PHP variables are cleared out
after each page request, so they are recalculated for the next page load. But
during a cron run for search indexing, the PHP variables are not cleared out
until the end of the cron run, so the blocks that the Context module calculates
for the first page that is indexed are reused for all of the other pages instead
of being recalculated.
The solution to this problem is that the core Search module and also the Search
by Page module allow you to specify how many pages can be indexed per cron
run. If you set this value to 1, then there will be no chance that you'll have
problems with the wrong block context being used for subsequent pages that are
indexed.