function search_simplify in Zircon Profile 8
Same name and namespace in other branches
- 8.0 core/modules/search/search.module \search_simplify()
Simplifies and preprocesses text for searching.
Processing steps:
- Entities are decoded.
- Text is lower-cased and diacritics (accents) are removed.
- hook_search_preprocess() is invoked.
- CJK (Chinese, Japanese, Korean) characters are processed, depending on the search settings.
- Punctuation is processed (removed or replaced with spaces, depending on where it is; see code for details).
- Words are truncated to 50 characters maximum.
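The steps above can be approximated outside Drupal with plain PHP. This is a minimal sketch under stated assumptions, not the module's implementation: `sketch_simplify()` is a hypothetical name, the Unicode properties `\p{Nd}`, `\p{P}`, `\p{S}` and `\p{Z}` stand in for the module's own character-class constants, and the diacritics, hook and CJK steps are skipped. It requires the mbstring extension.

```php
<?php
// Hypothetical standalone approximation of the listed steps.
// Diacritic removal, hook_search_preprocess() and CJK handling are omitted.
function sketch_simplify($text, $max_len = 50) {
  // Decode HTML entities to UTF-8.
  $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
  // Lower-case the text.
  $text = mb_strtolower($text, 'UTF-8');
  // Join digit groups that are separated only by punctuation.
  // \p{Nd} and \p{P} are assumed stand-ins for the module's constants.
  $text = preg_replace('/(\p{Nd}+)\p{P}+(?=\p{Nd})/u', '\\1', $text);
  // Runs of two or more dots or dashes act as a word boundary.
  $text = preg_replace('/[.-]{2,}/', ' ', $text);
  // Remaining dots, underscores and dashes are removed.
  $text = preg_replace('/[._-]+/', '', $text);
  // All other punctuation, symbols and separators become one space.
  $text = preg_replace('/[\p{P}\p{S}\p{Z}\s]+/u', ' ', $text);
  // Truncate each word to the maximum length.
  $words = array_map(function ($word) use ($max_len) {
    return mb_substr($word, 0, $max_len, 'UTF-8');
  }, explode(' ', $text));
  return implode(' ', $words);
}

echo sketch_simplify('20/03/1984'), "\n";               // 20031984
echo sketch_simplify('20-03-1984'), "\n";               // 20031984
echo sketch_simplify('U.S.A.'), "\n";                   // usa
echo sketch_simplify('Hello&amp;World...Again'), "\n";  // hello world again
```

Because both date spellings reduce to the same string, a query for one form can match content indexed in the other.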
Parameters
string $text: Text to simplify.
string|null $langcode: Language code for the language of $text, if known.
Return value
string Simplified and processed text.
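A module can take part in this processing by implementing hook_search_preprocess(). The following is a minimal sketch: the module name `mymodule` and the spelling replacement are hypothetical and chosen only for illustration. Since the hook is invoked after lower-casing and diacritic removal, the implementation can assume it receives text in that form.

```php
<?php
/**
 * Implements hook_search_preprocess().
 *
 * "mymodule" is a hypothetical module name; the replacement is illustrative.
 */
function mymodule_search_preprocess($text, $langcode = NULL) {
  // Only alter English text; other languages pass through unchanged.
  if ($langcode === 'en') {
    // Normalize one spelling variant so both forms index identically.
    $text = str_replace('colour', 'color', $text);
  }
  // The hook returns the (possibly modified) text.
  return $text;
}
```

The same function runs on both indexed content and search queries (SearchQuery::parseSearchExpression calls search_simplify()), so any normalization done here applies to both sides of a match.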
7 calls to search_simplify()
- SearchQuery::parseSearchExpression in core/modules/search/src/SearchQuery.php - Parses the search query into SQL conditions.
- SearchSimplifyTest::testSearchSimplifyPunctuation in core/modules/search/src/Tests/SearchSimplifyTest.php - Tests that search_simplify() does the right thing with punctuation.
- SearchSimplifyTest::testSearchSimplifyUnicode in core/modules/search/src/Tests/SearchSimplifyTest.php - Tests that all Unicode characters simplify correctly.
- SearchTokenizerTest::testNoTokenizer in core/modules/search/src/Tests/SearchTokenizerTest.php - Verifies that strings of non-CJK characters are not tokenized.
- SearchTokenizerTest::testTokenizer in core/modules/search/src/Tests/SearchTokenizerTest.php - Verifies that strings of CJK characters are tokenized.
File
- core/modules/search/search.module, line 255 - Enables site-wide keyword searching.
Code
function search_simplify($text, $langcode = NULL) {
  // Decode entities to UTF-8
  $text = Html::decodeEntities($text);

  // Lowercase
  $text = Unicode::strtolower($text);

  // Remove diacritics.
  $text = \Drupal::service('transliteration')->removeDiacritics($text);

  // Call an external processor for word handling.
  search_invoke_preprocess($text, $langcode);

  // Simple CJK handling
  if (\Drupal::config('search.settings')->get('index.overlap_cjk')) {
    $text = preg_replace_callback('/[' . PREG_CLASS_CJK . ']+/u', 'search_expand_cjk', $text);
  }

  // To improve searching for numerical data such as dates, IP addresses
  // or version numbers, we consider a group of numerical characters
  // separated only by punctuation characters to be one piece.
  // This also means that searching for e.g. '20/03/1984' also returns
  // results with '20-03-1984' in them.
  // Readable regexp: ([number]+)[punctuation]+(?=[number])
  $text = preg_replace('/([' . PREG_CLASS_NUMBERS . ']+)[' . PREG_CLASS_PUNCTUATION . ']+(?=[' . PREG_CLASS_NUMBERS . '])/u', '\\1', $text);

  // Multiple dot and dash groups are word boundaries and replaced with space.
  // No need to use the unicode modifier here because 0-127 ASCII characters
  // can't match higher UTF-8 characters as the leftmost bit of those are 1.
  $text = preg_replace('/[.-]{2,}/', ' ', $text);

  // The dot, underscore and dash are simply removed. This allows meaningful
  // search behavior with acronyms and URLs. See unicode note directly above.
  $text = preg_replace('/[._-]+/', '', $text);

  // With the exception of the rules above, we consider all punctuation,
  // marks, spacers, etc, to be a word boundary.
  $text = preg_replace('/[' . Unicode::PREG_CLASS_WORD_BOUNDARY . ']+/u', ' ', $text);

  // Truncate everything to 50 characters.
  $words = explode(' ', $text);
  array_walk($words, '_search_index_truncate');
  $text = implode(' ', $words);

  return $text;
}
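The body of the search_expand_cjk callback is not shown in the listing above. As a rough sketch of what an overlapping CJK tokenizer could look like (which the setting name index.overlap_cjk suggests), the code below splits each CJK run into overlapping fixed-length tokens. Everything here is an assumption made for illustration: the function name `sketch_expand_cjk()`, the token length of 3, and the simplified character range, which covers only hiragana, katakana, common CJK ideographs and Hangul syllables rather than the module's PREG_CLASS_CJK.

```php
<?php
// Hypothetical callback: expands a run of CJK characters into overlapping
// tokens of $min characters each. Not the module's search_expand_cjk().
function sketch_expand_cjk(array $matches, $min = 3) {
  // Split the matched run into individual UTF-8 characters.
  $chars = preg_split('//u', $matches[0], -1, PREG_SPLIT_NO_EMPTY);
  $length = count($chars);
  // Runs no longer than the token size are kept whole.
  if ($length <= $min) {
    return ' ' . $matches[0] . ' ';
  }
  // Slide a window of $min characters across the run.
  $tokens = ' ';
  for ($i = 0; $i + $min <= $length; $i++) {
    $tokens .= implode('', array_slice($chars, $i, $min)) . ' ';
  }
  return $tokens;
}

// Simplified stand-in for the CJK character class (assumption).
$cjk = '\x{3040}-\x{30FF}\x{4E00}-\x{9FFF}\x{AC00}-\x{D7AF}';
$result = preg_replace_callback('/[' . $cjk . ']+/u', 'sketch_expand_cjk', 'abc 日本語検索 def');
// Collapse the extra spaces, as the later word-boundary step would.
$result = preg_replace('/\s+/', ' ', $result);
echo $result, "\n"; // abc 日本語 本語検 語検索 def
```

Overlapping tokens let a query for any three-character substring of an unsegmented CJK string match the index, at the cost of a larger index.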