public function SearchTextProcessor::analyze in Drupal 9

Runs the text through character analyzers in preparation for indexing.

Processing steps:

  • Entities are decoded.
  • Text is lower-cased and diacritics (accents) are removed.
  • hook_search_preprocess() is invoked.
  • CJK (Chinese, Japanese, Korean) characters are processed, depending on the search settings.
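  • Punctuation is processed: separators between digits are removed, as are single dots, underscores and dashes; runs of two or more dots or dashes, and all other punctuation, become spaces (see code for details).
  • Each word is truncated to at most 50 characters.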
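
The punctuation and truncation steps above can be sketched in plain PHP outside Drupal. This is a minimal illustration, not the core implementation: `sketch_analyze()` is a hypothetical name, ASCII-only POSIX classes (`[0-9]`, `[:punct:]`, `[:space:]`) stand in for the much larger `PREG_CLASS_NUMBERS`, `PREG_CLASS_PUNCTUATION` and `Unicode::PREG_CLASS_WORD_BOUNDARY` constants, and the diacritics, hook and CJK steps are left out.

```php
<?php

// Hypothetical stand-in for analyze(): ASCII-only character classes replace
// Drupal's PREG_CLASS_* constants; diacritics, hook and CJK steps are omitted.
function sketch_analyze(string $text): string {
  // Decode HTML entities, then lower-case.
  $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
  $text = mb_strtolower($text);

  // Join digit groups separated only by punctuation: 20/03/1984 -> 20031984.
  $text = preg_replace('/([0-9]+)[[:punct:]]+(?=[0-9])/', '\\1', $text);

  // Runs of two or more dots or dashes become a single space.
  $text = preg_replace('/[.-]{2,}/', ' ', $text);

  // Remaining single dots, underscores and dashes are removed.
  $text = preg_replace('/[._-]+/', '', $text);

  // Any other punctuation or whitespace acts as a word boundary.
  $text = preg_replace('/[[:punct:][:space:]]+/', ' ', $text);

  // Truncate each word to at most 50 characters.
  $words = array_map(fn($word) => mb_substr($word, 0, 50), explode(' ', $text));
  return implode(' ', $words);
}
```

Under these simplified classes, `20/03/1984` and `20-03-1984` both reduce to the same token, which is why a search for one date format can match the other.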

Parameters

string $text: Text to simplify.

string|null $langcode: (optional) Language code for the language of $text, if known.

Return value

string Simplified and processed text.

Overrides SearchTextProcessorInterface::analyze

See also

hook_search_preprocess()

1 call to SearchTextProcessor::analyze()
SearchTextProcessor::process in core/modules/search/src/SearchTextProcessor.php
Processes text into words for indexing.

File

core/modules/search/src/SearchTextProcessor.php, line 64

Class

SearchTextProcessor
Processes search text for indexing.

Namespace

Drupal\search

Code

public function analyze(string $text, ?string $langcode = NULL) : string {

  // Decode entities to UTF-8.
  $text = Html::decodeEntities($text);

  // Lowercase.
  $text = mb_strtolower($text);

  // Remove diacritics.
  $text = $this->transliteration->removeDiacritics($text);

  // Call an external processor for word handling.
  $this->invokePreprocess($text, $langcode);

  // Simple CJK handling.
  if ($this->configFactory->get('search.settings')->get('index.overlap_cjk')) {
    $text = preg_replace_callback('/[' . self::PREG_CLASS_CJK . ']+/u', [$this, 'expandCjk'], $text);
  }

  // To improve searching for numerical data such as dates, IP addresses
  // or version numbers, we consider a group of numerical characters
  // separated only by punctuation characters to be one piece.
  // This also means that searching for e.g. '20/03/1984' also returns
  // results with '20-03-1984' in them.
  // Readable regexp: ([number]+)[punctuation]+(?=[number])
  $text = preg_replace('/([' . self::PREG_CLASS_NUMBERS . ']+)[' . self::PREG_CLASS_PUNCTUATION . ']+(?=[' . self::PREG_CLASS_NUMBERS . '])/u', '\\1', $text);

  // Groups of multiple dots or dashes are word boundaries and are replaced
  // with a space. The unicode modifier is not needed here: ASCII characters
  // (0-127) cannot match any byte of a multibyte UTF-8 character, because
  // the leftmost bit of each of those bytes is 1.
  $text = preg_replace('/[.-]{2,}/', ' ', $text);

  // The dot, underscore and dash are simply removed. This allows meaningful
  // search behavior with acronyms and URLs. See unicode note directly above.
  $text = preg_replace('/[._-]+/', '', $text);

  // With the exception of the rules above, we consider all punctuation,
  // marks, spacers, etc, to be a word boundary.
  $text = preg_replace('/[' . Unicode::PREG_CLASS_WORD_BOUNDARY . ']+/u', ' ', $text);

  // Truncate each word to 50 characters.
  $words = explode(' ', $text);
  array_walk($words, [$this, 'truncate']);
  $text = implode(' ', $words);
  return $text;
}
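
The final step relies on `array_walk()` calling its callback with each element by reference, so the callback can shorten words in place. The following standalone sketch shows that pattern. `sketch_truncate()` is a hypothetical helper written for illustration; the body of the real `SearchTextProcessor::truncate()` is not shown on this page and may do more.

```php
<?php

// Hypothetical helper; the real SearchTextProcessor::truncate() may differ.
// array_walk() passes the element by reference, followed by its key.
function sketch_truncate(string &$word, int $key, int $max = 50): void {
  if (mb_strlen($word) > $max) {
    $word = mb_substr($word, 0, $max);
  }
}

// One 60-character word and one short word, separated by a space.
$words = explode(' ', str_repeat('a', 60) . ' short');
array_walk($words, 'sketch_truncate');
$text = implode(' ', $words);
```

After the walk, the first word is 50 characters long and the second is unchanged. Because the text is split only on single spaces, this works as intended only after the earlier steps have collapsed all word boundaries to spaces.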