protected function Tokenizer::simplifyText in Search API 8

Simplifies a string according to indexing rules.

Parameters

string $text: The text to simplify.

Return value

string The text with tokens split by single spaces.

See also

search_simplify()

2 calls to Tokenizer::simplifyText()
Tokenizer::process in src/Plugin/search_api/processor/Tokenizer.php
Processes a single string value.
Tokenizer::processFieldValue in src/Plugin/search_api/processor/Tokenizer.php
Processes a single text element in a field.

File

src/Plugin/search_api/processor/Tokenizer.php, line 245

Class

Tokenizer
Splits text into individual words for searching.

Namespace

Drupal\search_api\Plugin\search_api\processor

Code

protected function simplifyText($text) {

  // Optionally apply simple CJK handling to the text.
  if ($this->configuration['overlap_cjk']) {
    $text = preg_replace_callback('/[' . $this->getPregClassCjk() . ']+/u', [$this, 'expandCjk'], $text);
  }

  // To improve searching for numerical data such as dates, IP addresses or
  // version numbers, we consider a group of numerical characters separated
  // only by punctuation characters to be one piece. This also means, for
  // example, that searching for "20/03/1984" also returns results with
  // "20-03-1984" in them.
  // Readable regular expression: "([number]+)[punctuation]+(?=[number])".
  $text = preg_replace('/([' . $this->getPregClassNumbers() . ']+)[' . $this->getPregClassPunctuation() . ']+(?=[' . $this->getPregClassNumbers() . '])/u', '\\1', $text);

  // A group of multiple ignored characters is still treated as whitespace.
  $text = preg_replace('/[' . $this->ignored . ']{2,}/u', ' ', $text);

  // Remove all other instances of ignored characters.
  $text = preg_replace('/[' . $this->ignored . ']+/u', '', $text);

  // Finally, convert all characters we want to treat as word boundaries to
  // plain spaces.
  $text = preg_replace('/[' . $this->spaces . ']+/u', ' ', $text);
  return trim($text);
}
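
Example

The steps in the method above can be tried outside Drupal with a minimal sketch like the following. The function name sketch_simplify_text() and the character classes in it are hypothetical, simplified stand-ins: the real method builds its classes from getPregClassNumbers(), getPregClassPunctuation() and the $ignored and $spaces properties, whose values are not shown on this page. The optional CJK step is left out.

```php
<?php

/**
 * Standalone sketch of the simplification steps (not the Drupal method).
 *
 * All character classes here are assumptions chosen for illustration only.
 */
function sketch_simplify_text(string $text, string $ignored = "'", string $spaces = '\s.,;:!?\/\-'): string {
  // Assumed stand-ins for getPregClassNumbers() / getPregClassPunctuation().
  $numbers = '0-9';
  $punctuation = '.,\/\-';

  // Join digit groups separated only by punctuation, so that "20/03/1984"
  // and "20-03-1984" end up as the same token.
  $text = preg_replace('/([' . $numbers . ']+)[' . $punctuation . ']+(?=[' . $numbers . '])/u', '\\1', $text);

  // Two or more consecutive ignored characters act as whitespace.
  $text = preg_replace('/[' . $ignored . ']{2,}/u', ' ', $text);

  // Any remaining single ignored character is removed outright.
  $text = preg_replace('/[' . $ignored . ']+/u', '', $text);

  // Collapse every run of word-boundary characters into one plain space.
  $text = preg_replace('/[' . $spaces . ']+/u', ' ', $text);

  return trim($text);
}

// Both date spellings reduce to the same token.
var_dump(sketch_simplify_text('20/03/1984') === sketch_simplify_text('20-03-1984'));

// Prints: Released 20031984 dont panic
echo sketch_simplify_text("Released 20/03/1984, don't  panic!"), "\n";
```

With these assumed classes the apostrophe in "don't" is dropped rather than turned into a space, while the comma, the double space and the trailing "!" all collapse into single spaces that trim() then strips from the ends. Results with the actual plugin depend on its configured classes and may differ.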