protected function Tokenizer::simplifyText in Search API 8
Simplifies a string according to indexing rules.
Parameters
string $text: The text to simplify.
Return value
string The text with tokens split by single spaces.
2 calls to Tokenizer::simplifyText()
- Tokenizer::process in src/Plugin/search_api/processor/Tokenizer.php - Processes a single string value.
- Tokenizer::processFieldValue in src/Plugin/search_api/processor/Tokenizer.php - Processes a single text element in a field.
File
- src/Plugin/search_api/processor/Tokenizer.php, line 245
Class
- Tokenizer: Splits text into individual words for searching.
Namespace
Drupal\search_api\Plugin\search_api\processor
Code
protected function simplifyText($text) {
  // Optionally apply simple CJK handling to the text.
  if ($this->configuration['overlap_cjk']) {
    $text = preg_replace_callback('/[' . $this->getPregClassCjk() . ']+/u', [$this, 'expandCjk'], $text);
  }
  // To improve searching for numerical data such as dates, IP addresses or
  // version numbers, we consider a group of numerical characters separated
  // only by punctuation characters to be one piece. This also means, for
  // example, that searching for "20/03/1984" also returns results with
  // "20-03-1984" in them.
  // Readable regular expression: "([number]+)[punctuation]+(?=[number])".
  $text = preg_replace('/([' . $this->getPregClassNumbers() . ']+)[' . $this->getPregClassPunctuation() . ']+(?=[' . $this->getPregClassNumbers() . '])/u', '\\1', $text);
  // A group of multiple ignored characters is still treated as whitespace.
  $text = preg_replace('/[' . $this->ignored . ']{2,}/u', ' ', $text);
  // Remove all other instances of ignored characters.
  $text = preg_replace('/[' . $this->ignored . ']+/u', '', $text);
  // Finally, convert all characters we want to treat as word boundaries to
  // plain spaces.
  $text = preg_replace('/[' . $this->spaces . ']+/u', ' ', $text);
  return trim($text);
}
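The non-CJK steps of the method can be tried outside Drupal with a minimal standalone sketch. The function name simplify_text_sketch() and the four character classes below are assumptions made for illustration only. The real processor builds its classes from getPregClassNumbers(), getPregClassPunctuation(), $this->ignored and $this->spaces, which cover far more of Unicode. The CJK overlap step is left out.

```php
<?php

/**
 * Hypothetical stand-in for simplifyText(), without the CJK step.
 *
 * All four character classes are illustrative assumptions, not the values
 * the Tokenizer processor actually uses.
 */
function simplify_text_sketch(string $text): string {
  // Assumed stand-in for getPregClassNumbers().
  $numbers = '0-9';
  // Assumed stand-in for getPregClassPunctuation().
  $punctuation = '.,:\/\-';
  // Assumed stand-in for $this->ignored.
  $ignored = '._\-';
  // Assumed stand-in for $this->spaces.
  $spaces = '\s,;:!?\/()';

  // Join digit groups separated only by punctuation: "20/03/1984" and
  // "20-03-1984" both collapse to "20031984".
  $text = preg_replace('/([' . $numbers . ']+)[' . $punctuation . ']+(?=[' . $numbers . '])/u', '\\1', $text);
  // A run of two or more ignored characters acts as whitespace.
  $text = preg_replace('/[' . $ignored . ']{2,}/u', ' ', $text);
  // Any remaining single ignored character is dropped.
  $text = preg_replace('/[' . $ignored . ']+/u', '', $text);
  // Word-boundary characters become a single plain space.
  $text = preg_replace('/[' . $spaces . ']+/u', ' ', $text);
  return trim($text);
}

echo simplify_text_sketch('20/03/1984'), "\n";        // prints "20031984"
echo simplify_text_sketch('20-03-1984'), "\n";        // prints "20031984"
echo simplify_text_sketch('foo_bar -- v1.2.3'), "\n"; // prints "foobar v123"
```

The order of the two "ignored" replacements matters: the run of two or more characters has to be turned into a space before single characters are removed, otherwise "--" would disappear entirely instead of separating words.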