protected function Tokenizer::expandCjk in Search API 8

Splits CJK (Chinese, Japanese, Korean) text into tokens.

Callback for preg_replace_callback() in simplifyText().

Normally, searches should match exact words, where a word is defined to be a sequence of characters delimited by spaces or punctuation. CJK languages are written in long strings of characters, though, not split up into words. So in order to allow search matching, we split up CJK text into tokens consisting of consecutive, overlapping sequences of characters whose length is equal to the "minimum_word_size" setting. This tokenizing is only done if the "overlap_cjk" setting is enabled.
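
For illustration, assume the "minimum_word_size" setting is 2: a run of four CJK characters then produces three overlapping two-character tokens. The helper below is a standalone sketch of that idea (a hypothetical function, assuming the mbstring extension is available), not the plugin code itself:

// Standalone sketch: split a string into consecutive, overlapping
// substrings of length $min, separated and surrounded by spaces.
function expand_cjk_example(string $str, int $min = 2): string {
  $length = mb_strlen($str);
  if ($length <= $min) {
    return ' ' . $str . ' ';
  }
  $tokens = ' ';
  for ($i = 0; $i + $min <= $length; $i++) {
    $tokens .= mb_substr($str, $i, $min) . ' ';
  }
  return $tokens;
}

// expand_cjk_example('一二三四') returns ' 一二 二三 三四 '.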

Parameters

array $matches: A PCRE match array, containing the complete match as the only element.

Return value

string Tokenized text, with tokens separated by space characters and starting and ending with a space.

See also

search_expand_cjk()

File

src/Plugin/search_api/processor/Tokenizer.php, line 294

Class

Tokenizer
Splits text into individual words for searching.

Namespace

Drupal\search_api\Plugin\search_api\processor

Code

protected function expandCjk(array $matches) {
  $min = $this->configuration['minimum_word_size'];
  $str = $matches[0];
  $length = mb_strlen($str);

  // If the text is no longer than the minimum word size, don't tokenize it.
  if ($length <= $min) {
    return ' ' . $str . ' ';
  }
  $tokens = ' ';

  // Build a FIFO queue of characters.
  $chars = [];
  for ($i = 0; $i < $length; $i++) {

    // Add the next character off the beginning of the string to the queue.
    $current = mb_substr($str, 0, 1);
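    // $current may be a multibyte character, so the byte-based
    // substr()/strlen() below strips exactly the bytes just consumed.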
    $str = substr($str, strlen($current));
    $chars[] = $current;
    if ($i >= $min - 1) {

      // Make a token of $min characters, and add it to the token string.
      $tokens .= implode('', $chars) . ' ';

      // Shift out the first character in the queue.
      array_shift($chars);
    }
  }
  return $tokens;
}
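
The processor only performs this expansion when the "overlap_cjk" setting is enabled. The snippet below is a rough sketch of how a caller such as simplifyText() could hand runs of CJK characters to expandCjk() via preg_replace_callback(); the character class is a simplified stand-in, not the plugin's actual constant.

// Sketch only: a trimmed-down CJK character class (Hangul Jamo, Hiragana,
// Katakana, CJK Unified Ideographs, Hangul syllables).
$cjk = '\x{1100}-\x{11FF}\x{3040}-\x{30FF}\x{4E00}-\x{9FFF}\x{AC00}-\x{D7A3}';
if ($this->configuration['overlap_cjk']) {
  // Each maximal run of CJK characters is passed to expandCjk() as the
  // complete match and replaced by its overlapping tokens.
  $text = preg_replace_callback('/[' . $cjk . ']+/u', [$this, 'expandCjk'], $text);
}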