function search_expand_cjk in Drupal 9

Same name and namespace in other branches
  1. 8 core/modules/search/search.module \search_expand_cjk()
  2. 4 modules/search.module \search_expand_cjk()
  3. 5 modules/search/search.module \search_expand_cjk()
  4. 6 modules/search/search.module \search_expand_cjk()
  5. 7 modules/search/search.module \search_expand_cjk()

Splits CJK (Chinese, Japanese, Korean) text into tokens.

The Search module matches exact words, where a word is defined as a sequence of characters delimited by spaces or punctuation. CJK languages, however, are written as long strings of characters that are not split into words. To allow search matching, CJK text is therefore split into tokens consisting of consecutive, overlapping sequences of characters whose length equals the 'minimum_word_size' setting (read from 'index.minimum_word_size' in search.settings). This tokenizing is only done if the 'overlap_cjk' setting is TRUE.
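To make the overlapping tokens concrete, here is a minimal standalone sketch of the idea. It is not the core implementation: the function name example_overlap_tokens() is hypothetical, the minimum word size is passed in as a plain argument instead of being read from configuration, and the mbstring extension and a minimum size of at least 1 are assumed.

```php
<?php

/**
 * Hypothetical helper: splits a run of characters into overlapping tokens.
 *
 * Assumes UTF-8 input and $min >= 1. The result starts and ends with a
 * space, like the return value documented for search_expand_cjk().
 */
function example_overlap_tokens(string $text, int $min): string {
  $length = mb_strlen($text, 'UTF-8');

  // Runs no longer than the minimum size are returned as a single token.
  if ($length <= $min) {
    return ' ' . $text . ' ';
  }

  // Slide a window of $min characters across the text, one character
  // at a time, so that neighbouring tokens overlap.
  $tokens = [];
  for ($i = 0; $i + $min <= $length; $i++) {
    $tokens[] = mb_substr($text, $i, $min, 'UTF-8');
  }
  return ' ' . implode(' ', $tokens) . ' ';
}

// A six-character run with a minimum size of 3 yields four tokens.
echo example_overlap_tokens('日本語の文章', 3), "\n";
// Prints " 日本語 本語の 語の文 の文章 "
```

With this scheme a search for any sequence of $min consecutive characters matches a token exactly, which is what an exact-word index needs.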

Parameters

array $matches: This function is a callback for preg_replace_callback(), which is called from search_simplify(). So, $matches is an array of regular expression matches, which means that $matches[0] contains the matched text -- a string of CJK characters to tokenize.

Return value

string Tokenized text, starting and ending with a space character.
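The callback contract described above can be illustrated with a small standalone sketch. Everything named example_* here is hypothetical, the minimum word size is a hard-coded constant, and the pattern \p{Han}+ is a simplifying assumption that only matches Han characters; the character class used by search_simplify() in core is broader.

```php
<?php

// Hypothetical stand-in for the 'minimum_word_size' setting.
const EXAMPLE_MIN_WORD_SIZE = 2;

/**
 * Hypothetical callback with the same shape as search_expand_cjk().
 *
 * $matches[0] holds the matched run of characters to tokenize.
 */
function example_expand_cjk(array $matches): string {
  $str = $matches[0];
  $min = EXAMPLE_MIN_WORD_SIZE;
  $length = mb_strlen($str, 'UTF-8');
  if ($length <= $min) {
    return ' ' . $str . ' ';
  }
  // Emit overlapping tokens, each followed by a space.
  $tokens = ' ';
  for ($i = 0; $i + $min <= $length; $i++) {
    $tokens .= mb_substr($str, $i, $min, 'UTF-8') . ' ';
  }
  return $tokens;
}

/**
 * Hypothetical caller: replaces each run of Han characters with tokens.
 */
function example_simplify(string $text): string {
  // Assumption: \p{Han}+ is used only to keep the sketch short.
  return preg_replace_callback('/\p{Han}+/u', 'example_expand_cjk', $text);
}

echo example_simplify('Drupal 全文検索 module'), "\n";
// Prints "Drupal  全文 文検 検索  module"
```

Because the returned string starts and ends with a space, the tokens stay separated from the neighbouring text. The doubled spaces in the output come from that padding.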

Deprecated

in drupal:9.1.0 and is removed from drupal:10.0.0. Use a custom implementation of SearchTextProcessorInterface instead.

See also

https://www.drupal.org/node/3078162

1 call to search_expand_cjk()
SearchDeprecationTest::testExpandCjk in core/modules/search/tests/src/Kernel/SearchDeprecationTest.php

File

core/modules/search/search.module, line 182
Enables site-wide keyword searching.

Code

function search_expand_cjk($matches) {
  @trigger_error('search_expand_cjk() is deprecated in drupal:9.1.0 and is removed from drupal:10.0.0. Use a custom implementation of SearchTextProcessorInterface instead. See https://www.drupal.org/node/3078162', E_USER_DEPRECATED);
  $min = \Drupal::config('search.settings')
    ->get('index.minimum_word_size');
  $str = $matches[0];
  $length = mb_strlen($str);

  // If the text is not longer than the minimum word size, don't tokenize it.
  if ($length <= $min) {
    return ' ' . $str . ' ';
  }
  $tokens = ' ';

  // Build a FIFO queue of characters.
  $chars = [];
  for ($i = 0; $i < $length; $i++) {

    // Add the next character off the beginning of the string to the queue.
    $current = mb_substr($str, 0, 1);
    $str = substr($str, strlen($current));
    $chars[] = $current;
    if ($i >= $min - 1) {

      // Make a token of $min characters, and add it to the token string.
      $tokens .= implode('', $chars) . ' ';

      // Shift out the first character in the queue.
      array_shift($chars);
    }
  }
  return $tokens;
}
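
The loop above takes one character at a time off the front of the string by pairing mb_substr(), which counts characters, with substr() and strlen(), which count bytes. The following standalone sketch isolates that step. The function name example_shift_char() is hypothetical, and UTF-8 input is assumed.

```php
<?php

/**
 * Hypothetical helper: removes and returns the first character of $str.
 */
function example_shift_char(string &$str): string {
  // mb_substr() returns one whole character, however many bytes it uses.
  $current = mb_substr($str, 0, 1, 'UTF-8');

  // strlen() gives that character's byte length, so the byte-based
  // substr() cuts exactly after it and never splits a character.
  $str = substr($str, strlen($current));
  return $current;
}

$text = '検索';
$first = example_shift_char($text);
// $first is now '検' (3 bytes in UTF-8) and $text is '索'.
```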