public function TokenizerTest::testSearchSimplifyUnicode in Search API 8

Tests that all Unicode characters simplify correctly.

This test uses a Drupal core search file constructed so that the even-numbered lines contain boundary characters and the odd-numbered lines contain valid word characters. (It was generated as a sequence of all Unicode characters, after which the boundary characters, such as punctuation and spaces, were split off onto their own lines.) The even-numbered lines should therefore simplify to nothing, while the odd-numbered lines are split into shorter chunks to verify that simplification does not lose any characters.
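
The line handling described above can be sketched in isolation. This is a minimal illustration, not code from Search API: `partition_lines()` is a hypothetical helper, the sample input is made up, and the chunk size is shortened from 30 to 2 so the result is easy to read. It assumes the mbstring extension is available.

```php
<?php

declare(strict_types=1);

/**
 * Splits text into boundary lines and fixed-size chunks of word lines.
 *
 * Hypothetical helper for illustration only; not part of Search API.
 */
function partition_lines(string $input, int $chunk_size = 30): array {
  $boundary = [];
  $chunks = [];
  foreach (explode("\n", $input) as $key => $line) {
    if ($key % 2) {
      // Odd 0-based key, i.e. an even-numbered line: boundary characters.
      $boundary[] = $line;
      continue;
    }
    // Even 0-based key, i.e. an odd-numbered line: word characters.
    // Chunk by characters, not bytes, so multibyte text is never cut.
    for ($start = 0; $start < mb_strlen($line); $start += $chunk_size) {
      $chunks[] = mb_substr($line, $start, $chunk_size);
    }
  }
  return ['boundary' => $boundary, 'chunks' => $chunks];
}

// Made-up sample: lines 1 and 3 hold word characters, lines 2 and 4 punctuation.
$result = partition_lines("abcde\n.,;\nfgh\n!?", 2);
print_r($result);
```

Because `explode()` produces 0-based keys, an odd key corresponds to an even-numbered line of the file, which is why the test treats `$key % 2` as the boundary-character case.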

See also

\Drupal\search\Tests\SearchSimplifyTest::testSearchSimplifyUnicode()

File

tests/src/Unit/Processor/TokenizerTest.php, line 306

Class

TokenizerTest
Tests the "Tokenizer" processor.

Namespace

Drupal\Tests\search_api\Unit\Processor

Code

public function testSearchSimplifyUnicode() {
  // Set the minimum word size to 1 (to split all CJK characters).
  $this->processor->setConfiguration(['minimum_word_size' => 1]);
  $this->invokeMethod('prepare');
  $input = file_get_contents($this->root . '/core/modules/search/tests/UnicodeTest.txt');
  $basestrings = explode(chr(10), $input);
  $strings = [];
  foreach ($basestrings as $key => $string) {
    if ($key % 2) {
      // Odd 0-based key, i.e. an even-numbered line of the file. It should
      // be removed entirely by simplifyText().
      $simplified = $this->invokeMethod('simplifyText', [$string]);
      $this->assertEquals('', $simplified, "Line {$key} is excluded from the index");
    }
    else {
      // Even 0-based key, i.e. an odd-numbered line of the file. It should
      // contain only word characters (which might be expanded, but never
      // removed). Split this into 30-character chunks, so we don't run into
      // truncation limits.
      $start = 0;
      while ($start < mb_strlen($string)) {
        $newstr = mb_substr($string, $start, 30);

        // Special case: leading zeros are removed from numeric strings,
        // and there's one string in this file that is numbers starting with
        // zero, so prepend a 1 on that string.
        if (preg_match('/^[0-9]+$/', $newstr)) {
          $newstr = '1' . $newstr;
        }
        $strings[] = $newstr;
        $start += 30;
      }
    }
  }
  foreach ($strings as $key => $string) {
    $simplified = $this->invokeMethod('simplifyText', [$string]);
    $this->assertGreaterThanOrEqual(mb_strlen($string), mb_strlen($simplified), "Nothing is removed from string {$key}.");
  }

  // Test the low-numbered ASCII control characters separately. They are not
  // in the text file because they are problematic for diff, especially \0.
  $string = '';
  for ($i = 0; $i < 32; $i++) {
    $string .= chr($i);
  }
  $simplified = $this->invokeMethod('simplifyText', [$string]);
  $this->assertEquals('', $simplified, 'Text simplification works for ASCII control characters.');
}
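
The leading-zero special case in the listing above is easier to see with a small stand-alone sketch. Here `strip_leading_zeros()` is a hypothetical stand-in that mimics only the one effect named in the code comment (leading zeros are dropped from numeric strings); the real `simplifyText()` does more, and the sample chunk is invented.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical stand-in: drops leading zeros from purely numeric strings.
 *
 * Mimics only the behavior mentioned in the test's comment, nothing else.
 */
function strip_leading_zeros(string $text): string {
  return preg_match('/^[0-9]+$/', $text) ? ltrim($text, '0') : $text;
}

// Invented sample chunk: digits that start with a zero.
$chunk = '0123456789';

// Without the guard the output is one character shorter than the input,
// so a "nothing is removed" length check would fail.
$unguarded = strip_leading_zeros($chunk);
$unguarded_ok = mb_strlen($unguarded) >= mb_strlen($chunk);

// With the guard from the test, the prepended "1" keeps the zero in place.
$guarded_input = '1' . $chunk;
$guarded = strip_leading_zeros($guarded_input);
$guarded_ok = mb_strlen($guarded) >= mb_strlen($guarded_input);

var_dump($unguarded_ok, $guarded_ok);
```

Prepending a digit changes the string under test but not what is being verified: the check compares lengths only, so the extra `1` counts on both sides of the comparison.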