public function TokenizerTest::testSearchSimplifyUnicode in Search API 8
Tests that all Unicode characters simplify correctly.
This test uses a Drupal core search file constructed so that the even-numbered lines contain boundary characters and the odd-numbered lines contain valid word characters. (It was generated as a sequence of all the Unicode characters, and then the boundary characters, such as punctuation and spaces, were split off into their own lines.) The even-numbered lines should therefore simplify to nothing. The odd-numbered lines are split into shorter chunks, and the test verifies that simplification does not lose any characters.
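The line layout described above can be sketched independently of Drupal. The following is a minimal illustration, not the test itself: `partitionLines()` and `chunk()` are hypothetical helper names, and the sample input is made up rather than taken from UnicodeTest.txt.

```php
<?php

/**
 * Splits file contents into boundary lines and word lines.
 *
 * Hypothetical helper: with 1-based line numbers, even lines hold boundary
 * characters and odd lines hold word characters.
 */
function partitionLines(string $input): array {
  $boundary = [];
  $words = [];
  foreach (explode("\n", $input) as $index => $line) {
    // $index is 0-based, so an odd index is an even-numbered line.
    if ($index % 2) {
      $boundary[] = $line;
    }
    else {
      $words[] = $line;
    }
  }
  return ['boundary' => $boundary, 'words' => $words];
}

/**
 * Splits a string into chunks of at most $size characters (not bytes).
 */
function chunk(string $line, int $size = 30): array {
  $chunks = [];
  $length = mb_strlen($line, 'UTF-8');
  for ($start = 0; $start < $length; $start += $size) {
    $chunks[] = mb_substr($line, $start, $size, 'UTF-8');
  }
  return $chunks;
}

// Made-up sample data: two word lines and two boundary lines.
$parts = partitionLines("abc\n.,;\ndef\n!?");
// 65 multibyte characters split into chunks of 30, 30 and 5 characters.
$chunks = chunk(str_repeat('é', 65));
```

Counting characters with the `mb_*` functions rather than bytes matters here, because most of the characters in a Unicode test file occupy more than one byte in UTF-8.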
See also
\Drupal\search\Tests\SearchSimplifyTest::testSearchSimplifyUnicode()
File
- tests/src/Unit/Processor/TokenizerTest.php, line 306
Class
- TokenizerTest
- Tests the "Tokenizer" processor.
Namespace
Drupal\Tests\search_api\Unit\Processor
Code
public function testSearchSimplifyUnicode() {
  // Set the minimum word size to 1 (to split all CJK characters).
  $this->processor->setConfiguration([
    'minimum_word_size' => 1,
  ]);
  $this->invokeMethod('prepare');
  $input = file_get_contents($this->root . '/core/modules/search/tests/UnicodeTest.txt');
  $basestrings = explode(chr(10), $input);
  $strings = [];
  foreach ($basestrings as $key => $string) {
    if ($key % 2) {
      // Even line, should be removed by simplifyText().
      $simplified = $this->invokeMethod('simplifyText', [$string]);
      $this->assertEquals('', $simplified, "Line {$key} is excluded from the index");
    }
    else {
      // Odd line, should be word characters (which might be expanded, but
      // never removed). Split this into 30-character chunks, so we don't run
      // into limits of truncation.
      $start = 0;
      while ($start < mb_strlen($string)) {
        $newstr = mb_substr($string, $start, 30);
        // Special case: leading zeros are removed from numeric strings,
        // and there's one string in this file that is numbers starting with
        // zero, so prepend a 1 on that string.
        if (preg_match('/^[0-9]+$/', $newstr)) {
          $newstr = '1' . $newstr;
        }
        $strings[] = $newstr;
        $start += 30;
      }
    }
  }
  foreach ($strings as $key => $string) {
    $simplified = $this->invokeMethod('simplifyText', [$string]);
    $this->assertGreaterThanOrEqual(mb_strlen($string), mb_strlen($simplified), "Nothing is removed from string {$key}.");
  }
  // Test the low-numbered ASCII control characters separately. They are not
  // in the text file because they are problematic for diff, especially \0.
  $string = '';
  for ($i = 0; $i < 32; $i++) {
    $string .= chr($i);
  }
  $this->assertEquals('', $this->invokeMethod('simplifyText', [$string]), 'Text simplification works for ASCII control characters.');
}
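The test reaches `prepare()` and `simplifyText()` through `invokeMethod()`, which suggests those methods are not public. A helper of that kind is commonly built on PHP's reflection API. The sketch below shows the general technique only and is not the module's actual helper; `invokeNonPublic()` and the `Demo` class are hypothetical names, and `Demo::simplify()` is a toy stand-in rather than the processor's real simplification logic.

```php
<?php

/**
 * Stand-in class with a protected method (hypothetical, for illustration).
 */
class Demo {

  protected function simplify(string $text): string {
    // Toy behavior only: strip ASCII control characters, then trim spaces.
    return trim(preg_replace('/[\x00-\x1F]/', '', $text));
  }

}

/**
 * Calls a protected or private method on an object via reflection.
 */
function invokeNonPublic(object $object, string $method, array $args = []) {
  $reflection = new ReflectionMethod($object, $method);
  // Before PHP 8.1 a non-public method must be made accessible explicitly.
  if (PHP_VERSION_ID < 80100) {
    $reflection->setAccessible(TRUE);
  }
  return $reflection->invokeArgs($object, $args);
}

// Build the same input as the test: all ASCII control characters 0 to 31.
$controls = implode('', array_map('chr', range(0, 31)));
$result = invokeNonPublic(new Demo(), 'simplify', [$controls]);
```

Keeping the reflection call in one helper lets each assertion stay a single line, as in the listing above.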