class SearchTokenizerTest in Drupal 9

Same name and namespace in other branches
  1. 8 core/modules/search/tests/src/Kernel/SearchTokenizerTest.php \Drupal\Tests\search\Kernel\SearchTokenizerTest

Tests that CJK tokenizer works as intended.

@group search
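
The testTokenizer() method shown below sets index.minimum_word_size to 1 and expects every CJK character to come back as its own space-separated token. As a rough illustration of that expectation, here is a minimal sketch that assumes overlapping tokenization means a sliding window of a fixed number of characters. The function name overlap_tokens() is hypothetical and this is not the search module's actual implementation.

```php
<?php

/**
 * Splits a run of characters into overlapping tokens of $size characters.
 *
 * Hypothetical helper for illustration only; not part of the search module.
 */
function overlap_tokens(string $text, int $size): string {
  // Split on character boundaries (the /u flag makes the pattern UTF-8 aware).
  $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY) ?: [];
  $count = count($chars);

  // Nothing to split if the window is invalid or covers the whole string.
  if ($size < 1 || $count <= $size) {
    return $text;
  }

  // Slide a window of $size characters across the string, one step at a time.
  $tokens = [];
  for ($i = 0; $i + $size <= $count; $i++) {
    $tokens[] = implode('', array_slice($chars, $i, $size));
  }
  return implode(' ', $tokens);
}

echo overlap_tokens('漢字検索', 1), "\n"; // 漢 字 検 索
echo overlap_tokens('漢字検索', 2), "\n"; // 漢字 字検 検索
```

With a window of 1, the output is the input characters joined by spaces, which is the same shape as the $expected string the test builds with implode(' ', $chars).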

File

core/modules/search/tests/src/Kernel/SearchTokenizerTest.php, line 13

Namespace

Drupal\Tests\search\Kernel
class SearchTokenizerTest extends KernelTestBase {

  /**
   * {@inheritdoc}
   */
  protected static $modules = [
    'search',
  ];

  /**
   * Verifies that strings of CJK characters are tokenized.
   *
   * The text analysis function does special things with numbers, symbols
   * and punctuation. So we only test that CJK characters that are not in these
   * character classes are tokenized properly. See PREG_CLASS_CJK for more
   * information.
   */
  public function testTokenizer() {

    // Set the minimum word size to 1 (to split all CJK characters) and make
    // sure CJK tokenizing is turned on.
    $this
      ->config('search.settings')
      ->set('index.minimum_word_size', 1)
      ->set('index.overlap_cjk', TRUE)
      ->save();

    // Create a string of CJK characters from various character ranges in the
    // Unicode tables.
    // Beginnings of the character ranges.
    $starts = [
      'CJK unified' => 0x4e00,
      'CJK Ext A' => 0x3400,
      'CJK Compat' => 0xf900,
      'Hangul Jamo' => 0x1100,
      'Hangul Ext A' => 0xa960,
      'Hangul Ext B' => 0xd7b0,
      'Hangul Compat' => 0x3131,
      'Half non-punct 1' => 0xff21,
      'Half non-punct 2' => 0xff41,
      'Half non-punct 3' => 0xff66,
      'Hangul Syllables' => 0xac00,
      'Hiragana' => 0x3040,
      'Katakana' => 0x30a1,
      'Katakana Ext' => 0x31f0,
      'CJK Reserve 1' => 0x20000,
      'CJK Reserve 2' => 0x30000,
      'Bopomofo' => 0x3100,
      'Bopomofo Ext' => 0x31a0,
      'Lisu' => 0xa4d0,
      'Yi' => 0xa000,
    ];

    // Ends of the character ranges.
    $ends = [
      'CJK unified' => 0x9fcf,
      'CJK Ext A' => 0x4dbf,
      'CJK Compat' => 0xfaff,
      'Hangul Jamo' => 0x11ff,
      'Hangul Ext A' => 0xa97f,
      'Hangul Ext B' => 0xd7ff,
      'Hangul Compat' => 0x318e,
      'Half non-punct 1' => 0xff3a,
      'Half non-punct 2' => 0xff5a,
      'Half non-punct 3' => 0xffdc,
      'Hangul Syllables' => 0xd7af,
      'Hiragana' => 0x309f,
      'Katakana' => 0x30ff,
      'Katakana Ext' => 0x31ff,
      'CJK Reserve 1' => 0x2fffd,
      'CJK Reserve 2' => 0x3fffd,
      'Bopomofo' => 0x312f,
      'Bopomofo Ext' => 0x31b7,
      'Lisu' => 0xa4fd,
      'Yi' => 0xa48f,
    ];

    // Generate characters consisting of starts, midpoints, and ends.
    $chars = [];
    foreach ($starts as $key => $value) {
      $chars[] = $this
        ->code2utf($starts[$key]);
      $mid = round(0.5 * ($starts[$key] + $ends[$key]));
      $chars[] = $this
        ->code2utf($mid);
      $chars[] = $this
        ->code2utf($ends[$key]);
    }

    // Merge into a string and tokenize.
    $string = implode('', $chars);
    $text_processor = \Drupal::service('search.text_processor');
    assert($text_processor instanceof SearchTextProcessorInterface);
    $out = trim($text_processor
      ->analyze($string));
    $expected = mb_strtolower(implode(' ', $chars));

    // Verify that the output matches what we expect.
    $this
      ->assertEquals($expected, $out, 'CJK tokenizer worked on all supplied CJK characters');
  }

  /**
   * Verifies that strings of non-CJK characters are not tokenized.
   *
   * This is just a sanity check - it verifies that strings of letters are
   * not tokenized.
   */
  public function testNoTokenizer() {

    // Set the minimum word size to 1 (to split all CJK characters) and make
    // sure CJK tokenizing is turned on.
    $this
      ->config('search.settings')
      ->set('index.minimum_word_size', 1)
      ->set('index.overlap_cjk', TRUE)
      ->save();
    $letters = 'abcdefghijklmnopqrstuvwxyz';
    $text_processor = \Drupal::service('search.text_processor');
    assert($text_processor instanceof SearchTextProcessorInterface);
    $out = trim($text_processor
      ->analyze($letters));
    $this
      ->assertEquals($letters, $out, 'Letters are not CJK tokenized');
  }

  /**
   * Like PHP's chr() function, but for Unicode characters.
   *
   * The chr() function only handles single-byte values (0 to 255). This
   * function converts a code point to the corresponding UTF-8 encoded
   * character. Adapted from functions supplied in comments on several
   * functions on php.net.
   */
  public function code2utf($num) {
    if ($num < 128) {
      return chr($num);
    }
    if ($num < 2048) {
      return chr(($num >> 6) + 192) . chr(($num & 63) + 128);
    }
    if ($num < 65536) {
      return chr(($num >> 12) + 224) . chr(($num >> 6 & 63) + 128) . chr(($num & 63) + 128);
    }
    if ($num < 2097152) {
      return chr(($num >> 18) + 240) . chr(($num >> 12 & 63) + 128) . chr(($num >> 6 & 63) + 128) . chr(($num & 63) + 128);
    }
    return '';
  }

}
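
The code2utf() helper above builds UTF-8 byte sequences by hand. The sketch below restates the same bit arithmetic as a standalone function, written with hex masks, and compares its output with PHP's built-in mb_chr(). The name code_point_to_utf8() is hypothetical, and the comparison assumes the mbstring extension is available.

```php
<?php

/**
 * Encodes a Unicode code point as a UTF-8 byte string.
 *
 * Hypothetical standalone restatement of code2utf() for illustration.
 */
function code_point_to_utf8(int $num): string {
  // 1 byte: 0xxxxxxx.
  if ($num < 0x80) {
    return chr($num);
  }
  // 2 bytes: 110xxxxx 10xxxxxx.
  if ($num < 0x800) {
    return chr(($num >> 6) | 0xC0) . chr(($num & 0x3F) | 0x80);
  }
  // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx.
  if ($num < 0x10000) {
    return chr(($num >> 12) | 0xE0)
      . chr((($num >> 6) & 0x3F) | 0x80)
      . chr(($num & 0x3F) | 0x80);
  }
  // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx.
  if ($num < 0x200000) {
    return chr(($num >> 18) | 0xF0)
      . chr((($num >> 12) & 0x3F) | 0x80)
      . chr((($num >> 6) & 0x3F) | 0x80)
      . chr(($num & 0x3F) | 0x80);
  }
  return '';
}

// Check a few code points from different byte-length ranges against mb_chr().
foreach ([0x41, 0x3B1, 0x4E00, 0xAC00, 0x20000] as $code_point) {
  $bytes = code_point_to_utf8($code_point);
  $same = $bytes === mb_chr($code_point, 'UTF-8') ? 'yes' : 'no';
  printf("U+%04X -> %s (matches mb_chr: %s)\n", $code_point, bin2hex($bytes), $same);
}
```

Each leading byte carries a length prefix (110, 1110 or 11110) and every continuation byte starts with 10, which is why the shifts move in steps of 6 bits.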

Members

Name Modifiers Type Description Overrides
AssertContentTrait::$content protected property The current raw content.
AssertContentTrait::$drupalSettings protected property The drupalSettings value from the current raw $content.
AssertContentTrait::$elements protected property The XML structure parsed from the current raw $content. 1
AssertContentTrait::$plainTextContent protected property The plain-text content of raw $content (text nodes).
AssertContentTrait::assertEscaped protected function Passes if the raw text IS found escaped on the loaded page, fail otherwise.
AssertContentTrait::assertField protected function Asserts that a field exists with the given name or ID.
AssertContentTrait::assertFieldById protected function Asserts that a field exists with the given ID and value.
AssertContentTrait::assertFieldByName protected function Asserts that a field exists with the given name and value.
AssertContentTrait::assertFieldByXPath protected function Asserts that a field exists in the current page by the given XPath.
AssertContentTrait::assertFieldChecked protected function Asserts that a checkbox field in the current page is checked.
AssertContentTrait::assertFieldsByValue protected function Asserts that a field exists in the current page with a given Xpath result.
AssertContentTrait::assertLink protected function Passes if a link with the specified label is found.
AssertContentTrait::assertLinkByHref protected function Passes if a link containing a given href (part) is found.
AssertContentTrait::assertNoDuplicateIds protected function Asserts that each HTML ID is used for just a single element.
AssertContentTrait::assertNoEscaped protected function Passes if the raw text IS NOT found escaped on the loaded page, fail otherwise.
AssertContentTrait::assertNoField protected function Asserts that a field does not exist with the given name or ID.
AssertContentTrait::assertNoFieldById protected function Asserts that a field does not exist with the given ID and value.
AssertContentTrait::assertNoFieldByName protected function Asserts that a field does not exist with the given name and value.
AssertContentTrait::assertNoFieldByXPath protected function Asserts that a field does not exist or its value does not match, by XPath.
AssertContentTrait::assertNoFieldChecked protected function Asserts that a checkbox field in the current page is not checked.
AssertContentTrait::assertNoLink protected function Passes if a link with the specified label is not found.
AssertContentTrait::assertNoLinkByHref protected function Passes if a link containing a given href (part) is not found.
AssertContentTrait::assertNoLinkByHrefInMainRegion protected function Passes if a link containing a given href is not found in the main region.
AssertContentTrait::assertNoOption protected function Asserts that a select option in the current page does not exist.
AssertContentTrait::assertNoOptionSelected protected function Asserts that a select option in the current page is not checked.
AssertContentTrait::assertNoPattern protected function Triggers a pass if the perl regex pattern is not found in raw content.
AssertContentTrait::assertNoRaw protected function Passes if the raw text is NOT found on the loaded page, fail otherwise.
AssertContentTrait::assertNoText protected function Passes if the page (with HTML stripped) does not contain the text.
AssertContentTrait::assertNoTitle protected function Pass if the page title is not the given string.
AssertContentTrait::assertNoUniqueText protected function Passes if the text is found MORE THAN ONCE on the text version of the page.
AssertContentTrait::assertOption protected function Asserts that a select option in the current page exists.
AssertContentTrait::assertOptionByText protected function Asserts that a select option with the visible text exists.
AssertContentTrait::assertOptionSelected protected function Asserts that a select option in the current page is checked.
AssertContentTrait::assertOptionSelectedWithDrupalSelector protected function Asserts that a select option in the current page is checked.
AssertContentTrait::assertOptionWithDrupalSelector protected function Asserts that a select option in the current page exists.
AssertContentTrait::assertPattern protected function Triggers a pass if the Perl regex pattern is found in the raw content.
AssertContentTrait::assertRaw protected function Passes if the raw text IS found on the loaded page, fail otherwise.
AssertContentTrait::assertText protected function Passes if the page (with HTML stripped) contains the text.
AssertContentTrait::assertTextHelper protected function Helper for assertText and assertNoText.
AssertContentTrait::assertTextPattern protected function Asserts that a Perl regex pattern is found in the plain-text content.
AssertContentTrait::assertThemeOutput protected function Asserts themed output.
AssertContentTrait::assertTitle protected function Pass if the page title is the given string.
AssertContentTrait::assertUniqueText protected function Passes if the text is found ONLY ONCE on the text version of the page.
AssertContentTrait::assertUniqueTextHelper protected function Helper for assertUniqueText and assertNoUniqueText.
AssertContentTrait::buildXPathQuery protected function Builds an XPath query.
AssertContentTrait::constructFieldXpath protected function Helper: Constructs an XPath for the given set of attributes and value.
AssertContentTrait::cssSelect protected function Searches elements using a CSS selector in the raw content.
AssertContentTrait::getAllOptions protected function Get all option elements, including nested options, in a select.
AssertContentTrait::getDrupalSettings protected function Gets the value of drupalSettings for the currently-loaded page.
AssertContentTrait::getRawContent protected function Gets the current raw content.
AssertContentTrait::getSelectedItem protected function Get the selected value from a select field.
AssertContentTrait::getTextContent protected function Retrieves the plain-text content from the current raw content.
AssertContentTrait::getUrl protected function Get the current URL from the cURL handler. 1
AssertContentTrait::parse protected function Parse content returned from curlExec using DOM and SimpleXML.
AssertContentTrait::removeWhiteSpace protected function Removes all white-space between HTML tags from the raw content.
AssertContentTrait::setDrupalSettings protected function Sets the value of drupalSettings for the currently-loaded page.
AssertContentTrait::setRawContent protected function Sets the raw content (e.g. HTML).
AssertContentTrait::xpath protected function Performs an xpath search on the contents of the internal browser.
AssertLegacyTrait::assert Deprecated protected function
AssertLegacyTrait::assertEqual Deprecated protected function
AssertLegacyTrait::assertIdentical Deprecated protected function
AssertLegacyTrait::assertIdenticalObject Deprecated protected function
AssertLegacyTrait::assertNotEqual Deprecated protected function
AssertLegacyTrait::assertNotIdentical Deprecated protected function
AssertLegacyTrait::pass Deprecated protected function
AssertLegacyTrait::verbose Deprecated protected function
ConfigTestTrait::configImporter protected function Returns a ConfigImporter object to import test configuration.
ConfigTestTrait::copyConfig protected function Copies configuration objects from source storage to target storage.
ExtensionListTestTrait::getModulePath protected function Gets the path for the specified module.
ExtensionListTestTrait::getThemePath protected function Gets the path for the specified theme.
KernelTestBase::$backupGlobals protected property Back up and restore any global variables that may be changed by tests.
KernelTestBase::$backupStaticAttributes protected property Back up and restore static class properties that may be changed by tests.
KernelTestBase::$backupStaticAttributesBlacklist protected property Contains a few static class properties for performance.
KernelTestBase::$classLoader protected property
KernelTestBase::$configImporter protected property @todo Move into Config test base class. 7
KernelTestBase::$configSchemaCheckerExclusions protected static property An array of config object names that are excluded from schema checking.
KernelTestBase::$container protected property
KernelTestBase::$databasePrefix protected property
KernelTestBase::$preserveGlobalState protected property Do not forward any global state from the parent process to the processes that run the actual tests.
KernelTestBase::$root protected property The app root.
KernelTestBase::$runTestInSeparateProcess protected property Kernel tests are run in separate processes because they allow autoloading of code from extensions. Running the test in a separate process isolates this behavior from other tests. Subclasses should not override this property.
KernelTestBase::$siteDirectory protected property
KernelTestBase::$strictConfigSchema protected property Set to TRUE to strict check all configuration saved. 6
KernelTestBase::$vfsRoot protected property The virtual filesystem root directory.
KernelTestBase::assertPostConditions protected function 1
KernelTestBase::bootEnvironment protected function Bootstraps a basic test environment.
KernelTestBase::bootKernel private function Bootstraps a kernel for a test.
KernelTestBase::config protected function Configuration accessor for tests. Returns non-overridden configuration.
KernelTestBase::disableModules protected function Disables modules for this test.
KernelTestBase::enableModules protected function Enables modules for this test.
KernelTestBase::getConfigSchemaExclusions protected function Gets the config schema exclusions for this test.
KernelTestBase::getDatabaseConnectionInfo protected function Returns the Database connection info to be used for this test. 3
KernelTestBase::getDatabasePrefix public function
KernelTestBase::getExtensionsForModules private function Returns Extension objects for $modules to enable.
KernelTestBase::getModulesToEnable private static function Returns the modules to enable for this test.
KernelTestBase::initFileCache protected function Initializes the FileCache component.
KernelTestBase::installConfig protected function Installs default configuration for a given list of modules.
KernelTestBase::installEntitySchema protected function Installs the storage schema for a specific entity type.
KernelTestBase::installSchema protected function Installs database tables from a module schema definition.
KernelTestBase::prepareTemplate protected function
KernelTestBase::register public function Registers test-specific services. Overrides ServiceProviderInterface::register 24
KernelTestBase::render protected function Renders a render array. 1
KernelTestBase::setInstallProfile protected function Sets the install profile and rebuilds the container to update it.
KernelTestBase::setSetting protected function Sets an in-memory Settings variable.
KernelTestBase::setUp protected function 334
KernelTestBase::setUpBeforeClass public static function 1
KernelTestBase::setUpFilesystem protected function Sets up the filesystem, so things like the file directory. 2
KernelTestBase::stop protected function Stops test execution.
KernelTestBase::tearDown protected function 4
KernelTestBase::tearDownCloseDatabaseConnection public function @after
KernelTestBase::vfsDump protected function Dumps the current state of the virtual filesystem to STDOUT.
KernelTestBase::__sleep public function Prevents serializing any properties.
PhpUnitWarnings::$deprecationWarnings private static property Deprecation warnings from PHPUnit to raise with @trigger_error().
PhpUnitWarnings::addWarning public function Converts PHPUnit deprecation warnings to E_USER_DEPRECATED.
RandomGeneratorTrait::$randomGenerator protected property The random generator.
RandomGeneratorTrait::getRandomGenerator protected function Gets the random generator for the utility methods.
RandomGeneratorTrait::randomMachineName protected function Generates a unique random string containing letters and numbers. 1
RandomGeneratorTrait::randomObject public function Generates a random PHP object.
RandomGeneratorTrait::randomString public function Generates a pseudo-random string of ASCII characters of codes 32 to 126.
RandomGeneratorTrait::randomStringValidate public function Callback for random string validation.
SearchTokenizerTest::$modules protected static property Modules to enable. Overrides KernelTestBase::$modules
SearchTokenizerTest::code2utf public function Like PHP's chr() function, but for Unicode characters.
SearchTokenizerTest::testNoTokenizer public function Verifies that strings of non-CJK characters are not tokenized.
SearchTokenizerTest::testTokenizer public function Verifies that strings of CJK characters are tokenized.
StorageCopyTrait::replaceStorageContents protected static function Copy the configuration from one storage to another and remove stale items.
TestRequirementsTrait::checkModuleRequirements private function Checks missing module requirements.
TestRequirementsTrait::checkRequirements protected function Check module requirements for the Drupal use case. 1
TestRequirementsTrait::getDrupalRoot protected static function Returns the Drupal root directory.