function porterstemmer_prestemming in Porter-Stemmer 7

Same name and namespace in other branches
  1. 6.2 porterstemmer.module \porterstemmer_prestemming()

Pre-processes a word for the Porter Stemmer 2 algorithm.

Checks whether the word is too short to stem, removes an initial apostrophe, and changes y to Y (so that it is treated as a consonant rather than a vowel) when it occurs at the start of the word or directly after a vowel. It then calculates the start positions of the R1 and R2 regions of the word.
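
The y-to-Y marking step can be sketched on its own as follows. This is a minimal illustration, not the module's code: the function name sketch_mark_consonant_y() and the constant SKETCH_VOWEL are hypothetical, and the vowel class [aeiouy] is an assumption about what PORTERSTEMMER_VOWEL contains.

```php
<?php
// Assumed vowel class; the module's PORTERSTEMMER_VOWEL may be defined differently.
const SKETCH_VOWEL = '[aeiouy]';

/**
 * Hypothetical helper: marks each y that should count as a consonant as Y.
 */
function sketch_mark_consonant_y($word) {
  // A y at the start of the word is always treated as a consonant.
  $word = preg_replace('/^y/', 'Y', $word);

  // Replace one "vowel followed by y" pair per pass (limit = 1) and repeat
  // until a pass changes nothing. A Y that was already marked no longer
  // matches the lowercase vowel class, so it cannot trigger another match.
  do {
    $before = $word;
    $word = preg_replace('/(' . SKETCH_VOWEL . ')y/', '$1Y', $word, 1);
  } while ($before !== $word);

  return $word;
}
```

Under these assumptions, "youth" becomes "Youth" and "played" becomes "plaYed", while "geology" is left unchanged because its y follows a consonant.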

Parameters

string $word: Word to stem, passed by reference. Modified in place unless the word is too short to stem.

int $r1: Returns the start position of the "R1" region in the word.

int $r2: Returns the start position of the "R2" region in the word.

Return value

bool TRUE if the word is too short and stemming should stop, FALSE to continue stemming.

5 calls to porterstemmer_prestemming()
PorterStemmerInternalsUnitTest::testAdministered in ./porterstemmer.test
Test internal steps on the word "administered".
PorterStemmerInternalsUnitTest::testBaked in ./porterstemmer.test
Test internal steps on the word "baked".
PorterStemmerInternalsUnitTest::testGeology in ./porterstemmer.test
Test internal steps on the word "geology".
PorterStemmerInternalsUnitTest::testIesIed in ./porterstemmer.test
Test internal steps on the words "ies" and "ied".
porterstemmer_stem in includes/standard-stemmer.inc
Stems a word, using the Porter Stemmer 2 algorithm.

File

includes/standard-stemmer.inc, line 216
This is an implementation of the Porter 2 Stemming algorithm.

Code

function porterstemmer_prestemming(&$word, &$r1, &$r2) {
  if (porterstemmer_too_short($word)) {
    return TRUE;
  }
  $tmp = $word;

  // Remove initial apostrophe.
  $tmp = preg_replace("/^'/", '', $tmp);
  if (porterstemmer_too_short($tmp)) {
    return TRUE;
  }

  // Make y -> Y if we should treat it as consonant.
  $tmp = preg_replace('/^y/', 'Y', $tmp);
  $before = 'not going to match';
  while ($before != $tmp) {

    // Do this replacement one by one, to avoid unlikely yyyy issues.
    $before = $tmp;

    // Note: the fourth argument below is the replacement limit. Do not use
    // the fifth (count) argument to preg_replace(): it was added in PHP 5.1.0.
    $tmp = preg_replace('/(' . PORTERSTEMMER_VOWEL . ')y/', '$1Y', $tmp, 1);
  }

  // This y/Y step should not have changed the word length.
  $word = $tmp;

  // Find R1 and R2. R1 is the region after the first non-vowel
  // following a vowel. R2 is the region after the first non-vowel
  // following a vowel in R1.
  $max = drupal_strlen($word);
  $r1 = $max;
  $r2 = $max;
  $matches = array();
  $rdef = '/^' . PORTERSTEMMER_NOT_VOWEL . '*' . PORTERSTEMMER_VOWEL . '+(' . PORTERSTEMMER_NOT_VOWEL . ')/';

  // Exceptions to R1: If word begins with 'gener', 'commun', or 'arsen',
  // R1 is the remainder of the word after that prefix.
  if (preg_match('/^(gener|commun|arsen)/', $word, $matches)) {
    $r1 = drupal_strlen($matches[1]);
  }
  elseif (preg_match($rdef, $word, $matches, PREG_OFFSET_CAPTURE)) {
    $r1 = $matches[1][1] + 1;
  }
  $R1 = drupal_substr($word, $r1);
  if ($R1 && preg_match($rdef, $R1, $matches, PREG_OFFSET_CAPTURE)) {
    $r2 = $r1 + $matches[1][1] + 1;
  }
  return FALSE;
}
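
The R1/R2 calculation above can be tried outside Drupal with a small standalone sketch. The names sketch_regions(), SKETCH_V and SKETCH_NV are hypothetical, the character classes are assumed stand-ins for PORTERSTEMMER_VOWEL and PORTERSTEMMER_NOT_VOWEL, and plain strlen()/substr() replace drupal_strlen()/drupal_substr(), so this version is only meant for ASCII input.

```php
<?php
// Assumed character classes; the module's constants may be defined differently.
const SKETCH_V = '[aeiouy]';
const SKETCH_NV = '[^aeiouy]';

/**
 * Hypothetical helper: returns array($r1, $r2) for an ASCII word.
 */
function sketch_regions($word) {
  $len = strlen($word);
  // Default: both regions are empty (they start at the end of the word).
  $r1 = $len;
  $r2 = $len;

  // Leading non-vowels, one or more vowels, then the first non-vowel (captured).
  $rdef = '/^' . SKETCH_NV . '*' . SKETCH_V . '+(' . SKETCH_NV . ')/';

  if (preg_match('/^(gener|commun|arsen)/', $word, $m)) {
    // Special prefixes: R1 starts right after the prefix.
    $r1 = strlen($m[1]);
  }
  elseif (preg_match($rdef, $word, $m, PREG_OFFSET_CAPTURE)) {
    // $m[1][1] is the offset of the captured non-vowel; R1 starts after it.
    $r1 = $m[1][1] + 1;
  }

  // R2 is found by applying the same pattern inside R1.
  $rest = (string) substr($word, $r1);
  if ($rest !== '' && preg_match($rdef, $rest, $m, PREG_OFFSET_CAPTURE)) {
    $r2 = $r1 + $m[1][1] + 1;
  }

  return array($r1, $r2);
}
```

With these assumptions, "beautiful" yields R1 at 5 ("iful") and R2 at 7 ("ul"), and "administered" yields R1 at 2 and R2 at 5 ("istered"). For "beauty", R1 starts at 5 ("y") and R2 is empty, so both the sketch and the listing above report R2 as the word length, 6.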