function porterstemmer_prestemming in Porter-Stemmer 7
Same name and namespace in other branches
- 6.2 porterstemmer.module \porterstemmer_prestemming()
Pre-processes a word for the Porter Stemmer 2 algorithm.
Checks for too-short words, removes an initial apostrophe, and changes y to Y (so that it is not treated as a vowel) when the y is at the start of the word or follows a vowel. Then calculates the start positions of the R1 and R2 regions in the word.
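The y-to-Y marking described above can be illustrated with a minimal standalone sketch. It is not the module's code: `SKETCH_VOWEL` and `sketch_mark_y()` are hypothetical names, and the character class `[aeiouy]` is an assumed stand-in for `PORTERSTEMMER_VOWEL`, whose definition is not shown on this page.

```php
<?php
// Assumption: this character class stands in for PORTERSTEMMER_VOWEL,
// whose actual definition is not shown on this page.
const SKETCH_VOWEL = '[aeiouy]';

/**
 * Hypothetical helper: marks each y that should act as a consonant as Y.
 */
function sketch_mark_y($word) {
  // A y at the start of the word is treated as a consonant.
  $word = preg_replace('/^y/', 'Y', $word);
  // A y directly after a vowel is treated as a consonant. Like the
  // original function, replace one occurrence per pass until nothing changes.
  do {
    $before = $word;
    $word = preg_replace('/(' . SKETCH_VOWEL . ')y/', '$1Y', $word, 1);
  } while ($before !== $word);
  return $word;
}

echo sketch_mark_y('youth'), "\n"; // Youth
echo sketch_mark_y('boy'), "\n";   // boY
echo sketch_mark_y('cry'), "\n";   // cry
```

Because the uppercase Y is outside the assumed vowel class, later steps that test for vowels skip the marked letters.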
Parameters
string $word: Word to stem. Modified in place unless the word is too short to stem.
int $r1: Returns the start position of the "R1" region in the word.
int $r2: Returns the start position of the "R2" region in the word.
Return value
bool TRUE if the word is too short and stemming should stop, FALSE to continue stemming.
5 calls to porterstemmer_prestemming()
- PorterStemmerInternalsUnitTest::testAdministered in ./porterstemmer.test - Test internal steps on the word "administered".
- PorterStemmerInternalsUnitTest::testBaked in ./porterstemmer.test - Test internal steps on the word "baked".
- PorterStemmerInternalsUnitTest::testGeology in ./porterstemmer.test - Test internal steps on the word "geology".
- PorterStemmerInternalsUnitTest::testIesIed in ./porterstemmer.test - Test internal steps on the words "ies" and "ied".
- porterstemmer_stem in includes/standard-stemmer.inc - Stems a word, using the Porter Stemmer 2 algorithm.
File
- includes/standard-stemmer.inc, line 216 - This is an implementation of the Porter 2 Stemming algorithm.
Code
function porterstemmer_prestemming(&$word, &$r1, &$r2) {
  if (porterstemmer_too_short($word)) {
    return TRUE;
  }

  $tmp = $word;

  // Remove initial apostrophe.
  $tmp = preg_replace("/^'/", '', $tmp);
  if (porterstemmer_too_short($tmp)) {
    return TRUE;
  }

  // Make y -> Y if we should treat it as consonant.
  $tmp = preg_replace('/^y/', 'Y', $tmp);
  $before = 'not going to match';
  while ($before != $tmp) {
    // Do this replacement one by one, to avoid unlikely yyyy issues.
    $before = $tmp;
    // Note: do not use the count parameter of preg_replace(); it was only
    // added in PHP 5.1.0. The limit parameter (1) used here is fine.
    $tmp = preg_replace('/(' . PORTERSTEMMER_VOWEL . ')y/', '$1Y', $tmp, 1);
  }

  // This y/Y step should not have changed the word length.
  $word = $tmp;

  // Find R1 and R2. R1 is the region after the first non-vowel
  // following a vowel. R2 is the region after the first non-vowel
  // following a vowel in R1.
  $max = drupal_strlen($word);
  $r1 = $max;
  $r2 = $max;
  $matches = array();
  $rdef = '/^' . PORTERSTEMMER_NOT_VOWEL . '*' . PORTERSTEMMER_VOWEL . '+(' . PORTERSTEMMER_NOT_VOWEL . ')/';

  // Exceptions to R1: If word begins with 'gener', 'commun', or 'arsen',
  // R1 is the remainder of the word.
  if (preg_match('/^(gener|commun|arsen)/', $word, $matches)) {
    $r1 = drupal_strlen($matches[1]);
  }
  elseif (preg_match($rdef, $word, $matches, PREG_OFFSET_CAPTURE)) {
    $r1 = $matches[1][1] + 1;
  }

  $R1 = drupal_substr($word, $r1);
  if ($R1 && preg_match($rdef, $R1, $matches, PREG_OFFSET_CAPTURE)) {
    $r2 = $r1 + $matches[1][1] + 1;
  }

  return FALSE;
}
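
To make the R1/R2 calculation concrete, the following standalone sketch reproduces only the region-finding part of the function outside Drupal. It rests on several assumptions: `SKETCH_V` and `SKETCH_NV` are hypothetical stand-ins for `PORTERSTEMMER_VOWEL` and `PORTERSTEMMER_NOT_VOWEL` (assumed here to be `[aeiouy]` and `[^aeiouy]`), `sketch_regions()` is a hypothetical name, and input is taken to be lowercase ASCII so that `strlen()` and `substr()` can replace `drupal_strlen()` and `drupal_substr()`.

```php
<?php
// Assumptions: stand-ins for PORTERSTEMMER_VOWEL / PORTERSTEMMER_NOT_VOWEL,
// whose actual definitions are not shown on this page.
const SKETCH_V = '[aeiouy]';
const SKETCH_NV = '[^aeiouy]';

/**
 * Hypothetical helper: returns array($r1, $r2) for an ASCII word.
 */
function sketch_regions($word) {
  $max = strlen($word);
  // Both regions default to "empty", i.e. they start at the end of the word.
  $r1 = $max;
  $r2 = $max;
  // Optional leading non-vowels, one or more vowels, then the first non-vowel.
  $rdef = '/^' . SKETCH_NV . '*' . SKETCH_V . '+(' . SKETCH_NV . ')/';

  if (preg_match('/^(gener|commun|arsen)/', $word, $m)) {
    // Exception prefixes: R1 starts right after the prefix.
    $r1 = strlen($m[1]);
  }
  elseif (preg_match($rdef, $word, $m, PREG_OFFSET_CAPTURE)) {
    // $m[1][1] is the offset of the captured non-vowel; R1 starts after it.
    $r1 = $m[1][1] + 1;
  }

  // Apply the same pattern inside R1 to find R2.
  $rest = (string) substr($word, $r1);
  if ($rest !== '' && preg_match($rdef, $rest, $m, PREG_OFFSET_CAPTURE)) {
    $r2 = $r1 + $m[1][1] + 1;
  }
  return array($r1, $r2);
}

print_r(sketch_regions('administered')); // R1 at 2 ("ministered"), R2 at 5 ("istered")
print_r(sketch_regions('baked'));        // R1 at 3 ("ed"), R2 at 5 (empty)
print_r(sketch_regions('generous'));     // R1 at 5 ("ous"), R2 at 8 (empty)
```

A region that starts at the word length is empty, which is why both positions default to `$max` before any match is attempted.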