function _strip_punctuation_utf8 in Bibliography Module 7

Same name and namespace in other branches
  1. 6.2 includes/biblio.util.inc \_strip_punctuation_utf8()
  2. 7.2 includes/biblio.util.inc \_strip_punctuation_utf8()

Copyright (c) 2008, David R. Nadeau, NadeauSoftware.com. All rights reserved.

Strip punctuation characters from UTF-8 text.

Characters stripped from the text include characters in the following Unicode categories:

  - Separators
  - Control characters
  - Formatting characters
  - Surrogates
  - Open and close quotes
  - Open and close brackets
  - Dashes
  - Connectors
  - Number separators
  - Spaces
  - Other punctuation
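To check which of these categories a given character belongs to, PCRE's Unicode property escapes can be tried one at a time. The following is a minimal sketch and not part of the Biblio module: the helper name `_demo_unicode_category` is hypothetical, and it assumes a PHP build whose PCRE has Unicode property support. Surrogates (`Cs`) are left out because they cannot occur in valid UTF-8 input.

```php
<?php
/**
 * Hypothetical helper (not part of Biblio): returns the first matching
 * Unicode category code for a single UTF-8 character, or NULL.
 */
function _demo_unicode_category($char) {
  // Z: separators, Cc: control, Cf: formatting, Pi/Pf: opening and closing
  // quotes, Ps/Pe: opening and closing brackets, Pd: dashes,
  // Pc: connectors, Po: other punctuation.
  $categories = array('Z', 'Cc', 'Cf', 'Pi', 'Pf', 'Ps', 'Pe', 'Pd', 'Pc', 'Po');
  foreach ($categories as $category) {
    // The /u modifier makes PCRE treat the pattern and subject as UTF-8.
    if (preg_match('/^\\p{' . $category . '}\\z/u', $char)) {
      return $category;
    }
  }
  return NULL;
}
```

For instance, a hyphen falls in `Pd`, an underscore in `Pc`, and an exclamation mark in `Po`, while a plain letter matches none of the listed categories.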

Exceptions are made for punctuation characters that occur within URLs (such as [ ] : ; @ & ? and others), within numbers (such as . , % # '), and within words (such as - and ').
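One way to express such exceptions is a negative lookbehind placed right after the category match: the character is consumed only if it is not on an allow-list. The sketch below illustrates the idea with a much smaller allow-list than the real function uses. The helper name and the allow-list are illustrative assumptions, and unlike the real function it replaces matches with an empty string rather than a space.

```php
<?php
/**
 * Hypothetical helper (not part of Biblio): removes "other punctuation"
 * except a few characters that commonly appear inside numbers.
 */
function _demo_strip_other_punctuation($text) {
  // Allow-list for this sketch only: . , % # '
  $keep = '\\.,%#\'';
  // \p{Po} consumes one "other punctuation" character; the lookbehind then
  // rejects the match if that character is on the allow-list.
  return preg_replace('/\\p{Po}(?<![' . $keep . '])/u', '', $text);
}
```

With this sketch, `3.5%! done` followed by an inverted exclamation mark becomes `3.5% done`: the full stop and percent sign survive, while both exclamation marks are removed.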

Parameters: $text, the UTF-8 text to strip.

Return value: the stripped UTF-8 text, with each removed run replaced by a space and consecutive spaces collapsed to one.

See also: http://nadeausoftware.com/articles/2007/9/php_tip_how_strip_punctuation_...

1 call to _strip_punctuation_utf8()
biblio_normalize_title in includes/biblio.util.inc

File

includes/biblio.util.inc, line 287

Code

function _strip_punctuation_utf8($text) {
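  // URL punctuation: brackets, characters stripped when preceded by a space,
  // characters stripped when followed by a space, and the full set that is
  // exempt from blanket removal.
  $urlbrackets = '\\[\\]\\(\\)';
  $urlspacebefore = ':;\'_\\*%@&?!' . $urlbrackets;
  $urlspaceafter = '\\.,:;\'\\-_\\*@&\\/\\\\\\?!#' . $urlbrackets;
  $urlall = '\\.,:;\'\\-_\\*%@&\\/\\\\\\?!#' . $urlbrackets;
  // Quote-like characters that are handled by context.
  $specialquotes = '\'"\\*<>';
  // Number separators: full stops, commas, and the Arabic decimal and
  // thousands separators.
  $fullstop = '\\x{002E}\\x{FE52}\\x{FF0E}';
  $comma = '\\x{002C}\\x{FE50}\\x{FF0C}';
  $arabsep = '\\x{066B}\\x{066C}';
  $numseparators = $fullstop . $comma . $arabsep;
  // Number modifiers: number signs, percent and per-mille signs, and primes.
  $numbersign = '\\x{0023}\\x{FE5F}\\x{FF03}';
  $percent = '\\x{0025}\\x{066A}\\x{FE6A}\\x{FF05}\\x{2030}\\x{2031}';
  $prime = '\\x{2032}\\x{2033}\\x{2034}\\x{2057}';
  $nummodifiers = $numbersign . $percent . $prime;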
  return preg_replace(array(
    // Remove separator, control, formatting, surrogate,
    // open/close quotes.
    '/[\\p{Z}\\p{Cc}\\p{Cf}\\p{Cs}\\p{Pi}\\p{Pf}]/u',
    // Remove other punctuation except special cases.
    '/\\p{Po}(?<![' . $specialquotes . $numseparators . $urlall . $nummodifiers . '])/u',
    // Remove open/close brackets, except URL brackets.
    '/[\\p{Ps}\\p{Pe}](?<![' . $urlbrackets . '])/u',
    // Remove special quotes, dashes, connectors, number separators, and
    // URL characters followed by a space or at the end of the text.
    '/[' . $specialquotes . $numseparators . $urlspaceafter . '\\p{Pd}\\p{Pc}]+((?= )|$)/u',
    // Remove special quotes, connectors, and URL characters
    // preceded by a space or at the start of the text.
    '/((?<= )|^)[' . $specialquotes . $urlspacebefore . '\\p{Pc}]+/u',
    // Remove dashes preceded by a space (or at the start of the text),
    // but not followed by a number or currency symbol.
    '/((?<= )|^)\\p{Pd}+(?![\\p{N}\\p{Sc}])/u',
    // Remove consecutive spaces.
    '/ +/',
  ), ' ', $text);
}
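
When `preg_replace()` is given an array of patterns, it applies them in order, each to the result of the previous one. That is why the final pattern can collapse the spaces left behind by the earlier replacements. The following reduced sketch shows the context-sensitive steps on their own. The function name and the shortened character class are assumptions for illustration, and it handles only ASCII spaces as word boundaries.

```php
<?php
/**
 * Hypothetical helper (not part of Biblio): a reduced version of the
 * context-sensitive steps, applied in sequence by preg_replace().
 */
function _demo_strip_contextual($text) {
  return preg_replace(array(
    // Trailing punctuation: one or more of these characters followed by a
    // space or the end of the text.
    '/[.,:;!?\\p{Pd}]+((?= )|$)/u',
    // Dashes preceded by a space (or at the start of the text), unless a
    // number or currency symbol follows, so "-5" and "-$3" survive.
    '/((?<= )|^)\\p{Pd}+(?![\\p{N}\\p{Sc}])/u',
    // Collapse the runs of spaces left behind by the replacements.
    '/ +/',
  ), ' ', $text);
}
```

Because every match is replaced by a space, the result can carry a leading or trailing space (a sentence-final full stop becomes one, for example), so a caller may want to `trim()` the output.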