function _strip_symbols in Bibliography Module 7

Same name and namespace in other branches
  1. 6.2 includes/biblio.util.inc _strip_symbols()
  2. 7.2 includes/biblio.util.inc _strip_symbols()

Copyright (c) 2008, David R. Nadeau, NadeauSoftware.com. All rights reserved.

Strip symbol characters from UTF-8 text.

Characters stripped from the text include characters in the following Unicode categories:

  Modifier symbols (Sk)
  Private use symbols (Co)
  Math symbols (Sm)
  Other symbols (So)
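As a rough illustration of how these categories map onto PCRE Unicode properties, the sketch below classifies a single character. The helper name symbol_category() is hypothetical and is not part of the module; the two-letter codes are the ones used in the function's patterns.

```php
<?php
// Sketch only: symbol_category() is a hypothetical helper, not module code.
// Returns the two-letter Unicode category code if $char falls in one of the
// four stripped categories, or NULL otherwise.
function symbol_category($char) {
  foreach (array('Sk', 'Co', 'Sm', 'So') as $category) {
    // The /u modifier makes PCRE treat pattern and subject as UTF-8.
    if (preg_match('/^\p{' . $category . '}$/u', $char) === 1) {
      return $category;
    }
  }
  return NULL;
}

var_dump(symbol_category('^'));             // U+005E, a modifier symbol
var_dump(symbol_category("\xC3\x97"));      // U+00D7, a math symbol
var_dump(symbol_category("\xC2\xA9"));      // U+00A9, an other symbol
var_dump(symbol_category("\xEE\x80\x80"));  // U+E000, a private use character
var_dump(symbol_category('$'));             // currency (Sc), not in the list
```

Currency symbols belong to a separate category (Sc), which is why they fall through to NULL here.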

Exceptions are made for math symbols embedded within numbers (such as + - /), math symbols used within URLs (such as = ~), units of measure symbols, and ideograph parts. Currency symbols are not removed.
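The exceptions rely on matching a whole category and then looking back at the character just consumed. A minimal sketch of that idiom follows, assuming a hypothetical helper strip_math_except() and a simplified keep-list; the real function uses longer lists and additional context rules.

```php
<?php
// Sketch only: strip_math_except() is a hypothetical helper, not module code.
// Replaces every math symbol (Sm) with a space unless it appears in $keep.
// $keep is inserted verbatim into a character class, so it must already be
// safe to use there.
function strip_math_except($text, $keep) {
  // \p{Sm} consumes one math symbol; the negative lookbehind then rejects
  // the match if that same character is one of the kept symbols.
  return preg_replace('/\p{Sm}(?<![' . $keep . '])/u', ' ', $text);
}

// U+00D7 (multiplication sign) is removed, "=" is kept.
var_dump(strip_math_except("a \xC3\x97 b = c", '=~'));
// "<" is a math symbol and is not on the keep-list.
var_dump(strip_math_except('x < y', '=~'));
```

Note that the ASCII hyphen-minus and the slash are punctuation rather than math symbols, so a pattern based on \p{Sm} never touches them.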

Parameters

$text: The UTF-8 text to strip.

Return value

The stripped UTF-8 text.
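Because the patterns use the /u modifier, the input needs to be valid UTF-8; PHP's preg functions fail on malformed input, and preg_replace() returns NULL in that case. A caller might guard against this as sketched below. The helper name is_valid_utf8() is hypothetical and is not part of the module.

```php
<?php
// Sketch only: is_valid_utf8() is a hypothetical helper, not module code.
// An empty pattern with /u matches any valid UTF-8 string; on malformed
// input preg_match() returns FALSE instead of 1.
function is_valid_utf8($text) {
  return preg_match('//u', $text) === 1;
}

var_dump(is_valid_utf8("caf\xC3\xA9"));  // "é" encoded as UTF-8
var_dump(is_valid_utf8("caf\xE9"));      // "é" encoded as ISO-8859-1
```

Text that fails the check would have to be converted to UTF-8 before being passed to the function.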

See also: http://nadeausoftware.com/articles/2007/09/php_tip_how_strip_symbol_char...

File

includes/biblio.util.inc, line 358

Code

function _strip_symbols($text) {
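  // Plus signs: ASCII, small, fullwidth, subscript, superscript.
  $plus = '\\+\\x{FE62}\\x{FF0B}\\x{208A}\\x{207A}';
  // Minus-like signs: figure dash, subscript minus, superscript minus.
  $minus = '\\x{2012}\\x{208B}\\x{207B}';
  // Units of measure: degree signs, square foot, squared CJK unit symbols.
  $units = '\\x{00B0}\\x{2103}\\x{2109}\\x{23CD}';
  $units .= '\\x{32CC}-\\x{32CE}';
  $units .= '\\x{3300}-\\x{3357}';
  $units .= '\\x{3371}-\\x{33DF}';
  $units .= '\\x{33FF}';
  // Ideograph parts: CJK and Kangxi radicals, ideographic description
  // characters, CJK symbols, Kanbun, CJK strokes, telegraph symbols, Yi radicals.
  $ideo = '\\x{2E80}-\\x{2EF3}';
  $ideo .= '\\x{2F00}-\\x{2FD5}';
  $ideo .= '\\x{2FF0}-\\x{2FFB}';
  $ideo .= '\\x{3037}-\\x{303F}';
  $ideo .= '\\x{3190}-\\x{319F}';
  $ideo .= '\\x{31C0}-\\x{31CF}';
  $ideo .= '\\x{32C0}-\\x{32CB}';
  $ideo .= '\\x{3358}-\\x{3370}';
  $ideo .= '\\x{33E0}-\\x{33FE}';
  $ideo .= '\\x{A490}-\\x{A4C6}';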
  return preg_replace(array(
    // Remove modifier and private use symbols.
    '/[\\p{Sk}\\p{Co}]/u',
    // Remove math symbols except + - = ~ and fraction slash.
    '/\\p{Sm}(?<![' . $plus . $minus . '=~\\x{2044}])/u',
    // Remove + - if space before, no number or currency after.
    '/((?<= )|^)[' . $plus . $minus . ']+((?![\\p{N}\\p{Sc}])|$)/u',
    // Remove = if space before.
    '/((?<= )|^)=+/u',
    // Remove + - = ~ if space after.
    '/[' . $plus . $minus . '=~]+((?= )|$)/u',
    // Remove other symbols except units and ideograph parts.
    '/\\p{So}(?<![' . $units . $ideo . '])/u',
    // Remove consecutive white space.
    '/ +/',
  ), ' ', $text);
}
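The function passes an array of patterns to preg_replace(), which applies them in order, each one working on the result of the previous one. That is why the space-collapsing pattern comes last. The result is not trimmed, so a space can remain where a symbol started or ended the text. The following sketch shows the same ordering on a reduced pattern list; strip_and_tidy() is a hypothetical helper, and the trim() call is an addition that the module code does not make.

```php
<?php
// Sketch only: strip_and_tidy() is a hypothetical helper, not module code.
function strip_and_tidy($text) {
  $text = preg_replace(array(
    // Step 1: replace every "other symbol" (So) with a space.
    '/\p{So}/u',
    // Step 2: collapse the runs of spaces that step 1 may have produced.
    '/ +/',
  ), ' ', $text);
  // Assumption: the caller wants no leading or trailing space.
  return trim($text);
}

var_dump(strip_and_tidy("Copyright \xC2\xA9 2008"));
var_dump(strip_and_tidy("\xC2\xA9 2008 Nadeau"));
```

Swapping the two patterns would leave the doubled spaces in place, since the collapse step would run before any symbols had been replaced.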