function apachesolr_clean_text in Apache Solr Search 8

Same name and namespace in other branches
  1. 5.2 apachesolr.index.inc \apachesolr_clean_text()
  2. 6.3 apachesolr.module \apachesolr_clean_text()
  3. 6 apachesolr.index.inc \apachesolr_clean_text()
  4. 6.2 apachesolr.index.inc \apachesolr_clean_text()
  5. 7 apachesolr.module \apachesolr_clean_text()

Strip HTML tags and also control characters that cause Jetty/Solr to fail.

4 calls to apachesolr_clean_text()
apachesolr_index_add_tags_to_document in ./apachesolr.index.inc
Extract HTML tag contents from $text and add to boost fields.
apachesolr_index_node_solr_document in ./apachesolr.index.inc
Builds the node-specific information for a Solr document.
apachesolr_term_reference_indexing_callback in ./apachesolr.index.inc
Callback that converts term_reference field into an array
DrupalSolrOfflineUnitTestCase::testContentFilters in tests/apachesolr_base.test
Test ordering of parsed filter positions.

1 string reference to 'apachesolr_clean_text'
apachesolr_fields_default_indexing_callback in ./apachesolr.index.inc
Callback that converts list module field into an array. For every multivalued value we also add a single value to be able to use the stats

File

./apachesolr.module, line 2187
Integration with the Apache Solr search application.

Code

function apachesolr_clean_text($text) {

  // Remove invisible content.
  $text = preg_replace('@<(applet|audio|canvas|command|embed|iframe|map|menu|noembed|noframes|noscript|script|style|svg|video)[^>]*>.*</\\1>@siU', ' ', $text);

  // Add spaces before stripping tags to avoid running words together.
  $text = filter_xss(str_replace(array(
    '<',
    '>',
  ), array(
    ' <',
    '> ',
  ), $text), array());

  // Decode entities and then make safe any < or > characters.
  $text = htmlspecialchars(html_entity_decode($text, ENT_QUOTES, 'UTF-8'), ENT_QUOTES, 'UTF-8');

  // Remove extra spaces.
  $text = preg_replace('/\\s+/s', ' ', $text);

  // Remove white spaces around punctuation marks probably added
  // by the safety operations above. This is not a world wide perfect solution,
  // but a rough attempt for at least US and Western Europe.
  // Pc: Connector punctuation
  // Pd: Dash punctuation
  // Pe: Close punctuation
  // Pf: Final punctuation
  // Pi: Initial punctuation
  // Po: Other punctuation, including ¿?¡!,.:;
  // Ps: Open punctuation
  $text = preg_replace('/\\s(\\p{Pc}|\\p{Pd}|\\p{Pe}|\\p{Pf}|!|\\?|,|\\.|:|;)/s', '$1', $text);
  $text = preg_replace('/(\\p{Ps}|¿|¡)\\s/s', '$1', $text);
  return $text;
}
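
Example

The steps above can be tried outside Drupal with a minimal sketch. The function name sketch_clean_text() is hypothetical, and PHP's strip_tags() stands in for filter_xss($text, array()), which is only available inside Drupal; the two are assumed to behave alike for simple markup but are not equivalent in every edge case.

```php
<?php

// Hypothetical stand-in for apachesolr_clean_text() that runs without Drupal.
// Assumption: strip_tags() replaces filter_xss($text, array()).
function sketch_clean_text($text) {
  // Remove elements whose content is invisible to readers.
  $text = preg_replace('@<(applet|audio|canvas|command|embed|iframe|map|menu|noembed|noframes|noscript|script|style|svg|video)[^>]*>.*</\\1>@siU', ' ', $text);

  // Pad tags with spaces so adjacent words do not run together, then strip.
  $text = strip_tags(str_replace(array('<', '>'), array(' <', '> '), $text));

  // Decode entities, then re-escape so any remaining < or > is safe.
  $text = htmlspecialchars(html_entity_decode($text, ENT_QUOTES, 'UTF-8'), ENT_QUOTES, 'UTF-8');

  // Collapse runs of whitespace into a single space.
  $text = preg_replace('/\\s+/s', ' ', $text);

  // Drop the space before closing punctuation and after opening punctuation.
  $text = preg_replace('/\\s(\\p{Pc}|\\p{Pd}|\\p{Pe}|\\p{Pf}|!|\\?|,|\\.|:|;)/s', '$1', $text);
  $text = preg_replace('/(\\p{Ps}|¿|¡)\\s/s', '$1', $text);
  return $text;
}

// Sample input: markup that would otherwise leave a stray space before "!".
$html = '<p>Hello <b>world</b>!</p><script>alert(1);</script>';
echo sketch_clean_text($html), "\n";
```

Like the function above, the sketch does not trim its result, so text that begins or ends with a tag may keep a single leading or trailing space; callers that need exact boundaries can wrap the call in trim().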