function apachesolr_clean_text in Apache Solr Search 8
Same name and namespace in other branches
- 5.2 apachesolr.index.inc \apachesolr_clean_text()
- 6.3 apachesolr.module \apachesolr_clean_text()
- 6 apachesolr.index.inc \apachesolr_clean_text()
- 6.2 apachesolr.index.inc \apachesolr_clean_text()
- 7 apachesolr.module \apachesolr_clean_text()
Strip HTML tags and also control characters that cause Jetty/Solr to fail.
4 calls to apachesolr_clean_text()
- apachesolr_index_add_tags_to_document in ./apachesolr.index.inc - Extract HTML tag contents from $text and add to boost fields.
- apachesolr_index_node_solr_document in ./apachesolr.index.inc - Builds the node-specific information for a Solr document.
- apachesolr_term_reference_indexing_callback in ./apachesolr.index.inc - Callback that converts term_reference field into an array
- DrupalSolrOfflineUnitTestCase::testContentFilters in tests/apachesolr_base.test - Test ordering of parsed filter positions.
1 string reference to 'apachesolr_clean_text'
- apachesolr_fields_default_indexing_callback in ./apachesolr.index.inc - Callback that converts list module field into an array. For every multivalued value we also add a single value to be able to use the stats
File
- ./apachesolr.module, line 2187 - Integration with the Apache Solr search application.
Code
function apachesolr_clean_text($text) {
  // Remove invisible content.
  $text = preg_replace('@<(applet|audio|canvas|command|embed|iframe|map|menu|noembed|noframes|noscript|script|style|svg|video)[^>]*>.*</\\1>@siU', ' ', $text);
  // Add spaces before stripping tags to avoid running words together.
  $text = filter_xss(str_replace(array('<', '>'), array(' <', '> '), $text), array());
  // Decode entities and then make safe any < or > characters.
  $text = htmlspecialchars(html_entity_decode($text, ENT_QUOTES, 'UTF-8'), ENT_QUOTES, 'UTF-8');
  // Remove extra spaces.
  $text = preg_replace('/\\s+/s', ' ', $text);
  // Remove white spaces around punctuation marks probably added
  // by the safety operations above. This is not a world wide perfect solution,
  // but a rough attempt for at least US and Western Europe.
  // Pc: Connector punctuation
  // Pd: Dash punctuation
  // Pe: Close punctuation
  // Pf: Final punctuation
  // Pi: Initial punctuation
  // Po: Other punctuation, including ¿?¡!,.:;
  // Ps: Open punctuation
  $text = preg_replace('/\\s(\\p{Pc}|\\p{Pd}|\\p{Pe}|\\p{Pf}|!|\\?|,|\\.|:|;)/s', '$1', $text);
  $text = preg_replace('/(\\p{Ps}|¿|¡)\\s/s', '$1', $text);
  return $text;
}
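
Example

The steps in the function above can be tried outside Drupal with a rough stand-alone sketch. Everything here is an assumption for illustration: example_clean_text() is a hypothetical name, the list of "invisible" elements is shortened, and PHP's strip_tags() stands in for filter_xss($text, array()), which is only available inside a bootstrapped Drupal site and may treat malformed markup differently.

```php
<?php
// Hypothetical stand-alone approximation of apachesolr_clean_text().
// Assumption: strip_tags() replaces Drupal's filter_xss($text, array()).
function example_clean_text($text) {
  // Drop elements whose content is never shown to the reader
  // (shortened list, for illustration only).
  $text = preg_replace('@<(iframe|noscript|script|style)[^>]*>.*</\\1>@siU', ' ', $text);
  // Pad tags with spaces so adjacent words do not run together,
  // then remove the tags themselves.
  $text = strip_tags(str_replace(array('<', '>'), array(' <', '> '), $text));
  // Decode entities, then re-encode so any stray < or > stays harmless.
  $text = htmlspecialchars(html_entity_decode($text, ENT_QUOTES, 'UTF-8'), ENT_QUOTES, 'UTF-8');
  // Collapse runs of whitespace to a single space.
  $text = preg_replace('/\\s+/s', ' ', $text);
  // Remove the space left before closing punctuation ...
  $text = preg_replace('/\\s(\\p{Pc}|\\p{Pd}|\\p{Pe}|\\p{Pf}|!|\\?|,|\\.|:|;)/s', '$1', $text);
  // ... and the space left after opening punctuation.
  $text = preg_replace('/(\\p{Ps}|¿|¡)\\s/s', '$1', $text);
  return $text;
}

$html = '<p>Hello (<b>world</b>)!</p><script>alert(1);</script>';
// Prints " Hello (world)! " (leading and trailing space included).
echo example_clean_text($html);
```

The result keeps one leading and one trailing space because no step trims the string; a caller that needs a tight value could wrap the call in trim().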