function _asa_get_attachment_text in Apache Solr Attachments 5
Parse the attachment, returning just the raw text and stripping any garbage characters that could screw up the XML document processing.
1 call to _asa_get_attachment_text()
- apachesolr_attachments_update_index in ./apachesolr_attachments.module - Hook is called by search.module to add things to the search index. In our case we will search content types and add any CCK type that is a file type that we know how to parse and any uploaded file attachments.
File
- ./apachesolr_attachments.module, line 239 - Provides a file attachment search implementation for use with the Apache Solr module
Code
function _asa_get_attachment_text($file) {
  $helper_command = _asa_get_file_helper_command($file->filemime);
  // Empty entries in settings mean that helper is disabled.
  if ($helper_command == '') {
    return '';
  }
  // %file% is a token that is placed in the helper's parameter list to represent
  // the file path to the attachment.
  $helper_command = preg_replace('/%file%/', "{$file->filepath}", $helper_command);
  $helper_command = escapeshellcmd($helper_command);
  $text = shell_exec($helper_command);
  // Strip anything that might make the Solr integration barf.
  // Weird control characters make things behave weirdly, especially in XML.
  $cleaned_text = iconv("utf-8", "utf-8//IGNORE", $text);
  // As per robertDouglass - http://drupal.org/node/335871
  // Bad control character. Do we need to make a hook for text cleanup?
  $cleaned_text = preg_replace('/\\x0C/', '', $cleaned_text);
  return $cleaned_text;
}
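The two steps in the listing above, token substitution and text cleanup, can be sketched in isolation. This is a minimal illustration, not the module's code: the function names `asa_example_build_command` and `asa_example_clean_text` are hypothetical, and the sketch assumes a variant that quotes the file path with `escapeshellarg()` and strips a wider range of control characters than the single form feed removed above.

```php
<?php
// Hypothetical helpers for illustration only; they are not part of the
// apachesolr_attachments module.

/**
 * Substitute the %file% token with a shell-quoted file path.
 */
function asa_example_build_command($helper_command, $filepath) {
  // str_replace() substitutes literally, so "$" or "\" in the path is not
  // treated as a back-reference the way it would be by preg_replace().
  // escapeshellarg() quotes the path so that spaces and shell
  // metacharacters reach the helper as a single argument.
  return str_replace('%file%', escapeshellarg($filepath), $helper_command);
}

/**
 * Drop invalid UTF-8 and the control characters that XML 1.0 disallows.
 */
function asa_example_clean_text($text) {
  $text = (string) $text;
  // Discard byte sequences that are not valid UTF-8. Some PHP builds
  // return FALSE here on bad input, so keep the original in that case.
  $converted = @iconv('UTF-8', 'UTF-8//IGNORE', $text);
  if ($converted !== FALSE) {
    $text = $converted;
  }
  // Remove C0 control characters except tab, line feed and carriage return.
  return preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F]/', '', $text);
}

echo asa_example_build_command('cat %file%', '/tmp/my report.txt'), "\n";
echo asa_example_clean_text("page one\x0Cpage two"), "\n";
```

On a Unix-like system the first call yields a command in which the path is wrapped in single quotes, so a file name containing spaces stays one argument. Whether the wider character range is appropriate depends on what the helper programs emit; the form feed is the only character the original function is known to remove.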