function _asa_get_attachment_text in Apache Solr Attachments 5

Parses the attachment and returns only its raw text, stripping any characters that could break XML document processing.
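As a rough illustration of the kind of cleanup described here, the sketch below drops invalid UTF-8 bytes and then removes the control characters that XML 1.0 does not allow. The function name `example_clean_text_for_xml` is hypothetical and not part of the module, and the sketch strips a wider range of control characters than the module's own code, which removes only the form feed. Behavior of `//IGNORE` can vary between iconv implementations, so treat this as a sketch under those assumptions rather than a drop-in replacement.

```php
<?php
/**
 * Hypothetical helper (not part of the module): drop bytes that are not
 * valid UTF-8, then remove control characters that XML 1.0 does not allow.
 */
function example_clean_text_for_xml($text) {
  // Re-encoding UTF-8 to UTF-8 with //IGNORE asks iconv to discard invalid
  // byte sequences. Some iconv builds still fail on bad input, so the
  // notice is suppressed and a failure yields an empty string.
  $cleaned = @iconv('UTF-8', 'UTF-8//IGNORE', (string) $text);
  if ($cleaned === FALSE) {
    return '';
  }
  // Below 0x20, XML 1.0 permits only tab (0x09), LF (0x0A), and CR (0x0D).
  // Everything else in that range, including form feed (0x0C), is removed.
  return preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F]/', '', $cleaned);
}
```

Tabs and line breaks are deliberately kept, since they are legal in XML and usually meaningful in extracted text.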

1 call to _asa_get_attachment_text()
apachesolr_attachments_update_index in ./apachesolr_attachments.module
Hook called by search.module to add items to the search index. In this module it scans content types and indexes any CCK file field of a type we know how to parse, along with any uploaded file attachments.

File

./apachesolr_attachments.module, line 239
Provides a file attachment search implementation for use with the Apache Solr module

Code

function _asa_get_attachment_text($file) {
  $helper_command = _asa_get_file_helper_command($file->filemime);

  // Empty entries in settings mean that helper is disabled.
  if ($helper_command == '') {
    return '';
  }

  // %file% is a token that is placed in the helper's parameter list to represent
  // the file path to the attachment.
  $helper_command = preg_replace('/%file%/', "{$file->filepath}", $helper_command);
  $helper_command = escapeshellcmd($helper_command);
  $text = shell_exec($helper_command);

  // Strip anything that might make the Solr integration barf.
  // Weird control characters make things behave weirdly, especially in XML.
  $cleaned_text = iconv("utf-8", "utf-8//IGNORE", $text);

  // As per robertDouglass - http://drupal.org/node/335871
  // Bad control character. Do we need to make a hook for text cleanup?
  $cleaned_text = preg_replace('/\\x0C/', '', $cleaned_text);
  return $cleaned_text;
}
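
The listing above substitutes the file path into the command and then runs escapeshellcmd() over the whole string, which does not quote spaces in the path. One possible alternative, shown here only as a sketch, is to escape the command template first and then insert the path quoted with escapeshellarg(). The function name `example_build_helper_command` is hypothetical, and the sketch assumes a Unix-like shell and a helper setting that does not already wrap `%file%` in quotes.

```php
<?php
/**
 * Hypothetical helper (not part of the module): substitute the %file% token
 * with a shell-quoted path. Assumes a Unix-like shell and a command template
 * that does not already quote %file%.
 */
function example_build_helper_command($helper_command, $filepath) {
  // Escape shell metacharacters in the configured template first. On
  // Unix-like systems escapeshellcmd() leaves the %file% token unchanged.
  $command = escapeshellcmd($helper_command);
  // Quote the path as a single argument so that spaces do not split it.
  return str_replace('%file%', escapeshellarg($filepath), $command);
}
```

On a Unix-like system, `example_build_helper_command('cat %file%', '/tmp/my file.txt')` returns `cat '/tmp/my file.txt'`. On Windows, escapeshellcmd() treats `%` specially, so the token would need to be handled differently there.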