function apachesolr_attachments_extract_using_tika in Apache Solr Attachments 6.3

Same name and namespace in other branches
  1. 6 apachesolr_attachments.admin.inc \apachesolr_attachments_extract_using_tika()
  2. 6.2 apachesolr_attachments.admin.inc \apachesolr_attachments_extract_using_tika()
  3. 7 apachesolr_attachments.index.inc \apachesolr_attachments_extract_using_tika()

For a file path, try to extract text using a local tika jar.

Throws

Exception

1 call to apachesolr_attachments_extract_using_tika()
apachesolr_attachments_get_attachment_text in ./apachesolr_attachments.index.inc
Parse the attachment getting just the raw text.

File

./apachesolr_attachments.index.inc, line 111
Indexing-related functions.

Code

function apachesolr_attachments_extract_using_tika($filepath) {
  $tika_path = realpath(variable_get('apachesolr_attachments_tika_path', ''));
  $tika = realpath($tika_path . '/' . variable_get('apachesolr_attachments_tika_jar', 'tika-app-1.1.jar'));
  if (!$tika || !is_file($tika)) {
    throw new Exception(t('Invalid path or filename for tika application jar.'));
  }
  $cmd = '';

  // Add a work-around for a MAMP bug + java 1.5.
  // Compare with FALSE explicitly: strpos() returns 0 for a match at offset 0.
  if (strpos(ini_get('extension_dir'), 'MAMP/') !== FALSE) {
    $cmd .= 'export DYLD_LIBRARY_PATH=""; ';
  }

  // Support UTF-8 encoded filenames.
  if (mb_detect_encoding($filepath, 'ASCII,UTF-8', TRUE) == 'UTF-8') {
    $cmd .= 'export LANG="en_US.utf-8"; ';
    setlocale(LC_CTYPE, 'UTF8', 'en_US.UTF-8');
  }

  // By default force UTF-8 output.
  $cmd .= escapeshellcmd(variable_get('apachesolr_attachments_java', 'java'))
    . ' ' . escapeshellarg(variable_get('apachesolr_attachments_java_opts', '-Dfile.encoding=UTF8'))
    . ' -cp ' . escapeshellarg($tika_path)
    . ' -jar ' . escapeshellarg($tika)
    . ' -t ' . escapeshellarg($filepath);
  return shell_exec($cmd);
}
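
The command assembly in the function above can be tried outside Drupal with a minimal sketch like the one below. It is not part of the module: example_build_tika_command() and its parameters are hypothetical names that stand in for the variable_get() lookups, the paths in the usage line are made up, and the UTF-8 check uses preg_match() as a stand-in for mb_detect_encoding() so that the snippet does not need the mbstring extension. It only builds the command string and never runs it.

```php
<?php

/**
 * Hypothetical helper (not in the module): build the Tika command string.
 *
 * The parameters replace the variable_get() lookups of the original function.
 */
function example_build_tika_command($java, $java_opts, $tika_dir, $tika_jar, $filepath) {
  $cmd = '';

  // Stand-in for the mb_detect_encoding() check in the original: the path
  // contains at least one non-ASCII byte and is valid UTF-8 as a whole.
  $is_utf8 = preg_match('/[^\x00-\x7F]/', $filepath) === 1
    && preg_match('//u', $filepath) === 1;
  if ($is_utf8) {
    $cmd .= 'export LANG="en_US.utf-8"; ';
  }

  // Escape the executable as a command and every other piece as one argument.
  $cmd .= escapeshellcmd($java)
    . ' ' . escapeshellarg($java_opts)
    . ' -cp ' . escapeshellarg($tika_dir)
    . ' -jar ' . escapeshellarg($tika_jar)
    . ' -t ' . escapeshellarg($filepath);

  return $cmd;
}

// Usage with assumed paths: print the command instead of executing it.
echo example_build_tika_command(
  'java',
  '-Dfile.encoding=UTF8',
  '/opt/tika',
  '/opt/tika/tika-app-1.1.jar',
  '/tmp/report.pdf'
), "\n";
```

Printing the string before handing it to shell_exec() is a simple way to check the quoting. The exact quote characters produced by escapeshellarg() depend on the platform, and the function may drop non-ASCII bytes when the current locale is not a UTF-8 one, which is presumably why the original calls setlocale() before building the command.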