function apachesolr_attachments_extract_using_tika in Apache Solr Attachments 7
Same name and namespace in other branches
- 6.3 apachesolr_attachments.index.inc \apachesolr_attachments_extract_using_tika()
- 6 apachesolr_attachments.admin.inc \apachesolr_attachments_extract_using_tika()
- 6.2 apachesolr_attachments.admin.inc \apachesolr_attachments_extract_using_tika()
Tries to extract the text of the file at the given path using a local Tika jar.
Throws
Exception: thrown when the configured Tika path and jar filename do not resolve to an existing file.
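Because the function throws rather than returning an error value, a caller will usually want to wrap the call. The following is a minimal sketch of that pattern, not the module's own caller. Both `example_require_tika_jar()` and `example_safe_jar_lookup()` are hypothetical names; the first reproduces only the jar validation step so that the sketch runs outside Drupal.

```php
<?php
// Hypothetical stand-in for the documented function: it reproduces only the
// jar validation step, not the Tika call itself.
function example_require_tika_jar($jar) {
  $resolved = realpath($jar);
  if (!$resolved || !is_file($resolved)) {
    throw new Exception('Invalid path or filename for tika application jar.');
  }
  return $resolved;
}

// Hypothetical caller: converts the exception into an empty result.
function example_safe_jar_lookup($jar) {
  try {
    return example_require_tika_jar($jar);
  }
  catch (Exception $e) {
    // A Drupal 7 caller might log $e->getMessage() with watchdog() here.
    return '';
  }
}
```

Returning an empty string on failure is one possible choice for this sketch; a caller that needs to tell "no text" apart from "misconfigured jar" would let the exception propagate instead.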
1 call to apachesolr_attachments_extract_using_tika()
- apachesolr_attachments_get_attachment_text in ./apachesolr_attachments.index.inc - Parse the attachment getting just the raw text.
File
- ./apachesolr_attachments.index.inc, line 121 - Indexing-related functions.
Code
function apachesolr_attachments_extract_using_tika($filepath) {
  $tika_path = realpath(variable_get('apachesolr_attachments_tika_path', ''));
  $tika = realpath($tika_path . '/' . variable_get('apachesolr_attachments_tika_jar', 'tika-app-1.1.jar'));
  if (!$tika || !is_file($tika)) {
    throw new Exception(t('Invalid path or filename for tika application jar.'));
  }
  $cmd = '';
  // Add a work-around for a MAMP bug + java 1.5.
  if (strpos(ini_get('extension_dir'), 'MAMP/')) {
    $cmd .= 'export DYLD_LIBRARY_PATH=""; ';
  }
  // Support UTF-8 encoded filenames.
  if (mb_detect_encoding($filepath, 'ASCII,UTF-8', TRUE) == 'UTF-8') {
    $cmd .= 'export LANG="en_US.utf-8"; ';
    setlocale(LC_CTYPE, 'UTF8', 'en_US.UTF-8');
  }
  // By default force UTF-8 output.
  $cmd .= escapeshellcmd(variable_get('apachesolr_attachments_java', 'java'))
    . ' ' . escapeshellarg(variable_get('apachesolr_attachments_java_opts', '-Dfile.encoding=UTF8'))
    . ' -cp ' . escapeshellarg($tika_path)
    . ' -jar ' . escapeshellarg($tika)
    . ' -t ' . escapeshellarg($filepath);
  return shell_exec($cmd);
}
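To make the shape of the final command easier to see, here is a small sketch that assembles the same string with every value passed in explicitly. `example_build_tika_command()` is a hypothetical helper, not part of the module, and the paths in the usage line are made up for illustration.

```php
<?php
// Hypothetical helper: parameters replace the variable_get() lookups so the
// sketch runs outside Drupal. Quoting mirrors the documented function.
function example_build_tika_command($java, $java_opts, $tika_dir, $tika_jar, $filepath) {
  return escapeshellcmd($java)
    . ' ' . escapeshellarg($java_opts)
    . ' -cp ' . escapeshellarg($tika_dir)
    . ' -jar ' . escapeshellarg($tika_jar)
    . ' -t ' . escapeshellarg($filepath);
}

// Illustrative values only; the file path contains a space on purpose.
echo example_build_tika_command(
  'java',
  '-Dfile.encoding=UTF8',
  '/opt/tika',
  '/opt/tika/tika-app-1.1.jar',
  '/tmp/my report.pdf'
), "\n";
```

Since the whole options string goes through a single `escapeshellarg()` call, it reaches `java` as one argument. Under that quoting, a setting that holds several space-separated options would not be split into separate arguments.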