public function TikaExtractor::extract in Search API attachments 9.0.x
Same name and namespace in other branches
- 8 src/Plugin/search_api_attachments/TikaExtractor.php \Drupal\search_api_attachments\Plugin\search_api_attachments\TikaExtractor::extract()
Extracts text from a file using the Tika library.
Parameters
\Drupal\file\Entity\File $file: A file object.
Return value
string The text extracted from the file.
Overrides TextExtractorPluginBase::extract
File
- src/Plugin/search_api_attachments/TikaExtractor.php, line 29
Class
- TikaExtractor
- Provides the Tika extractor.
Namespace
Drupal\search_api_attachments\Plugin\search_api_attachments
Code
public function extract(File $file) {
  $output = '';
  $filepath = $this->getRealpath($file->getFileUri());
  $tika = realpath($this->configuration['tika_path']);
  $java = $this->configuration['java_path'];
  // UTF-8 multibyte characters will be stripped by escapeshellarg() for the
  // default C-locale.
  // So temporarily set the locale to UTF-8 so that the filepath remains
  // valid.
  $backup_locale = setlocale(LC_CTYPE, '0');
  setlocale(LC_CTYPE, 'en_US.UTF-8');
  $param = '';
  if ($file->getMimeType() != 'audio/mpeg') {
    $param = ' -Dfile.encoding=UTF8 -cp ' . escapeshellarg($tika);
  }
  // Force running the Tika jar headless.
  $param = ' -Djava.awt.headless=true ' . $param;
  $cmd = $java . $param . ' -jar ' . escapeshellarg($tika) . ' -t ' . escapeshellarg($filepath);
  // strpos() returns 0 for a match at the start of the string, so compare
  // against FALSE explicitly.
  if (strpos(ini_get('extension_dir'), 'MAMP/') !== FALSE) {
    $cmd = 'export DYLD_LIBRARY_PATH=""; ' . $cmd;
  }
  // Restore the locale.
  setlocale(LC_CTYPE, $backup_locale);
  // Support UTF-8 commands:
  // @see http://www.php.net/manual/en/function.shell-exec.php#85095
  shell_exec("LANG=en_US.utf-8");
  $output = shell_exec($cmd);
  if (is_null($output)) {
    throw new \Exception('Tika Extractor is not available.');
  }
  return $output;
}
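The command assembly in extract() can be tried in isolation with a small standalone sketch. The function name build_tika_command() and the sample paths are hypothetical and not part of the module; the sketch only mirrors the string concatenation shown in the listing and does not execute anything.

```php
<?php

/**
 * Hypothetical helper (not part of the module): builds the Tika command
 * string the same way extract() does, without running it.
 */
function build_tika_command(string $java, string $tika, string $filepath, string $mime_type): string {
  $param = '';
  // MP3 files skip the encoding and classpath options, as in extract().
  if ($mime_type != 'audio/mpeg') {
    $param = ' -Dfile.encoding=UTF8 -cp ' . escapeshellarg($tika);
  }
  // Always ask the JVM to run headless.
  $param = ' -Djava.awt.headless=true ' . $param;
  return $java . $param . ' -jar ' . escapeshellarg($tika) . ' -t ' . escapeshellarg($filepath);
}

// Sample values are assumptions for illustration only.
$cmd = build_tika_command('java', '/opt/tika/tika-app.jar', '/tmp/report.pdf', 'application/pdf');
echo $cmd, PHP_EOL;
```

The exact quoting in the printed command depends on the platform, because escapeshellarg() wraps arguments in single quotes on Unix-like systems and in double quotes on Windows. Printing the command this way can help when checking a java_path or tika_path setting by hand before relying on shell_exec().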