public function TikaExtractor::extract in Search API attachments 9.0.x

Same name and namespace in other branches
  1. 8 src/Plugin/search_api_attachments/TikaExtractor.php \Drupal\search_api_attachments\Plugin\search_api_attachments\TikaExtractor::extract()

Extracts the text content of a file using the Tika library.

Parameters

\Drupal\file\Entity\File $file: A file object.

Return value

string The text extracted from the file.

Overrides TextExtractorPluginBase::extract

File

src/Plugin/search_api_attachments/TikaExtractor.php, line 29

Class

TikaExtractor
Provides the Tika extractor.

Namespace

Drupal\search_api_attachments\Plugin\search_api_attachments

Code

public function extract(File $file) {
  $output = '';
  $filepath = $this->getRealpath($file->getFileUri());
  $tika = realpath($this->configuration['tika_path']);
  $java = $this->configuration['java_path'];

  // UTF-8 multibyte characters are stripped by escapeshellarg() under the
  // default C locale, so temporarily switch LC_CTYPE to a UTF-8 locale to
  // keep the file path valid.
  $backup_locale = setlocale(LC_CTYPE, '0');
  setlocale(LC_CTYPE, 'en_US.UTF-8');
  $param = '';
  if ($file->getMimeType() != 'audio/mpeg') {
    $param = ' -Dfile.encoding=UTF8 -cp ' . escapeshellarg($tika);
  }

  // Force running the Tika jar headless.
  $param = ' -Djava.awt.headless=true ' . $param;
  $cmd = $java . $param . ' -jar ' . escapeshellarg($tika) . ' -t ' . escapeshellarg($filepath);
  if (strpos(ini_get('extension_dir'), 'MAMP/') !== FALSE) {
    $cmd = 'export DYLD_LIBRARY_PATH=""; ' . $cmd;
  }

  // Restore the locale.
  setlocale(LC_CTYPE, $backup_locale);

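  // Support UTF-8 commands. An assignment made in its own shell_exec() call
  // does not carry over to the next one, so set LANG in PHP's environment,
  // which the child process inherits.
  // @see http://www.php.net/manual/en/function.shell-exec.php#85095
  putenv('LANG=en_US.utf-8');
  $output = shell_exec($cmd);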
  if (is_null($output)) {
    throw new \Exception('Tika Extractor is not available.');
  }
  return $output;
}
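
The command assembly above can be tried outside Drupal. The following is a minimal sketch, not the module's code: `build_tika_command()` is a hypothetical helper, and the Java binary, jar path, and file path in the usage line are placeholder assumptions. It only builds the command string, switching LC_CTYPE around the `escapeshellarg()` calls the same way, and restores the locale in a `finally` block.

```php
<?php

/**
 * Builds a Tika command line (hypothetical helper, not part of the module).
 */
function build_tika_command(string $java, string $tika, string $filepath, string $mime_type): string {
  // Passing '0' queries the current LC_CTYPE locale without changing it.
  $backup_locale = setlocale(LC_CTYPE, '0');
  // Try a UTF-8 locale so escapeshellarg() keeps multibyte characters.
  // If the locale is not installed, setlocale() returns FALSE and the
  // current locale stays in effect.
  setlocale(LC_CTYPE, 'en_US.UTF-8');

  try {
    // Always run Java headless.
    $param = ' -Djava.awt.headless=true';
    if ($mime_type != 'audio/mpeg') {
      $param .= ' -Dfile.encoding=UTF8 -cp ' . escapeshellarg($tika);
    }
    return $java . $param . ' -jar ' . escapeshellarg($tika) . ' -t ' . escapeshellarg($filepath);
  }
  finally {
    // Restore the original locale even if something above throws.
    setlocale(LC_CTYPE, $backup_locale);
  }
}

// Placeholder paths, for illustration only.
echo build_tika_command('java', '/opt/tika/tika-app.jar', '/tmp/report.pdf', 'application/pdf'), PHP_EOL;
```

Keeping the string building separate from `shell_exec()` makes the quoting easy to inspect before anything is executed. The exact quoting that `escapeshellarg()` produces depends on the platform.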