public function PdftotextExtractor::extract in Search API attachments 9.0.x

Same name and namespace in other branches
  1. 8 src/Plugin/search_api_attachments/PdftotextExtractor.php \Drupal\search_api_attachments\Plugin\search_api_attachments\PdftotextExtractor::extract()

Extracts text from a file with the pdftotext command-line tool.

Parameters

\Drupal\file\Entity\File $file: A file object.

Return value

string The text extracted from the file.

Overrides TextExtractorPluginBase::extract

File

src/Plugin/search_api_attachments/PdftotextExtractor.php, line 29

Class

PdftotextExtractor
Provides pdftotext extractor.

Namespace

Drupal\search_api_attachments\Plugin\search_api_attachments

Code

public function extract(File $file) {
  if (in_array($file->getMimeType(), $this->getPdfMimeTypes())) {
    $output = '';
    $pdftotext_path = $this->configuration['pdftotext_path'];
    $filepath = $this->getRealpath($file->getFileUri());

    // UTF-8 multibyte characters will be stripped by escapeshellarg() under
    // the default C locale, so temporarily set the locale to UTF-8 so that
    // the file path remains valid.
    $backup_locale = setlocale(LC_CTYPE, '0');
    setlocale(LC_CTYPE, 'en_US.UTF-8');

    // The pdftotext documentation states that '-' as the text file sends
    // the text to stdout.
    $cmd = escapeshellcmd($pdftotext_path) . ' ' . escapeshellarg($filepath) . ' -';

    // Restore the locale.
    setlocale(LC_CTYPE, $backup_locale);

    // Support UTF-8 commands.
    // @see http://www.php.net/manual/en/function.shell-exec.php#85095
    shell_exec("LANG=en_US.utf-8");
    $output = shell_exec($cmd);
    if (is_null($output)) {
      throw new \Exception('Pdftotext extractor is not available.');
    }
    return $output;
  }
  else {
    return NULL;
  }
}
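
The command-building step above can be tried outside Drupal. The following is a minimal sketch, not the module's code: the helper names build_pdftotext_command() and with_utf8_lang(), the binary path, and the sample file path are all hypothetical, and a POSIX shell is assumed. It escapes the path under a temporary UTF-8 LC_CTYPE locale and restores the previous locale afterwards. It also shows one way to pass LANG to the child process: each shell_exec() call starts its own shell, so a variable assigned in one call is generally not visible in the next, whereas prefixing the assignment to the command scopes it to that command.

```php
<?php

/**
 * Builds a pdftotext command line that writes the extracted text to stdout.
 *
 * Hypothetical helper for illustration only.
 */
function build_pdftotext_command(string $binary, string $filepath): string {
  // '0' queries the current LC_CTYPE locale without changing it.
  $backup_locale = setlocale(LC_CTYPE, '0');
  // Try UTF-8 locales in order. If none is installed, setlocale() returns
  // FALSE and the locale stays unchanged.
  setlocale(LC_CTYPE, 'en_US.UTF-8', 'C.UTF-8');
  try {
    // The trailing '-' asks pdftotext to write the text to stdout.
    return escapeshellcmd($binary) . ' ' . escapeshellarg($filepath) . ' -';
  }
  finally {
    // Always restore the previous locale, even if escaping throws.
    setlocale(LC_CTYPE, $backup_locale);
  }
}

/**
 * Prefixes a command so that LANG is set for that command only.
 *
 * Hypothetical helper; relies on POSIX shell "VAR=value command" syntax.
 */
function with_utf8_lang(string $cmd): string {
  return 'LANG=en_US.UTF-8 ' . $cmd;
}

// Illustrative paths; neither needs to exist to build the string.
$cmd = with_utf8_lang(build_pdftotext_command('/usr/bin/pdftotext', '/tmp/report 2024.pdf'));
echo $cmd, PHP_EOL;
```

On a POSIX system the sketch prints the assembled command line with the file path wrapped in single quotes. Passing that string to shell_exec() would run pdftotext only if the binary exists at the assumed path.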