public function PythonPdf2txtExtractor::extract in Search API attachments 9.0.x

Same name and namespace in other branches
  1. 8 src/Plugin/search_api_attachments/PythonPdf2txtExtractor.php \Drupal\search_api_attachments\Plugin\search_api_attachments\PythonPdf2txtExtractor::extract()

Extracts text from a PDF file using the Python pdf2txt script.

Parameters

\Drupal\file\Entity\File $file: A file object.

Return value

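string|null The text extracted from the file, or NULL if the file's MIME type is not one of the PDF MIME types. An \Exception is thrown if the command returns no output.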

Overrides TextExtractorPluginBase::extract

File

src/Plugin/search_api_attachments/PythonPdf2txtExtractor.php, line 29

Class

PythonPdf2txtExtractor
Provides a Python pdf2txt extractor.

Namespace

Drupal\search_api_attachments\Plugin\search_api_attachments

Code

public function extract(File $file) {
  if (in_array($file->getMimeType(), $this->getPdfMimeTypes())) {
    $output = '';
    $filepath = $this->getRealpath($file->getFileUri());

    $python_path = $this->configuration['python_path'];
    $python_pdf2txt_script = realpath($this->configuration['python_pdf2txt_script']);

    // UTF-8 multibyte characters will be stripped by escapeshellarg() for
    // the default C-locale.
    // So temporarily set the locale to UTF-8 so that the filepath remains
    // valid.
    $backup_locale = setlocale(LC_CTYPE, '0');
    setlocale(LC_CTYPE, 'en_US.UTF-8');
    $cmd = escapeshellcmd($python_path) . ' ' . escapeshellarg($python_pdf2txt_script) . ' -C -t text ' . escapeshellarg($filepath);

    // Restore the locale.
    setlocale(LC_CTYPE, $backup_locale);

    // Support UTF-8 commands.
    // @see http://www.php.net/manual/en/function.shell-exec.php#85095
    shell_exec("LANG=en_US.utf-8");
    $output = shell_exec($cmd);
    if (is_null($output)) {
      throw new \Exception('Python Pdf2txt Extractor is not available.');
    }
    return $output;
  }
  else {
    return NULL;
  }
}
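
The locale handling in the method above can be isolated into a small helper. The following is a minimal sketch, not code from the Search API attachments module: the function names escape_utf8_shell_arg() and build_pdf2txt_command() are hypothetical, the paths in the usage line are illustrative, and it assumes that the en_US.UTF-8 locale is installed on the host (if it is not, setlocale() returns FALSE and the locale stays unchanged).

```php
<?php

/**
 * Escapes a shell argument while LC_CTYPE is temporarily a UTF-8 locale.
 *
 * Hypothetical helper for illustration; not part of the module.
 */
function escape_utf8_shell_arg(string $arg, string $utf8_locale = 'en_US.UTF-8'): string {
  // Passing '0' returns the current locale without changing it.
  $backup_locale = setlocale(LC_CTYPE, '0');
  // Assumption: $utf8_locale is installed. If not, this call returns FALSE
  // and the current locale is left as it was.
  setlocale(LC_CTYPE, $utf8_locale);
  try {
    return escapeshellarg($arg);
  }
  finally {
    // Restore the previous locale even if escapeshellarg() throws.
    setlocale(LC_CTYPE, $backup_locale);
  }
}

/**
 * Builds the pdf2txt command line in the same shape as extract() does.
 *
 * Hypothetical helper for illustration; not part of the module.
 */
function build_pdf2txt_command(string $python_path, string $script, string $filepath): string {
  return escapeshellcmd($python_path)
    . ' ' . escapeshellarg($script)
    . ' -C -t text '
    . escape_utf8_shell_arg($filepath);
}

// Illustrative paths only.
echo build_pdf2txt_command('python', '/usr/bin/pdf2txt.py', '/tmp/résumé.pdf'), PHP_EOL;
```

Wrapping the restore in a finally block means the process locale is put back even when escaping fails, which matters in a long-running PHP process where a leaked locale change would affect later requests.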