class PdftotextExtractor in Search API attachments 8

Same name and namespace in other branches
  1. 9.0.x src/Plugin/search_api_attachments/PdftotextExtractor.php \Drupal\search_api_attachments\Plugin\search_api_attachments\PdftotextExtractor

Provides pdftotext extractor.

Plugin annotation


@SearchApiAttachmentsTextExtractor(
  id = "pdftotext_extractor",
  label = @Translation("Pdftotext Extractor"),
  description = @Translation("Adds Pdftotext extractor support."),
)

File

src/Plugin/search_api_attachments/PdftotextExtractor.php, line 18

Namespace

Drupal\search_api_attachments\Plugin\search_api_attachments
class PdftotextExtractor extends TextExtractorPluginBase {

  /**
   * Extract file with Pdftotext command line tool.
   *
   * @param \Drupal\file\Entity\File $file
   *   A file object.
   *
   * @return string
   *   The text extracted from the file.
   */
  public function extract(File $file) {
    if (in_array($file->getMimeType(), $this->getPdfMimeTypes())) {
      $output = '';
      $pdftotext_path = $this->configuration['pdftotext_path'];
      $filepath = $this->getRealpath($file->getFileUri());

      // UTF-8 multibyte characters will be stripped by escapeshellarg() for
      // the default C-locale.
      // So temporarily set the locale to UTF-8 so that the filepath remains
      // valid.
      $backup_locale = setlocale(LC_CTYPE, '0');
      setlocale(LC_CTYPE, 'en_US.UTF-8');

      // The pdftotext documentation states that '-' as the text file sends
      // the extracted text to stdout.
      $cmd = escapeshellcmd($pdftotext_path) . ' ' . escapeshellarg($filepath) . ' -';

      // Restore the locale.
      setlocale(LC_CTYPE, $backup_locale);

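      // Support UTF-8 commands. The variable has to be set in the environment
      // of this process so that the child shell inherits it; a separate
      // shell_exec("LANG=...") call only affects its own short-lived shell.
      // @see http://www.php.net/manual/en/function.shell-exec.php#85095
      putenv('LANG=en_US.UTF-8');
      $output = shell_exec($cmd);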
      if (is_null($output)) {
        throw new \Exception('Pdftotext Extractor is not available.');
      }
      return $output;
    }
    else {
      return NULL;
    }
  }

  /**
   * {@inheritdoc}
   */
  public function buildConfigurationForm(array $form, FormStateInterface $form_state) {
    $form['pdftotext_path'] = [
      '#type' => 'textfield',
      '#title' => $this->t('Pdftotext binary'),
      '#description' => $this->t('Enter the name of pdftotext executable or the full path to the pdftotext binary. Example: "pdftotext" or "/usr/bin/pdftotext".'),
      '#default_value' => $this->configuration['pdftotext_path'],
      '#required' => TRUE,
    ];
    return $form;
  }

  /**
   * {@inheritdoc}
   */
  public function validateConfigurationForm(array &$form, FormStateInterface $form_state) {
    $values = $form_state->getValue(['text_extractor_config']);
    $pdftotext_path = $values['pdftotext_path'];
    $is_name = strpos($pdftotext_path, '/') === FALSE && strpos($pdftotext_path, '\\') === FALSE;
    if (!$is_name && !file_exists($pdftotext_path)) {
      $form_state->setError($form['text_extractor_config']['pdftotext_path'], $this->t('The file %path does not exist.', ['%path' => $pdftotext_path]));
    }
  }

  /**
   * {@inheritdoc}
   */
  public function submitConfigurationForm(array &$form, FormStateInterface $form_state) {
    $this->configuration['pdftotext_path'] = $form_state->getValue(['text_extractor_config', 'pdftotext_path']);
    parent::submitConfigurationForm($form, $form_state);
  }

}
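
The command-building step in extract() can be tried in isolation. The following is a minimal sketch, not the module's code: build_pdftotext_command() is a hypothetical helper name, the file path is made up, and it assumes an en_US.UTF-8 locale is installed on the host (if it is not, setlocale() returns FALSE and the locale is left unchanged).

```php
<?php

/**
 * Builds a pdftotext command line that writes the extracted text to stdout.
 *
 * Hypothetical helper for illustration only; it mirrors the quoting and
 * locale handling shown in extract() above.
 */
function build_pdftotext_command(string $binary, string $filepath): string {
  // Remember the current LC_CTYPE locale so it can be restored afterwards.
  $backup_locale = setlocale(LC_CTYPE, '0');

  // Assumption: en_US.UTF-8 exists on this system. With a UTF-8 locale,
  // escapeshellarg() keeps multibyte characters in the file path.
  setlocale(LC_CTYPE, 'en_US.UTF-8');

  // A trailing '-' tells pdftotext to print the text to stdout.
  $cmd = escapeshellcmd($binary) . ' ' . escapeshellarg($filepath) . ' -';

  // Restore the previous locale if it could be read.
  if ($backup_locale !== FALSE) {
    setlocale(LC_CTYPE, $backup_locale);
  }

  return $cmd;
}

// Example with a made-up path that contains a space.
$cmd = build_pdftotext_command('pdftotext', '/tmp/report 2020.pdf');
echo $cmd, PHP_EOL;
```

The returned string could then be passed to shell_exec(). Quoting the path with escapeshellarg() is what keeps spaces and shell metacharacters in file names from being interpreted by the shell.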

Members

| Name | Modifiers | Type | Description | Overrides |
| --- | --- | --- | --- | --- |
| DependencySerializationTrait::$_entityStorages | protected | property | An array of entity type IDs keyed by the property name of their storages. | |
| DependencySerializationTrait::$_serviceIds | protected | property | An array of service IDs keyed by property name used for serialization. | |
| DependencySerializationTrait::__sleep | public | function | | 1 |
| DependencySerializationTrait::__wakeup | public | function | | 2 |
| MessengerTrait::messenger | public | function | Gets the messenger. | 29 |
| MessengerTrait::setMessenger | public | function | Sets the messenger. | |
| PdftotextExtractor::buildConfigurationForm | public | function | Form constructor. | Overrides PluginFormInterface::buildConfigurationForm |
| PdftotextExtractor::extract | public | function | Extract file with Pdftotext command line tool. | Overrides TextExtractorPluginBase::extract |
| PdftotextExtractor::submitConfigurationForm | public | function | Form submission handler. | Overrides TextExtractorPluginBase::submitConfigurationForm |
| PdftotextExtractor::validateConfigurationForm | public | function | Form validation handler. | Overrides TextExtractorPluginBase::validateConfigurationForm |
| PluginBase::$configuration | protected | property | Configuration information passed into the plugin. | 1 |
| PluginBase::$pluginDefinition | protected | property | The plugin implementation definition. | 1 |
| PluginBase::$pluginId | protected | property | The plugin_id. | |
| PluginBase::DERIVATIVE_SEPARATOR | | constant | A string which is used to separate base plugin IDs from the derivative ID. | |
| PluginBase::getBaseId | public | function | Gets the base_plugin_id of the plugin instance. | Overrides DerivativeInspectionInterface::getBaseId |
| PluginBase::getDerivativeId | public | function | Gets the derivative_id of the plugin instance. | Overrides DerivativeInspectionInterface::getDerivativeId |
| PluginBase::getPluginDefinition | public | function | Gets the definition of the plugin implementation. | Overrides PluginInspectionInterface::getPluginDefinition 3 |
| PluginBase::getPluginId | public | function | Gets the plugin_id of the plugin instance. | Overrides PluginInspectionInterface::getPluginId |
| PluginBase::isConfigurable | public | function | Determines if the plugin is configurable. | |
| StringTranslationTrait::$stringTranslation | protected | property | The string translation service. | 1 |
| StringTranslationTrait::formatPlural | protected | function | Formats a string containing a count of items. | |
| StringTranslationTrait::getNumberOfPlurals | protected | function | Returns the number of plurals supported by a given language. | |
| StringTranslationTrait::getStringTranslation | protected | function | Gets the string translation service. | |
| StringTranslationTrait::setStringTranslation | public | function | Sets the string translation service to use. | 2 |
| StringTranslationTrait::t | protected | function | Translates a string to the current language or to a given language. | |
| TextExtractorPluginBase::$configFactory | protected | property | Config factory service. | |
| TextExtractorPluginBase::$messenger | protected | property | The messenger. | Overrides MessengerTrait::$messenger |
| TextExtractorPluginBase::$mimeTypeGuesser | protected | property | Mime type guesser service. | |
| TextExtractorPluginBase::$streamWrapperManager | protected | property | Stream wrapper manager service. | |
| TextExtractorPluginBase::calculateDependencies | public | function | | |
| TextExtractorPluginBase::CONFIGNAME | | constant | Name of the config being edited. | |
| TextExtractorPluginBase::create | public static | function | Creates an instance of the plugin. | Overrides ContainerFactoryPluginInterface::create 2 |
| TextExtractorPluginBase::defaultConfiguration | public | function | Gets default configuration for this plugin. | Overrides ConfigurableInterface::defaultConfiguration |
| TextExtractorPluginBase::getConfiguration | public | function | Gets this plugin's configuration. | Overrides ConfigurableInterface::getConfiguration |
| TextExtractorPluginBase::getmessenger | public | function | | |
| TextExtractorPluginBase::getPdfMimeTypes | public | function | Helper method to get the PDF MIME types. | |
| TextExtractorPluginBase::getRealpath | public | function | Helper method to get the real path from an uri. | |
| TextExtractorPluginBase::setConfiguration | public | function | Sets the configuration for this plugin instance. | Overrides ConfigurableInterface::setConfiguration |
| TextExtractorPluginBase::__construct | public | function | Constructs a \Drupal\Component\Plugin\PluginBase object. | Overrides PluginBase::__construct 2 |
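
The path check performed by PdftotextExtractor::validateConfigurationForm can also be looked at on its own. The sketch below is an illustration under assumptions, not the module's API: is_bare_command_name() and validate_binary_setting() are hypothetical names, and the error text is a plain string rather than a translated Drupal message. It shows the rule used in the source: a value without a slash or backslash is treated as a command name to be resolved through the system path, and only values that look like paths are checked with file_exists().

```php
<?php

/**
 * Returns TRUE when $value looks like a bare command name, not a path.
 *
 * Hypothetical helper mirroring the $is_name check in the class source.
 */
function is_bare_command_name(string $value): bool {
  return strpos($value, '/') === FALSE && strpos($value, '\\') === FALSE;
}

/**
 * Returns an error message for the configured binary, or NULL if it passes.
 *
 * Hypothetical helper; the real method reports errors on the form state.
 */
function validate_binary_setting(string $value): ?string {
  // Bare names such as "pdftotext" are accepted without a file system check.
  if (!is_bare_command_name($value) && !file_exists($value)) {
    return sprintf('The file %s does not exist.', $value);
  }
  return NULL;
}

var_dump(is_bare_command_name('pdftotext'));
var_dump(is_bare_command_name('/usr/bin/pdftotext'));
var_dump(validate_binary_setting('/no/such/dir/pdftotext'));
```

One consequence of this rule is that a misspelled bare name passes validation, and the problem only surfaces later when extract() runs the command.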