class FeedsExQueryPathHtml in Feeds extensible parsers 7.2

Same name and namespace in other branches
  1. 7 src/FeedsExQueryPathHtml.inc \FeedsExQueryPathHtml

Parses HTML documents with QueryPath.

@todo Make convertEncoding() into a helper function so that it isn't copied in 2 places.

Hierarchy

Expanded class hierarchy of FeedsExQueryPathHtml

5 string references to 'FeedsExQueryPathHtml'
FeedsExQueryPathHtmlUnitTests::testAttributeParsing in src/Tests/FeedsExQueryPathHtml.test
Tests grabbing an attribute.
FeedsExQueryPathHtmlUnitTests::testCP866Encoded in src/Tests/FeedsExQueryPathHtml.test
Tests parsing a CP866 (Russian) encoded file.
FeedsExQueryPathHtmlUnitTests::testEUCJPEncodedNoDeclaration in src/Tests/FeedsExQueryPathHtml.test
Tests an EUC-JP (Japanese) encoded file without the encoding declaration.
FeedsExQueryPathHtmlUnitTests::testSimpleParsing in src/Tests/FeedsExQueryPathHtml.test
Tests simple parsing.
feeds_ex_feeds_plugins in ./feeds_ex.feeds.inc
Implements hook_feeds_plugins().

File

src/FeedsExQueryPathHtml.inc, line 14
Contains FeedsExQueryPathHtml.

View source
class FeedsExQueryPathHtml extends FeedsExQueryPathXml {

  /**
   * {@inheritdoc}
   */
  protected function setUp(FeedsSource $source, FeedsFetcherResult $fetcher_result) {

    // Change some parser settings.
    $this->queryPathOptions['use_parser'] = 'html';
  }

  /**
   * {@inheritdoc}
   */
  protected function getRawValue(QueryPath $node) {
    return $node->html();
  }

  /**
   * {@inheritdoc}
   */
  protected function convertEncoding($data, $encoding = 'UTF-8') {

    // Check for an encoding declaration.
    $matches = FALSE;
    if (preg_match('/<meta[^>]+charset\\s*=\\s*["\']?([\\w-]+)\\b/i', $data, $matches)) {
      $encoding = $matches[1];
    }
    elseif ($detected = parent::detectEncoding($data)) {
      $encoding = $detected;
    }

    // Unsupported encodings are converted here into UTF-8.
    $php_supported = array(
      'utf-8',
      'us-ascii',
      'ascii',
    );
    if (in_array(strtolower($encoding), $php_supported)) {
      return $data;
    }
    $data = parent::convertEncoding($data, $encoding);
    if ($matches) {
      $data = preg_replace('/(<meta[^>]+charset\\s*=\\s*["\']?)([\\w-]+)\\b/i', '$1UTF-8', $data, 1);
    }
    return $data;
  }

  /**
   * {@inheritdoc}
   */
  protected function prepareDocument(FeedsSource $source, FeedsFetcherResult $fetcher_result) {
    $raw = $fetcher_result->getRaw();
    if (!strlen(trim($raw))) {
      throw new FeedsExEmptyException();
    }
    $raw = $this->convertEncoding($raw);
    if ($this->config['use_tidy'] && extension_loaded('tidy')) {
      $raw = tidy_repair_string($raw, $this->getTidyConfig(), 'utf8');
    }
    return FeedsExXmlUtility::createHtmlDocument($raw);
  }

  /**
   * {@inheritdoc}
   */
  protected function getTidyConfig() {
    return array(
      'merge-divs' => FALSE,
      'merge-spans' => FALSE,
      'join-styles' => FALSE,
      'drop-empty-paras' => FALSE,
      'wrap' => 0,
      'tidy-mark' => FALSE,
      'escape-cdata' => TRUE,
      'word-2000' => TRUE,
    );
  }

}
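
The charset handling in convertEncoding() can be tried outside Drupal. What follows is a minimal standalone sketch, not the module's code: the helper name sketch_convert_html_encoding() is hypothetical, mb_convert_encoding() (from the mbstring extension, assumed to be loaded) stands in for parent::convertEncoding(), and the fallback to parent::detectEncoding() is left out.

```php
<?php

/**
 * Standalone sketch of the meta-charset handling in convertEncoding().
 *
 * Hypothetical helper, not part of Feeds extensible parsers. Assumes the
 * mbstring extension is loaded.
 */
function sketch_convert_html_encoding($data) {
  $pattern = '/(<meta[^>]+charset\s*=\s*["\']?)([\w-]+)\b/i';

  // Without a declaration this sketch returns the data unchanged. The class
  // above falls back to parent::detectEncoding() at this point instead.
  if (!preg_match($pattern, $data, $matches)) {
    return $data;
  }
  $encoding = $matches[2];

  // Declared as UTF-8 or ASCII: returned unchanged, as in the class above.
  if (in_array(strtolower($encoding), array('utf-8', 'us-ascii', 'ascii'))) {
    return $data;
  }

  // Stand-in for parent::convertEncoding().
  $data = mb_convert_encoding($data, 'UTF-8', $encoding);

  // Rewrite the first declaration so it matches the converted bytes.
  return preg_replace($pattern, '$1UTF-8', $data, 1);
}

// "e acute" in ISO-8859-1 is the single byte 0xE9.
$html = "<html><head><meta charset=\"ISO-8859-1\"></head><body>caf\xE9</body></html>";
$converted = sketch_convert_html_encoding($html);
```

After the call, $converted declares charset="UTF-8" and the body holds the two-byte UTF-8 sequence for the accented character. Rewriting the declaration matters because an HTML parser that honours a stale charset declaration would decode the already-converted bytes a second time.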

Members

Name | Modifiers | Type | Description | Overrides
--- | --- | --- | --- | ---
FeedsExBase::$isMultibyte | protected | property | Whether the current system handles mb_* functions. |
FeedsExBase::$messenger | protected | property | The object used to display messages to the user. |
FeedsExBase::configFormValidate | public | function | |
FeedsExBase::debug | protected | function | Renders our debug messages into a list. |
FeedsExBase::delegateParsing | protected | function | Delegates parsing to the subclass. |
FeedsExBase::detectEncoding | protected | function | Detects the encoding of a string. |
FeedsExBase::executeSources | protected | function | Executes the source expressions. |
FeedsExBase::getFormHeader | protected | function | Returns the configuration form table header. |
FeedsExBase::getMappingSources | public | function | |
FeedsExBase::getMessenger | public | function | Returns the messenger. |
FeedsExBase::hasConfigForm | public | function | |
FeedsExBase::hasConfigurableContext | protected | function | Returns whether or not this parser uses a context query. | 2
FeedsExBase::hasSourceConfig | public | function | |
FeedsExBase::loadLibrary | protected | function | Loads the necessary library. | 2
FeedsExBase::logErrors | protected | function | Logs errors. |
FeedsExBase::parse | public | function | |
FeedsExBase::prepareExpressions | protected | function | Prepares the expressions for parsing. |
FeedsExBase::prepareVariables | protected | function | Prepares the variable map used for substitution. |
FeedsExBase::printErrors | protected | function | Prints errors to the screen. |
FeedsExBase::setMessenger | public | function | Sets the messenger to be used to display messages. |
FeedsExBase::setMultibyte | public | function | Sets the multibyte handling. |
FeedsExBase::sourceDefaults | public | function | |
FeedsExBase::sourceForm | public | function | |
FeedsExBase::sourceFormValidate | public | function | |
FeedsExBase::sourceSave | public | function | |
FeedsExBase::__construct | protected | function | | 1
FeedsExQueryPathHtml::convertEncoding | protected | function | Converts a string to UTF-8. | Overrides FeedsExXml::convertEncoding
FeedsExQueryPathHtml::getRawValue | protected | function | Returns the raw value. | Overrides FeedsExQueryPathXml::getRawValue
FeedsExQueryPathHtml::getTidyConfig | protected | function | Returns the options for phptidy. | Overrides FeedsExXml::getTidyConfig
FeedsExQueryPathHtml::prepareDocument | protected | function | Prepares the DOM document. | Overrides FeedsExXml::prepareDocument
FeedsExQueryPathHtml::setUp | protected | function | Allows subclasses to prepare for parsing. | Overrides FeedsExXml::setUp
FeedsExQueryPathXml::$queryPathOptions | protected | property | Options passed to QueryPath. |
FeedsExQueryPathXml::configFormTableColumn | protected | function | Returns a form element for a specific column. | Overrides FeedsExXml::configFormTableColumn
FeedsExQueryPathXml::configFormTableHeader | protected | function | Returns the list of table headers. | Overrides FeedsExXml::configFormTableHeader
FeedsExQueryPathXml::executeContext | protected | function | Returns rows to be parsed. | Overrides FeedsExXml::executeContext
FeedsExQueryPathXml::executeSourceExpression | protected | function | Executes a single source expression. | Overrides FeedsExXml::executeSourceExpression
FeedsExQueryPathXml::validateExpression | protected | function | Validates an expression. | Overrides FeedsExXml::validateExpression
FeedsExXml::$entityLoader | protected | property | The previous value for the entity loader. |
FeedsExXml::$handleXmlErrors | protected | property | The previous value for XML error handling. |
FeedsExXml::$xpath | protected | property | The FeedsExXpathDomXpath object used for parsing. |
FeedsExXml::cleanUp | protected | function | Allows subclasses to cleanup after parsing. | Overrides FeedsExBase::cleanUp
FeedsExXml::configDefaults | public | function | | Overrides FeedsExBase::configDefaults
FeedsExXml::configForm | public | function | | Overrides FeedsExBase::configForm
FeedsExXml::getErrors | protected | function | Returns the errors after parsing. | Overrides FeedsExBase::getErrors
FeedsExXml::getRaw | protected | function | Returns the raw XML of a DOM node. | 1
FeedsExXml::startErrorHandling | protected | function | Starts internal error handling. | Overrides FeedsExBase::startErrorHandling
FeedsExXml::stopErrorHandling | protected | function | Stops internal error handling. | Overrides FeedsExBase::stopErrorHandling