class FeedsExQueryPathHtml in Feeds extensible parsers 7.2
Same name and namespace in other branches
- 7 src/FeedsExQueryPathHtml.inc \FeedsExQueryPathHtml
Parses HTML documents with QueryPath.
@todo Make convertEncoding() into a helper function so that it isn't copied in 2 places.
Hierarchy
- class \FeedsExBase extends \FeedsParser
  - class \FeedsExXml
    - class \FeedsExQueryPathXml
      - class \FeedsExQueryPathHtml
5 string references to 'FeedsExQueryPathHtml'
- FeedsExQueryPathHtmlUnitTests::testAttributeParsing in src/Tests/FeedsExQueryPathHtml.test - Tests grabbing an attribute.
- FeedsExQueryPathHtmlUnitTests::testCP866Encoded in src/Tests/FeedsExQueryPathHtml.test - Tests parsing a CP866 (Russian) encoded file.
- FeedsExQueryPathHtmlUnitTests::testEUCJPEncodedNoDeclaration in src/Tests/FeedsExQueryPathHtml.test - Tests an EUC-JP (Japanese) encoded file without the encoding declaration.
- FeedsExQueryPathHtmlUnitTests::testSimpleParsing in src/Tests/FeedsExQueryPathHtml.test - Tests simple parsing.
- feeds_ex_feeds_plugins in ./feeds_ex.feeds.inc - Implements hook_feeds_plugins().
File
- src/FeedsExQueryPathHtml.inc, line 14 - Contains FeedsExQueryPathHtml.
View source
class FeedsExQueryPathHtml extends FeedsExQueryPathXml {

  /**
   * {@inheritdoc}
   */
  protected function setUp(FeedsSource $source, FeedsFetcherResult $fetcher_result) {
    // Change some parser settings.
    $this->queryPathOptions['use_parser'] = 'html';
  }

  /**
   * {@inheritdoc}
   */
  protected function getRawValue(QueryPath $node) {
    return $node->html();
  }

  /**
   * {@inheritdoc}
   */
  protected function convertEncoding($data, $encoding = 'UTF-8') {
    // Check for an encoding declaration.
    $matches = FALSE;
    if (preg_match('/<meta[^>]+charset\\s*=\\s*["\']?([\\w-]+)\\b/i', $data, $matches)) {
      $encoding = $matches[1];
    }
    elseif ($detected = parent::detectEncoding($data)) {
      $encoding = $detected;
    }

    // Unsupported encodings are converted here into UTF-8.
    $php_supported = array(
      'utf-8',
      'us-ascii',
      'ascii',
    );
    if (in_array(strtolower($encoding), $php_supported)) {
      return $data;
    }

    $data = parent::convertEncoding($data, $encoding);
    if ($matches) {
      $data = preg_replace('/(<meta[^>]+charset\\s*=\\s*["\']?)([\\w-]+)\\b/i', '$1UTF-8', $data, 1);
    }
    return $data;
  }

  /**
   * {@inheritdoc}
   */
  protected function prepareDocument(FeedsSource $source, FeedsFetcherResult $fetcher_result) {
    $raw = $fetcher_result->getRaw();
    if (!strlen(trim($raw))) {
      throw new FeedsExEmptyException();
    }

    $raw = $this->convertEncoding($raw);
    if ($this->config['use_tidy'] && extension_loaded('tidy')) {
      $raw = tidy_repair_string($raw, $this->getTidyConfig(), 'utf8');
    }
    return FeedsExXmlUtility::createHtmlDocument($raw);
  }

  /**
   * {@inheritdoc}
   */
  protected function getTidyConfig() {
    return array(
      'merge-divs' => FALSE,
      'merge-spans' => FALSE,
      'join-styles' => FALSE,
      'drop-empty-paras' => FALSE,
      'wrap' => 0,
      'tidy-mark' => FALSE,
      'escape-cdata' => TRUE,
      'word-2000' => TRUE,
    );
  }

}
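The encoding handling in convertEncoding() above can be tried outside of Drupal with a minimal sketch like the following. It is not the module's code: the function name sketch_convert_html_encoding() is hypothetical, mb_detect_encoding() and mb_convert_encoding() are assumed stand-ins for the parent class's detectEncoding() and convertEncoding(), the candidate encoding list is illustrative, and the mbstring extension is assumed to be loaded.

```php
<?php

/**
 * Hypothetical stand-alone helper mirroring the flow of convertEncoding().
 */
function sketch_convert_html_encoding($data, $encoding = 'UTF-8') {
  // Group 1 captures everything up to the charset value, group 2 the value.
  $pattern = '/(<meta[^>]+charset\s*=\s*["\']?)([\w-]+)\b/i';

  // Prefer an explicit charset declaration in a <meta> tag.
  $matches = array();
  $declared = preg_match($pattern, $data, $matches) === 1;
  if ($declared) {
    $encoding = $matches[2];
  }
  else {
    // Assumption: mb_detect_encoding() stands in for detectEncoding().
    $detected = mb_detect_encoding($data, array('UTF-8', 'EUC-JP', 'ISO-8859-1'), TRUE);
    if ($detected) {
      $encoding = $detected;
    }
  }

  // Nothing to do when the data is already UTF-8 or plain ASCII.
  if (in_array(strtolower($encoding), array('utf-8', 'us-ascii', 'ascii'), TRUE)) {
    return $data;
  }

  // Assumption: mb_convert_encoding() stands in for parent::convertEncoding().
  $data = mb_convert_encoding($data, 'UTF-8', $encoding);

  // Rewrite the first declaration so it matches the converted bytes.
  if ($declared) {
    $data = preg_replace($pattern, '$1UTF-8', $data, 1);
  }
  return $data;
}
```

Passing an ISO-8859-1 document that declares its charset in a meta tag should return UTF-8 bytes with the declaration rewritten to UTF-8, so that the HTML parser does not reinterpret the converted data under the old charset. Note that on PHP 8 mb_convert_encoding() throws a ValueError for encoding names it does not recognize, which this sketch does not handle.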
Members
Name | Modifiers | Type | Description | Overrides |
---|---|---|---|---|
FeedsExBase:: | protected | property | Whether the current system handles mb_* functions. | |
FeedsExBase:: | protected | property | The object used to display messages to the user. | |
FeedsExBase:: | public | function | | |
FeedsExBase:: | protected | function | Renders our debug messages into a list. | |
FeedsExBase:: | protected | function | Delegates parsing to the subclass. | |
FeedsExBase:: | protected | function | Detects the encoding of a string. | |
FeedsExBase:: | protected | function | Executes the source expressions. | |
FeedsExBase:: | protected | function | Returns the configuration form table header. | |
FeedsExBase:: | public | function | | |
FeedsExBase:: | public | function | Returns the messenger. | |
FeedsExBase:: | public | function | | |
FeedsExBase:: | protected | function | Returns whether or not this parser uses a context query. | 2 |
FeedsExBase:: | public | function | | |
FeedsExBase:: | protected | function | Loads the necessary library. | 2 |
FeedsExBase:: | protected | function | Logs errors. | |
FeedsExBase:: | public | function | | |
FeedsExBase:: | protected | function | Prepares the expressions for parsing. | |
FeedsExBase:: | protected | function | Prepares the variable map used for substitution. | |
FeedsExBase:: | protected | function | Prints errors to the screen. | |
FeedsExBase:: | public | function | Sets the messenger to be used to display messages. | |
FeedsExBase:: | public | function | Sets the multibyte handling. | |
FeedsExBase:: | public | function | | |
FeedsExBase:: | public | function | | |
FeedsExBase:: | public | function | | |
FeedsExBase:: | public | function | | |
FeedsExBase:: | protected | function | 1 | |
FeedsExQueryPathHtml:: | protected | function | Converts a string to UTF-8. Overrides FeedsExXml:: | |
FeedsExQueryPathHtml:: | protected | function | Returns the raw value. Overrides FeedsExQueryPathXml:: | |
FeedsExQueryPathHtml:: | protected | function | Returns the options for phptidy. Overrides FeedsExXml:: | |
FeedsExQueryPathHtml:: | protected | function | Prepares the DOM document. Overrides FeedsExXml:: | |
FeedsExQueryPathHtml:: | protected | function | Allows subclasses to prepare for parsing. Overrides FeedsExXml:: | |
FeedsExQueryPathXml:: | protected | property | Options passed to QueryPath. | |
FeedsExQueryPathXml:: | protected | function | Returns a form element for a specific column. Overrides FeedsExXml:: | |
FeedsExQueryPathXml:: | protected | function | Returns the list of table headers. Overrides FeedsExXml:: | |
FeedsExQueryPathXml:: | protected | function | Returns rows to be parsed. Overrides FeedsExXml:: | |
FeedsExQueryPathXml:: | protected | function | Executes a single source expression. Overrides FeedsExXml:: | |
FeedsExQueryPathXml:: | protected | function | Validates an expression. Overrides FeedsExXml:: | |
FeedsExXml:: | protected | property | The previous value for the entity loader. | |
FeedsExXml:: | protected | property | The previous value for XML error handling. | |
FeedsExXml:: | protected | property | The FeedsExXpathDomXpath object used for parsing. | |
FeedsExXml:: | protected | function | Allows subclasses to cleanup after parsing. Overrides FeedsExBase:: | |
FeedsExXml:: | public | function | Overrides FeedsExBase:: | |
FeedsExXml:: | public | function | Overrides FeedsExBase:: | |
FeedsExXml:: | protected | function | Returns the errors after parsing. Overrides FeedsExBase:: | |
FeedsExXml:: | protected | function | Returns the raw XML of a DOM node. | 1 |
FeedsExXml:: | protected | function | Starts internal error handling. Overrides FeedsExBase:: | |
FeedsExXml:: | protected | function | Stops internal error handling. Overrides FeedsExBase:: | |