public static function WordPressBlog::preprocessFile in WordPress Migrate 7.2
WXR files typically need some cleanup to be successfully parsed - perform that here.
Parameters
$sourcefile: The raw WXR file as uploaded.
$destination: Filespec to which to write the cleaned-up WXR file. Not used when $namespaces_only == TRUE; because the parameter has no default value, a placeholder such as NULL must still be passed.
bool $unlink: Indicates whether $sourcefile will be deleted after preprocessing.
bool $namespaces_only: When TRUE, do not rewrite the file, simply gather and return the namespaces.
Return value
array List of referenced namespaces, keyed by prefix.
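To make the shape of the return value concrete, here is a minimal sketch that applies the same xmlns regular expression used in the function body to a header string. The helper name example_collect_namespaces() and the sample header are assumptions made for illustration; they are not part of WordPress Migrate.

```php
<?php
/**
 * Hypothetical helper (not part of WordPress Migrate): collect the xmlns
 * declarations found in a WXR header string, keyed by prefix.
 */
function example_collect_namespaces($header) {
  // Same pattern as in preprocessFile(): capture the prefix and the URI.
  preg_match_all('|xmlns:(.+?)="(.+?)"|i', $header, $matches, PREG_SET_ORDER);
  $namespaces = array();
  foreach ($matches as $match) {
    $namespaces[$match[1]] = $match[2];
  }
  return $namespaces;
}

// Made-up header fragment, for illustration only.
$header = '<rss version="2.0"' . "\n"
  . "\t" . 'xmlns:content="http://purl.org/rss/1.0/modules/content/"' . "\n"
  . "\t" . 'xmlns:wp="http://wordpress.org/export/1.2/"' . "\n"
  . '>';

$namespaces = example_collect_namespaces($header);
print_r($namespaces);
// The array has two entries: 'content' and 'wp', each mapped to its URI.
```

A caller would then look up a prefix such as $namespaces['wp'] to find the URI declared for it in the export.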
2 calls to WordPressBlog::preprocessFile()
- WordPressMigrateWizard::sourceDataFormValidate in ./wordpress_migrate.migrate.inc - Fetch and preprocess the uploaded WXR file.
- wordpress_migrate_update_7015 in ./wordpress_migrate.install - Updates to legacy wordpress migration arguments.
File
- ./wordpress.inc, line 357 - Implementation of migration from WordPress into Drupal
Class
WordPressBlog
Code
public static function preprocessFile($sourcefile, $destination, $unlink = TRUE, $namespaces_only = FALSE) {
  // Clean up the file in the process of moving it to its final destination.
  $source_handle = fopen($sourcefile, 'r');
  if (!$namespaces_only) {
    $dest_handle = fopen($destination, 'w');
  }

  // First, get the header (everything up to and including the line with the
  // <channel> element) to rewrite the namespaces, skipping any empty lines.
  $header = '';
  while (($line = fgets($source_handle)) !== FALSE) {
    if (trim($line)) {
      $header .= $line;
      if (strpos($line, '<channel>') !== FALSE) {
        break;
      }
    }
  }

  // The excerpt namespace is sometimes omitted, stuff it in if necessary.
  $excerpt_ns = 'xmlns:excerpt="http://wordpress.org/export/1.0/excerpt/"';
  $excerpt_signature = 'xmlns:excerpt="http://wordpress.org/export/';
  $content_ns = 'xmlns:content="http://purl.org/rss/1.0/modules/content/"';
  // strpos() returns FALSE when there is no match and may return 0 for a
  // match at the start of the string, so compare strictly.
  if (strpos($header, $excerpt_signature) === FALSE) {
    $header = str_replace($content_ns, $excerpt_ns . "\n\t" . $content_ns, $header);
  }

  // Add the Atom namespace, in case it's referenced.
  $atom_ns = 'xmlns:atom="http://www.w3.org/2005/Atom"';
  $header = str_replace($content_ns, $atom_ns . "\n\t" . $content_ns, $header);

  // What the hell, throw in iTunes too.
  $itunes_ns = 'xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd"';
  $header = str_replace($content_ns, $itunes_ns . "\n\t" . $content_ns, $header);

  // Gather all declared namespaces, keyed by prefix.
  preg_match_all('|xmlns:(.+?)="(.+?)"|i', $header, $matches, PREG_SET_ORDER);
  $namespaces = array();
  foreach ($matches as $match) {
    $namespaces[$match[1]] = $match[2];
  }

  if ($namespaces_only) {
    // Nothing is rewritten in this mode; release the source handle.
    fclose($source_handle);
    return $namespaces;
  }

  // Replace HTML entities with XML entities.
  $header = strtr($header, self::$entityReplacements);
  fputs($dest_handle, $header);

  // Now fix unencoded ampersands and bogus characters on a line-by-line
  // basis. Compare against FALSE so a line such as "0" is not mistaken for
  // end of file.
  while (($line = fgets($source_handle)) !== FALSE) {
    // Handle unencoded ampersands.
    $line = preg_replace('/&(?![\w\d#]+;)/', '&amp;', $line);
    // Remove control characters (the regex removes the newline, so tack it
    // back on).
    $line = preg_replace('~\p{C}+~u', '', $line) . "\n";
    // WordPress export doesn't properly format embedded CDATA sections - our
    // quick-and-dirty fix is to remove the terminator of the embedded section.
    $line = preg_replace('|// \]\]|', '', $line);
    // Replace HTML entities with XML entities.
    $line = strtr($line, self::$entityReplacements);
    fputs($dest_handle, $line);
  }

  fclose($dest_handle);
  fclose($source_handle);
  if ($unlink) {
    unlink($sourcefile);
  }
  return $namespaces;
}
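
The per-line fix-ups can be tried in isolation. The following is a minimal sketch, assuming valid UTF-8 input; the helper name example_clean_line() is hypothetical and not part of WordPress Migrate, and the self::$entityReplacements step is left out because that table is not shown on this page.

```php
<?php
/**
 * Hypothetical helper (not part of WordPress Migrate): apply the regex-based
 * per-line fix-ups from preprocessFile() to a single line.
 */
function example_clean_line($line) {
  // Encode ampersands that do not already begin an entity such as &amp;
  // or &#38;.
  $line = preg_replace('/&(?![\w\d#]+;)/', '&amp;', $line);
  // Strip characters in the Unicode "Other" (C) categories. This also strips
  // the trailing newline, so append one afterwards.
  $line = preg_replace('~\p{C}+~u', '', $line) . "\n";
  // Drop the "// ]]" terminator of an embedded CDATA section.
  $line = preg_replace('|// \]\]|', '', $line);
  return $line;
}

// A bare ampersand, an existing entity, and a BEL control character.
echo example_clean_line("Tom & Jerry &amp; friends\x07\n");
// prints "Tom &amp; Jerry &amp; friends" followed by a newline

// An embedded CDATA terminator: only the "// ]]" part is removed.
echo example_clean_line("x // ]]> y\n");
// prints "x > y" followed by a newline
```

Note that \p{C} also matches tab characters, so any tabs inside a line are removed along with other control characters.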