public static function WordPressBlog::preprocessFile in WordPress Migrate 7.2

WXR files typically need some cleanup before they can be parsed successfully; perform that cleanup here.

Parameters

$sourcefile: The raw WXR file as uploaded.

$destination: Filespec to which to write the cleaned-up WXR file. Omit when $namespaces_only == TRUE.

bool $unlink: Indicates whether $sourcefile will be deleted after preprocessing.

bool $namespaces_only: When TRUE, do not rewrite the file, simply gather and return the namespaces.

Return value

array List of referenced namespaces, keyed by prefix.
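
As a rough illustration of the shape of the return value, the standalone sketch below applies the same xmlns pattern used in the function body to a made-up header string. The helper name example_gather_namespaces() and the sample header are assumptions for illustration only; they are not part of the module.

```php
<?php
// Hypothetical helper (not part of wordpress_migrate): collects the xmlns
// declarations found in a header string, keyed by prefix.
function example_gather_namespaces($header) {
  preg_match_all('|xmlns:(.+?)="(.+?)"|i', $header, $matches, PREG_SET_ORDER);
  $namespaces = array();
  foreach ($matches as $match) {
    // $match[1] is the prefix, $match[2] is the namespace URI.
    $namespaces[$match[1]] = $match[2];
  }
  return $namespaces;
}

// Made-up sample header, for illustration only.
$header = '<rss version="2.0"' . "\n"
  . "\t" . 'xmlns:content="http://purl.org/rss/1.0/modules/content/"' . "\n"
  . "\t" . 'xmlns:wp="http://wordpress.org/export/1.2/"' . "\n"
  . '>';

$namespaces = example_gather_namespaces($header);
print_r($namespaces);
```

Because the result is keyed by prefix, a prefix declared more than once in the header ends up as a single entry holding the last URI seen.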

2 calls to WordPressBlog::preprocessFile()

WordPressMigrateWizard::sourceDataFormValidate in ./wordpress_migrate.migrate.inc: Fetch and preprocess the uploaded WXR file.
wordpress_migrate_update_7015 in ./wordpress_migrate.install: Updates to legacy wordpress migration arguments.

File

./wordpress.inc, line 357: Implementation of migration from WordPress into Drupal

Class

WordPressBlog

Code

public static function preprocessFile($sourcefile, $destination, $unlink = TRUE, $namespaces_only = FALSE) {

  // Clean up the file in the process of moving it to its final
  // destination.
  $source_handle = fopen($sourcefile, 'r');
  if (!$namespaces_only) {
    $dest_handle = fopen($destination, 'w');
  }

  // First, get the header (everything before the <channel> element) to
  // rewrite the namespaces (skipping any empty lines).
  $header = '';
  while (($line = fgets($source_handle)) !== FALSE) {
    if (trim($line)) {
      $header .= $line;
      if (strpos($line, '<channel>') !== FALSE) {
        break;
      }
    }
  }

  // The excerpt namespace is sometimes omitted; stuff it in if necessary.
  $excerpt_ns = 'xmlns:excerpt="http://wordpress.org/export/1.0/excerpt/"';
  $excerpt_signature = 'xmlns:excerpt="http://wordpress.org/export/';
  $content_ns = 'xmlns:content="http://purl.org/rss/1.0/modules/content/"';
  if (strpos($header, $excerpt_signature) === FALSE) {
    $header = str_replace($content_ns, $excerpt_ns . "\n\t" . $content_ns, $header);
  }

  // Add the Atom namespace, in case it's referenced
  $atom_ns = 'xmlns:atom="http://www.w3.org/2005/Atom"';
  $header = str_replace($content_ns, $atom_ns . "\n\t" . $content_ns, $header);

  // What the hell, throw in iTunes too
  $itunes_ns = 'xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd"';
  $header = str_replace($content_ns, $itunes_ns . "\n\t" . $content_ns, $header);
  preg_match_all('|xmlns:(.+?)="(.+?)"|i', $header, $matches, PREG_SET_ORDER);
  $namespaces = array();
  foreach ($matches as $match) {
    $namespaces[$match[1]] = $match[2];
  }
  if ($namespaces_only) {
    return $namespaces;
  }

  // Replace HTML entities with XML entities
  $header = strtr($header, self::$entityReplacements);
  fputs($dest_handle, $header);

  // Now, do some line-by-line fix-ups: unencoded ampersands and bogus
  // characters.
  while (($line = fgets($source_handle)) !== FALSE) {

    // Handle unencoded ampersands
    $line = preg_replace('/&(?![\\w\\d#]+;)/', '&amp;', $line);

    // Remove control characters (the regex removes the newline, so tack it back on)
    $line = preg_replace('~\\p{C}+~u', '', $line) . "\n";

    // WordPress export doesn't properly format embedded CDATA sections - our
    // quick-and-dirty fix is to remove the terminator of the embedded section
    $line = preg_replace('|// \\]\\]|', '', $line);

    // Replace HTML entities with XML entities
    $line = strtr($line, self::$entityReplacements);
    fputs($dest_handle, $line);
  }
  fclose($dest_handle);
  fclose($source_handle);
  if ($unlink) {
    unlink($sourcefile);
  }
  return $namespaces;
}
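
To make the per-line fix-ups easier to follow, here is a minimal standalone sketch of two of them (ampersand escaping and control-character removal) applied to a single made-up line. The helper name example_clean_line() and the sample input are hypothetical; the entity replacement step is left out because self::$entityReplacements is not shown in this listing.

```php
<?php
// Hypothetical helper (not part of wordpress_migrate) mirroring two of the
// per-line fix-ups from the listing above.
function example_clean_line($line) {
  // Escape ampersands that do not already start an entity such as &amp;
  // or &#8217;.
  $line = preg_replace('/&(?![\w\d#]+;)/', '&amp;', $line);

  // Strip characters in the Unicode "Other" category (control characters
  // among them). This also removes the trailing newline, so append it again.
  $line = preg_replace('~\p{C}+~u', '', $line) . "\n";

  return $line;
}

// Made-up input: one bare ampersand, two existing entities, one BEL character.
$cleaned = example_clean_line("Tom & Jerry &amp; friends\x07 &#8217;\n");
echo $cleaned;
```

Only the bare ampersand is rewritten; existing entities are left alone by the negative lookahead. Note that \p{C} also matches tabs, so any tab characters inside a line are dropped as well.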
