public static function UTF8Utils::convertToUTF8 in Zircon Profile 8.0

Same name and namespace in other branches
  1. 8 vendor/masterminds/html5/src/HTML5/Parser/UTF8Utils.php \Masterminds\HTML5\Parser\UTF8Utils::convertToUTF8()

Convert data from the given encoding to UTF-8.

This has not yet been tested with character sets other than UTF-8. It should work with ISO-8859-1/-13 and standard Latin Win charsets.

Parameters

string $data: The data to convert.

string $encoding: A valid encoding (defaults to 'UTF-8'). For supported values, see http://www.php.net/manual/en/mbstring.supported-encodings.php

2 calls to UTF8Utils::convertToUTF8()
StringInputStream::__construct in vendor/masterminds/html5/src/HTML5/Parser/StringInputStream.php
Create a new InputStream wrapper.
UTF8UtilsTest::testConvertToUTF8 in vendor/masterminds/html5/test/HTML5/Parser/UTF8UtilsTest.php

File

vendor/masterminds/html5/src/HTML5/Parser/UTF8Utils.php, line 77

Class

UTF8Utils
UTF-8 Utilities

Namespace

Masterminds\HTML5\Parser

Code

public static function convertToUTF8($data, $encoding = 'UTF-8') {

  /*
   * From the HTML5 spec: Given an encoding, the bytes in the input stream
   * must be converted to Unicode characters for the tokeniser, as described
   * by the rules for that encoding, except that the leading U+FEFF BYTE
   * ORDER MARK character, if any, must not be stripped by the encoding layer
   * (it is stripped by the rule below). Bytes or sequences of bytes in the
   * original byte stream that could not be converted to Unicode characters
   * must be converted to U+FFFD REPLACEMENT CHARACTER code points.
   */

  // mb_convert_encoding is chosen over iconv because of a bug. The best
  // details for the bug are on http://us1.php.net/manual/en/function.iconv.php#108643
  // which contains links to the actual bug reports as well as workaround
  // details.
  if (function_exists('mb_convert_encoding')) {

    // mb library has the following behaviors:
    // - UTF-16 surrogates result in false.
    // - Overlongs and outside Plane 16 result in empty strings.
    // Before we run mb_convert_encoding we need to tell it what to do with
    // characters it does not know. This could be different than the parent
    // application executing this library so we store the value, change it
    // to our needs, and then change it back when we are done. This feels
    // a little excessive and it would be great if there was a better way.
    $save = ini_get('mbstring.substitute_character');
    ini_set('mbstring.substitute_character', "none");
    $data = mb_convert_encoding($data, 'UTF-8', $encoding);
    ini_set('mbstring.substitute_character', $save);
  }
  elseif (function_exists('iconv') && $encoding != 'auto') {

    // fprintf(STDOUT, "iconv found\n");
    // iconv has the following behaviors:
    // - Overlong representations are ignored.
    // - Beyond Plane 16 is replaced with a lower char.
    // - Incomplete sequences generate a warning.
    $data = @iconv($encoding, 'UTF-8//IGNORE', $data);
  }
  else {

    // we can make a conforming native implementation
    throw new Exception('Not implemented, please install mbstring or iconv');
  }

  /*
   * One leading U+FEFF BYTE ORDER MARK character must be ignored if any are present.
   */
  if (substr($data, 0, 3) === "\xEF\xBB\xBF") {
    $data = substr($data, 3);
  }
  return $data;
}
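
Example

The mbstring branch and the byte order mark handling above can be tried in isolation with a minimal sketch like the one below. It assumes the mbstring extension is loaded. The function name sketch_convert_to_utf8 and the sample inputs are hypothetical and not part of the library. The try/finally block is a variation on the original save-and-restore pattern, not what the library itself does.

```php
<?php
// Hypothetical helper for illustration only; not part of Masterminds\HTML5.
// Assumes the mbstring extension is available.
function sketch_convert_to_utf8(string $data, string $encoding = 'UTF-8'): string
{
    // Remember the calling application's substitute-character setting.
    $saved = ini_get('mbstring.substitute_character');
    ini_set('mbstring.substitute_character', 'none');
    try {
        $data = mb_convert_encoding($data, 'UTF-8', $encoding);
    } finally {
        // Restore the setting even if the conversion throws.
        ini_set('mbstring.substitute_character', $saved);
    }

    // Drop one leading UTF-8 byte order mark (bytes EF BB BF), if present.
    if (substr($data, 0, 3) === "\xEF\xBB\xBF") {
        $data = substr($data, 3);
    }
    return $data;
}

// "café" in ISO-8859-1: the final byte 0xE9 becomes the two bytes 0xC3 0xA9.
$utf8 = sketch_convert_to_utf8("caf\xE9", 'ISO-8859-1');

// Input that is already UTF-8 and starts with a byte order mark.
$noBom = sketch_convert_to_utf8("\xEF\xBB\xBFabc");
```

Here $utf8 holds the five bytes "caf\xC3\xA9" and $noBom holds "abc". Restoring the ini value in a finally block keeps the parent application's setting intact if mb_convert_encoding rejects the encoding name.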