Handling encoding errors when reading XML with PHP

I'm using XMLReader to parse XML from a 3rd party. The files are supposed to be UTF-8, but I'm getting this error:

parser error : Input is not proper UTF-8, indicate encoding !

Bytes: 0x11 0x72 0x20 0x41 in C:\file.php on line 166

Looking at the XML file in Notepad++, it's clear what's causing this: the problematic line contains a DC1 control character (0x11).

The XML file is provided by a 3rd party who I cannot reliably get to fix this/ensure it doesn't happen in the future. Could someone recommend a good way of dealing with this? I'd like to just do away with the control character -- in this particular case just deleting it from the XML file is fine -- but am concerned that always doing this could lead to unforeseen problems down the road. Thanks.

asked Aug 27, 2011 at 15:08

Having code point U+0011, DEVICE CONTROL ONE, or Control-Q, in a UTF-8 stream is not illegal or even improper UTF-8. That's a perfectly valid code point as far as Unicode is concerned. XML may be a different story.

You may benefit from this Q&A.

3 Answers

Why can't the 3rd party reliably fix this issue? If they have illegal characters in their XML, I would wager that it's a valid issue.

Having said that, why not just remove the character before you parse it using str_replace?
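As a minimal sketch of that idea, the snippet below strips the C0 control bytes that XML 1.0 does not allow (everything below 0x20 except tab, line feed, and carriage return) using str_replace() with an array of characters. The helper names and the sample string are made up for illustration. It does not cover other forbidden code points such as U+FFFE or U+FFFF.

```php
<?php
// Hypothetical helper: lists the C0 control bytes that XML 1.0 forbids
// (0x00-0x1F except tab 0x09, line feed 0x0A and carriage return 0x0D).
function xml_illegal_control_chars(): array
{
    $allowed = [0x09, 0x0A, 0x0D];
    $codes = array_diff(range(0x00, 0x1F), $allowed);
    return array_map('chr', array_values($codes));
}

// Hypothetical helper: removes those bytes from a raw XML string.
// Byte-wise replacement is safe for valid UTF-8 because bytes below 0x20
// never occur inside a multi-byte sequence.
function strip_xml_control_chars(string $xml): string
{
    return str_replace(xml_illegal_control_chars(), '', $xml);
}

// Sample input mirroring the bytes from the error (0x11 0x72 0x20 0x41).
$raw = "<title>Chapte\x11r A</title>";
$clean = strip_xml_control_chars($raw);
echo $clean, PHP_EOL;
```

This loads the whole document into memory, so it suits small inputs better than very large feeds.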

answered Aug 27, 2011 at 19:08

The 3rd party is a massive company providing the XML via an RSS feed, and emailing them yielded a generic "we'll get back to you". Further, if they made this mistake once then I'd like to err on the side of caution and assume it could happen again, regardless of what they said. - The Mythical Bird

Ah, I see (and agree regarding the forward thinking). Would str_replace work for you pre-parse? - Demian Brecht

Using str_replace (or the like) is an option. But I don't know a lot about Unicode and am not sure how many potentially problematic characters there are like this one. If there are a lot, efficiency might become an issue since the XML files are large (>100 MB). - The Mythical Bird

Alternatively, you could implement your own parser and strip the characters as each line is parsed (illegal characters stored in an array). You're kind of bound to the overhead either way unless the company is willing to fix their bug. - Demian Brecht
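One way to address the efficiency concern raised in these comments without writing a parser is a PHP user stream filter that cleans each chunk as XMLReader reads it. The following is only a sketch under assumptions: the class name, the filter name, and the temporary demo file are invented here, and it relies on XMLReader::open() accepting a php://filter URI through PHP's stream wrappers.

```php
<?php
// Hypothetical stream filter: drops control bytes that XML 1.0 forbids from
// each chunk as it is read, so the whole file never has to sit in memory.
class StripXmlControlsFilter extends php_user_filter
{
    public function filter($in, $out, &$consumed, $closing): int
    {
        while ($bucket = stream_bucket_make_writeable($in)) {
            $consumed += $bucket->datalen;
            // Byte-wise matching is safe: bytes below 0x20 never occur inside
            // a multi-byte UTF-8 sequence, even at chunk boundaries.
            $bucket->data = preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F]/', '', $bucket->data);
            stream_bucket_append($out, $bucket);
        }
        return PSFS_PASS_ON;
    }
}

// The filter name is made up for this example.
stream_filter_register('strip_xml_controls', StripXmlControlsFilter::class);

// Demo input: a small temp file standing in for the large third-party feed.
$path = tempnam(sys_get_temp_dir(), 'xml');
file_put_contents($path, "<feed><title>Chapte\x11r A</title></feed>");

$reader = new XMLReader();
$reader->open('php://filter/read=strip_xml_controls/resource=' . $path);

// Collect text nodes to show that parsing now succeeds.
$titles = [];
while ($reader->read()) {
    if ($reader->nodeType === XMLReader::TEXT) {
        $titles[] = $reader->value;
    }
}
$reader->close();
unlink($path);
```

The filter touches each byte once, so the extra cost grows linearly with file size and memory use stays bounded by the chunk size.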

You can use str_replace() provided that the string is valid UTF-8. Note that str_replace() works on bytes rather than characters, so you are treating the data as a byte string and not as text.

And there is the rub: if your 3rd party includes random whitespace and control characters that serve no purpose in XML, you might as well assume they eventually break UTF-8. So you can't use str_replace() with confidence (only in good faith) until you have ascertained that their current dump of the day is not entirely useless.
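A hedged sketch of how one might ascertain that before stripping anything: check the raw bytes with mb_check_encoding(). This assumes the mbstring extension is available; the helper name and sample strings are illustrative only.

```php
<?php
// Hypothetical helper: reports whether a byte string is well-formed UTF-8.
// Assumes the mbstring extension is loaded.
function is_valid_utf8(string $bytes): bool
{
    return mb_check_encoding($bytes, 'UTF-8');
}

$good    = "caf\xC3\xA9";  // "café" encoded as UTF-8
$bad     = "caf\xE9";      // the same word in ISO-8859-1: not valid UTF-8
$control = "Chapte\x11r";  // DC1 is valid UTF-8, even though XML 1.0 rejects it

$results = [
    'good'    => is_valid_utf8($good),
    'bad'     => is_valid_utf8($bad),
    'control' => is_valid_utf8($control),
];
var_dump($results);
```

Note that the control character passes this check, so validating the encoding and stripping characters that XML forbids are two separate steps.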

Maybe you could take a shortcut and stuff it in a libxml DOMDocument object and suppress errors with @, leaving the libxml library to deal with errors. Something like:

$doc = new DOMDocument();
if (@$doc->loadXML($raw_string)) {
  // Document is loaded; time to normalize() it.
} else {
  throw new Exception("This data is junk");
}
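As a variation on the snippet above, libxml_use_internal_errors() can be used in place of the @ operator so that the parser's complaints are collected rather than discarded. This is a sketch, not the answerer's method; the sample input is made up.

```php
<?php
// Sample input containing the forbidden DC1 byte.
$raw_string = "<item>Chapte\x11r A</item>";

// Collect libxml errors internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
$loaded = $doc->loadXML($raw_string);

// Copy the errors out, then restore the previous error-handling mode.
$errors = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors($previous);

if (!$loaded) {
    foreach ($errors as $error) {
        // Each entry is a LibXMLError with message, line and column fields.
        echo trim($error->message), ' (line ', $error->line, ')', PHP_EOL;
    }
}
```

Keeping the error list makes it possible to log which line of the feed was rejected before deciding whether to clean it or discard it.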

answered Aug 27, 2011 at 20:08

Why are you and the third party exchanging data in XML? Presumably both parties expect to get some benefits by using XML rather than some random proprietary format. If you allow them to get away with generating bad XML (I prefer to call it non-XML), then neither party is getting these benefits. It's in their interests to mend their ways. Try to convince them of this.

answered Aug 28, 2011 at 00:08
