Why is XmlParser converting my character hex code string to Unicode?

In my Grails application I use Groovy's XmlParser to parse an XML file. The value of one of the attributes in my XML file is a string that equals a character hex code. I want to save that string in my database:

&#xD1;

Unfortunately, the attribute method returns the Ñ character, and what actually gets stored in the database is c391. When the field is read back out I also get the Ñ character, which is not what I want.

How can I store the hex code as a string in my database and make sure it gets read back out as a hex code as well?

Update #1:

The reason this is a problem for me is that once I read the XML file into my database I must be able to reconstruct it exactly as it was. An additional problem is that the field in question isn't always a character hex code. It could just be some arbitrary string.

Update #2:

I guess it doesn't matter how the character is stored in the database, so long as I can write it back out in its expanded hex code format. I am using Groovy's MarkupBuilder to reconstruct my XML file from the database, and I am unclear why this isn't happening by default.

Update #3:

I overrode getTableTypeString in my custom MySQL dialect and that seems to have helped somewhat. At least now the value I pass to MySQL is the value that gets stored in the database.

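import org.hibernate.dialect.MySQL5InnoDBDialect

// Create new tables as InnoDB with a UTF-8 default character set.
class CustomMySQL5InnoDBDialect extends MySQL5InnoDBDialect {
    @Override
    public String getTableTypeString() {
        return " ENGINE=InnoDB DEFAULT CHARSET=utf8"
    }
}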

I also created my own version of groovy.util.XmlParser. My version is pretty much an exact duplicate of groovy.util.XmlParser except that in the startElement method I changed:

String value = list.getValue(i)

to this:

def value = list.fAttributes.fAttributes[i].nonNormalizedValue
if(value ==~ /&#x([0-9A-F]+?);/) {
    value = list.fAttributes.fAttributes[i].nonNormalizedValue
}

This allows the exact text of hex code elements to be stored in the database.

Now there are two new problems, possibly three.

  1. Recreating a file with the exact values stored in the database. Up till now I had been using MarkupBuilder, but that is doing extra encoding on ampersands, causing the value &#xD1; to be written out as &amp;#xD1;. I can probably get around this by abandoning MarkupBuilder and building my XML strings manually, but I would rather not.

  2. Running an XSLT transform on an XML file using the Saxon-HE 9.4 processor causes some hex code values such as &#xFF; to be changed to something like ÿ, yet others like ™ are left unchanged.

  3. I'm not sure if this is going to be a problem yet or not, but when I recreate the file I would like it to be in ANSI encoding since that is the encoding used for the original file.

Asked on September 26, 2013 at 23:09

2 Answers

Ok, so given the xml:

def xml = '''<root>
    <node woo="&#xD1;"/>
    <another attr="This is an N-Tilde - &#xD1;"/>
</root>'''

We can read that attribute into a variable:

def woo = new XmlParser().parseText( xml ).node[0].@woo

And printing it out gives us 'Ñ' (with a character value of 209)

But that's what I'd expect... as &#xD1; is the same as &#209;, which is the correct encoding for N-tilde
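A quick way to convince yourself of that is to parse both spellings and compare what comes back. This is only a minimal sketch: the element and attribute names are made up for the check, and the import assumes Groovy 3 or later, where XmlParser lives in groovy.xml (on the 2.x releases it is groovy.util.XmlParser and needs no import).

```groovy
import groovy.xml.XmlParser

// Both spellings are character references for U+00D1, so the parser
// returns the same one-character string for each attribute.
def doc = new XmlParser().parseText('<root><a v="&#xD1;"/><b v="&#209;"/></root>')

def fromHex = doc.a[0].@v
def fromDec = doc.b[0].@v

println fromHex == fromDec          // same string either way
println((int) fromHex.charAt(0))    // code point of the single character
```

Once parsing is done there is nothing left in the attribute value that records which spelling the source file used.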

Ahhh, so is the question "How can I read attributes, and keep them as-is without any entity resolving"?

I don't believe you can (all I've seen is negative answers from a search of the web)... What you could do is something like:

// Mask the hex character references so the parser leaves them alone
xml = xml.replaceAll( /&#x([0-9A-F]+?);/, '!!#x$1;' )

def parser = new XmlParser().parseText( xml )

// Unmask them again when reading the attribute values back
println parser.node[0].@woo.replaceAll( /!!#x([0-9A-F]+?);/, '&#x$1;' )
println parser.another[0].@attr.replaceAll( /!!#x([0-9A-F]+?);/, '&#x$1;' )

But as far as I know, there's not a method for turning off entity resolution :-( (fingers crossed I'm wrong)

Answered on September 27, 2013 at 10:09

The value of one of the attributes in my XML file is a string that equals a character hex code

No it isn't. The representation of the attribute value in the original XML is a hexadecimal character reference, but the value of the attribute is the character Ñ. There are ways to configure some XML parsers to avoid expanding named entity references during parsing, but they must expand numeric character references as per the XML spec.

You haven't said why storing the real character value is a problem. If it's to do with rendering the value to a browser then that can be handled by using .encodeAsHTML() at output time. If you need to save the value to another XML file then use an XML API to do so and it will handle the encoding issues for you, replacing characters with entities or character references where this is required to keep the result well-formed (in the case of Ñ it doesn't need to be escaped anyway unless you're writing XML in an unusual character set).
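To illustrate what "handle the encoding issues for you" means, here is a rough sketch using the JDK's own DOM and Transformer classes rather than anything Groovy-specific. The element and attribute names are just placeholders, and the exact form of the character reference (decimal or hex) is up to whichever serializer is on your classpath, so don't rely on it.

```groovy
import javax.xml.parsers.DocumentBuilderFactory
import javax.xml.transform.OutputKeys
import javax.xml.transform.TransformerFactory
import javax.xml.transform.dom.DOMSource
import javax.xml.transform.stream.StreamResult

// Build a tiny DOM whose attribute value is the real character U+00D1.
def builder = DocumentBuilderFactory.newInstance().newDocumentBuilder()
def dom = builder.newDocument()
def root = dom.createElement('root')
root.setAttribute('woo', '\u00D1')
dom.appendChild(root)

// Serialize the same DOM with a given output encoding and decode the bytes back to text.
def serialize = { String encoding ->
    def transformer = TransformerFactory.newInstance().newTransformer()
    transformer.setOutputProperty(OutputKeys.ENCODING, encoding)
    def bytes = new ByteArrayOutputStream()
    transformer.transform(new DOMSource(dom), new StreamResult(bytes))
    new String(bytes.toByteArray(), encoding)
}

def asUtf8  = serialize('UTF-8')     // can carry the character literally
def asAscii = serialize('US-ASCII')  // must fall back to a character reference

println asUtf8
println asAscii
```

Both outputs parse back to the same attribute value; only the lexical form in the file differs.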

In the specific case of Groovy's MarkupBuilder you can temporarily escape from XML mode and write hand-constructed markup directly to the output stream using mkp.yieldUnescaped, which would let you output a character reference somewhere the builder wouldn't normally bother.
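Something along these lines, as a sketch only. The escapeAttr closure is a hypothetical helper of my own, not part of Groovy: it rewrites every non-ASCII character as an uppercase hex character reference and escapes the characters that are special inside a double-quoted attribute. Because the reference has to live inside an attribute, the whole tag is written by hand through mkp.yieldUnescaped.

```groovy
import groovy.xml.MarkupBuilder

// Hypothetical helper (not a Groovy API): escape a value for use inside a
// double-quoted attribute, writing non-ASCII characters as hex references.
def escapeAttr = { String s ->
    def special = ['&': '&amp;', '<': '&lt;', '"': '&quot;']
    def out = new StringBuilder()
    int i = 0
    while (i < s.length()) {
        int cp = s.codePointAt(i)
        if (cp > 127) {
            out << '&#x' << Integer.toHexString(cp).toUpperCase() << ';'
        } else {
            String ch = new String(Character.toChars(cp))
            out << (special[ch] ?: ch)
        }
        i += Character.charCount(cp)
    }
    out.toString()
}

def value = escapeAttr('\u00D1')

def writer = new StringWriter()
def xml = new MarkupBuilder(writer)
xml.root {
    // Passing the value through the normal attribute map would escape the
    // ampersand again, so the tag is emitted verbatim instead.
    mkp.yieldUnescaped "<node woo=\"${value}\"/>"
}

println writer
```

Note that this normalizes every non-ASCII character to one spelling. It cannot reproduce a file that mixed literal characters, decimal references and hex references, because that information is gone after parsing.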

Answered on September 29, 2013 at 12:09

I updated my question to indicate why this is a problem for me. - ubiquibacón

@ubiquibacon if your code cares about the difference between Ñ, &#xD1;, &#xd1;, etc. then you can't use an XML tool to parse the data. An XML parser simply will not tell you which lexical representation was used in the original source. - Ian Roberts

I have added some new information to my question. It doesn't look like I can use any SAX-based parser to read the character hex code as a string, but maybe you know a way I can make Groovy's MarkupBuilder (or equivalent) write out offending characters in their expanded hex code format. - ubiquibacón

@ubiquibacon an XML API will escape anything that needs to be escaped and not anything that doesn't. If you're writing XML in UTF-8 then Ñ can be written without escaping. If you write it as US-ASCII then it would get escaped as &#xD1; or &#209; or some other equivalent character reference. I'll say it again - if you care about this level of detail then you're not dealing with XML and you can't use an XML tool, instead you'll have to construct the markup yourself as strings. - Ian Roberts

I am writing the XML in UTF-8 and Ñ gets written to the file. Besides my requirement to be able to generate XML identical to what I had as an input the Saxon-HE parser chokes on that character. That is how I discovered this issue. - ubiquibacón
