Problems with simple Java DOM parsing

Could someone please explain why this is happening? I have simplified my problem by creating a simple program; the details of the problem I'm facing are below:

String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n" +
"<title text=\"title1\">\n" +
"    <comment id=\"comment1\">\n" +
"        <data> abcd </data>\n" +
"        <data> efgh </data>\n" +
"    </comment>\n" +
"    <comment id=\"comment2\">\n" +
"        <data> ijkl </data>\n" +
"        <data> mnop </data>\n" +
"        <data> qrst </data>\n" +
"    </comment>\n" +
"</title>\n";

DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
Document doc = builder.parse(new InputSource(new StringReader(xml)));

System.out.println(doc.getFirstChild().getNodeName());
System.out.println(doc.getFirstChild().getFirstChild().getNodeName());

The corresponding output is:

title
#text

Firstly, why can't I get the comment node?

Secondly, why does the data node get interpreted as a #text node?

What would be the correct and simple way to get the required nodes? Please also note that the XML file is not fixed; I want a solution that works for arbitrary XML. Thanks.

EDIT:

I get a similar problem when using XPath; see the code below:

XPathFactory factory = XPathFactory.newInstance();
XPath xpath = factory.newXPath();
XPathExpression expr = xpath.compile("/title/comment/data/text()");
NodeList result = (NodeList) expr.evaluate(msg.document(), XPathConstants.NODESET);
for(int i = 0; i < result.getLength(); i++)
    System.out.println(result.item(i).getNodeName() + " : " + result.item(i).getNodeValue());

This gives the output:

#text :  abcd 
#text :  efgh 
#text :  ijkl 
#text :  mnop 
#text :  qrst 

asked Aug 27 '11 at 15:08

2 Answers

The first child of the title node is a text node containing the \n and the four spaces that precede the <comment> element.

To get the comment node, ask its parent for its second child, or for its first element with the tag name "comment". You may also loop through the children and return the first node of type ELEMENT_NODE.
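A minimal sketch of the loop described above, assuming a trimmed-down version of the XML from the question. The class name FirstElementChild and the helpers firstElementChild and parse are hypothetical names introduced only for this illustration; the DOM calls themselves are standard org.w3c.dom API.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class FirstElementChild {

    // Hypothetical helper: returns the first child that is an element,
    // skipping whitespace text nodes (and comments, etc.), or null if none.
    static Element firstElementChild(Node parent) {
        for (Node child = parent.getFirstChild(); child != null; child = child.getNextSibling()) {
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                return (Element) child;
            }
        }
        return null;
    }

    // Hypothetical helper: parses an XML string the same way the question does.
    static Document parse(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        String xml = "<title text=\"title1\">\n"
                + "    <comment id=\"comment1\">\n"
                + "        <data> abcd </data>\n"
                + "    </comment>\n"
                + "</title>\n";

        Element title = parse(xml).getDocumentElement();

        // The raw first child is the whitespace text node.
        System.out.println(title.getFirstChild().getNodeName());   // #text

        // Skipping non-element nodes reaches the comment element.
        Element comment = firstElementChild(title);
        System.out.println(comment.getNodeName());                  // comment
        System.out.println(comment.getAttribute("id"));             // comment1

        // Alternative: look up by tag name (this searches all descendants).
        Node byTag = title.getElementsByTagName("comment").item(0);
        System.out.println(byTag.isSameNode(comment));              // true
    }
}
```

Because the helper only checks the node type, it does not depend on the element names, so it should work for XML whose structure is not known in advance.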

<data> is an element node containing a text node. The value of the text node is " abcd ".

answered Aug 27 '11 at 19:08

thanks, I added some other code I am using for Xpath. Still can’t understand why I can’t get the data node. I am expecting this to be a node where nodename is data and textcontent is “abcd”. Am I understanding something wrong? - Larry

You're asking for .../data/text(), so it returns the text nodes. Ask for .../data, and it will return the data elements. Each of the returned data elements will have a single child, which will be a text node. - JB Nizet

thanks this finally makes sense... likewise, for each data element I can call .getTextContent() which also returns the inner text node value - Larry
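The exchange above can be sketched as follows, assuming the XML from the question is parsed from a string rather than obtained from msg.document(). The class name XPathDataElements and the method describeDataElements are hypothetical; the trim() call is an addition here, used only because the asker expects "abcd" without the surrounding spaces.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathDataElements {

    // Hypothetical helper: selects the data elements themselves (no text() step)
    // and returns one "name : text" line per element.
    static List<String> describeDataElements(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList result = (NodeList) xpath.evaluate(
                "/title/comment/data", doc, XPathConstants.NODESET);

        List<String> lines = new ArrayList<>();
        for (int i = 0; i < result.getLength(); i++) {
            Node data = result.item(i);
            // getTextContent() concatenates the text of all descendant text nodes;
            // here each data element has exactly one text child.
            lines.add(data.getNodeName() + " : " + data.getTextContent().trim());
        }
        return lines;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<title text=\"title1\">\n"
                + "    <comment id=\"comment1\">\n"
                + "        <data> abcd </data>\n"
                + "        <data> efgh </data>\n"
                + "    </comment>\n"
                + "</title>\n";

        for (String line : describeDataElements(xml)) {
            System.out.println(line);
        }
        // data : abcd
        // data : efgh
    }
}
```

Note that the absolute XPath expression is unaffected by the whitespace text nodes, since it selects only elements with the given names.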

@JB Nizet's explanation of what is happening is correct.

One possible workaround would be to configure the parser to ignore "ignorable whitespace" by calling setIgnoringElementContentWhitespace(true) on the DocumentBuilderFactory. As I understand it, this will cause the parser not to generate those unwanted Text nodes for the whitespace between the tags.

answered Aug 27 '11 at 19:08

AFAIK this works only in "validating mode", so you must supply some DTD/XML Schema definition ("this setting requires the parser to be in validating mode"). - Grzegorz Szpetkowski
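A rough sketch of how the workaround and the caveat above might fit together, assuming the parser bundled with the JDK. The internal DTD is not part of the original question; it is added here purely as an assumption so that the parser can tell which whitespace is element-content whitespace. The class name IgnoreWhitespaceDemo and the method parseRoot are likewise hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class IgnoreWhitespaceDemo {

    // Assumption: an internal DTD declaring that title and comment contain
    // only child elements, so whitespace between their tags is "ignorable".
    static final String XML =
            "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
            + "<!DOCTYPE title [\n"
            + "  <!ELEMENT title (comment*)>\n"
            + "  <!ATTLIST title text CDATA #IMPLIED>\n"
            + "  <!ELEMENT comment (data*)>\n"
            + "  <!ATTLIST comment id CDATA #IMPLIED>\n"
            + "  <!ELEMENT data (#PCDATA)>\n"
            + "]>\n"
            + "<title text=\"title1\">\n"
            + "    <comment id=\"comment1\">\n"
            + "        <data> abcd </data>\n"
            + "    </comment>\n"
            + "</title>\n";

    // Hypothetical helper: parses in validating mode and drops
    // element-content whitespace, then returns the root element.
    static Element parseRoot(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(true);                        // needs a DTD in the document
        factory.setIgnoringElementContentWhitespace(true);  // skip ignorable whitespace
        DocumentBuilder builder = factory.newDocumentBuilder();
        // getDocumentElement() is used because, with a DOCTYPE present,
        // the document's first child is the DocumentType node, not the root.
        return builder.parse(new InputSource(new StringReader(xml))).getDocumentElement();
    }

    public static void main(String[] args) throws Exception {
        Element title = parseRoot(XML);
        System.out.println(title.getFirstChild().getNodeName());
        System.out.println(title.getFirstChild().getFirstChild().getNodeName());
    }
}
```

Under these assumptions the two lines printed should be the comment and data element names rather than #text. Since the question says the XML is not fixed, a DTD may not be available, in which case skipping non-element children in code is probably the more practical route.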
