Efficient way to read a small part of a BIG XML file in Java

We have a new requirement:

Some BIG XML files keep coming into our system, and we need to process them immediately and quickly using Java. Each file is huge, but the information required for our processing is inside an element which is very small. ... ...

What is the best way to extract this small portion of the data from the huge file before we start processing? If we try to load the entire file, we immediately get an out-of-memory error because of its size. What is an efficient way in Java to get the ..data..data..data.. data element without loading the whole file or reading it line by line? Is there a SAX parser that I can use to get this done?

Thanks

asked Aug 24 '12 at 20:08

4 Answers

SAX parsers are event-based and are much faster because they do what you need: they don't load the entire XML document into memory. A SAXParser is included in the Java distribution.

answered Aug 24 '12 at 20:08

What would be your recommended method of halting the parsing once you've found the part you're interested in? IMHO the callback model that SAX uses doesn't lend itself well to that. - Alex

I'll have to agree with that. It is also important that you define the handler methods carefully (efficient and minimal code), otherwise you may end up building a solution that is not much better than a DOM based one. - Dan D.
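One common way to halt a SAX parse early, which is what the comments above ask about, is to throw a dedicated exception from the handler as soon as the target element has been read. The following is a minimal sketch of that idea, not a definitive implementation. The class name `SaxEarlyExit`, the method `firstElementText` and the exception `FoundException` are hypothetical names chosen for illustration, and the sketch assumes the target element contains only text and is not nested inside another element of the same name.

```java
import java.io.InputStream;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper; all names here are illustrative assumptions.
class SaxEarlyExit {

    // Marker exception used only to abort parsing once the target has been read.
    static final class FoundException extends SAXException {
        final String text;

        FoundException(String text) {
            super("target element found");
            this.text = text;
        }
    }

    // Returns the text of the first element named targetName, or null if there is none.
    static String firstElementText(InputStream in, String targetName) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();

        DefaultHandler handler = new DefaultHandler() {
            private StringBuilder buffer; // non-null only while inside the target element

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if (buffer == null && qName.equals(targetName)) {
                    buffer = new StringBuilder();
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (buffer != null) {
                    buffer.append(ch, start, length); // may be called several times per element
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) throws SAXException {
                if (buffer != null && qName.equals(targetName)) {
                    // Abort the parse: the rest of the file is never processed.
                    throw new FoundException(buffer.toString());
                }
            }
        };

        try {
            parser.parse(in, handler);
        } catch (FoundException e) {
            return e.text;
        }
        return null; // reached the end of the document without finding the element
    }
}
```

A call such as `SaxEarlyExit.firstElementText(new FileInputStream(file), "data")` would then stop at the first closing `data` tag. Keeping the handler this small is in line with the advice above about minimal handler code.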

I think XMLStreamReader (StAX) might be a better fit here. Since it gives you an iterator, you can just loop until you find what you're looking for, read it, then close the reader. - Alex
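The loop described in this comment could look roughly like the sketch below: iterate over the stream, stop at the first matching start tag, read its text, then close the reader. The class name `StaxExtract` and the method `firstElementText` are hypothetical, and the sketch assumes the target element contains only text, since `getElementText()` throws an exception if it meets a child element.

```java
import java.io.InputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper; all names here are illustrative assumptions.
class StaxExtract {

    // Returns the text of the first element named targetName, or null if there is none.
    static String firstElementText(InputStream in, String targetName) throws XMLStreamException {
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(in);
        try {
            while (reader.hasNext()) {
                // Pull the next event and check whether it opens the element we want.
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && reader.getLocalName().equals(targetName)) {
                    // Reads up to the matching end tag; nothing after it is parsed.
                    return reader.getElementText();
                }
            }
            return null; // end of document, element not found
        } finally {
            // Frees parser resources; it does not close the underlying InputStream.
            reader.close();
        }
    }
}
```

Because the caller drives the loop, stopping early is just a `return`, with no exception needed to break out of a callback.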

I had to parse huge files (1-2 GB) in a previous project and didn't want to deal with SAX. I find SAX too low-level in some instances and prefer keeping a traversal approach in most cases.

I have used the VTD library http://vtd-xml.sourceforge.net/. It's an EXTREMELY fast library that uses pointers to navigate through the document.

answered Aug 24 '12 at 21:08

Well, if you want to read a part of a file, you will still need to read each line of the file to be able to identify the part of interest and then extract what you need.

If you only need a small portion of the incoming XML, you can either use SAX, or if you need to read only specific elements or attributes, you could use XPath, which would be a lot simpler to implement.

Java comes with a built-in SAXParser implementation as well as an XPath implementation. Find the javadocs for SAXParser here and for XPath here.
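As a rough illustration of the XPath route, the sketch below evaluates an expression directly against an `InputSource` using the standard `javax.xml.xpath` API. The class name `XPathExtract` and the expression `/root/data` in the usage note are assumptions made for the example, not taken from the question. As far as I know, the JDK's bundled implementation parses the `InputSource` into an in-memory tree before evaluating, so this is probably only suitable when the document fits in memory.

```java
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical helper; all names here are illustrative assumptions.
class XPathExtract {

    // Evaluates the expression against the source and returns the result as a string.
    // For a node-set result this is the string value of the first node in document order.
    static String evaluate(String expression, InputSource source) throws XPathExpressionException {
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate(expression, source);
    }
}
```

For instance, `XPathExtract.evaluate("/root/data", new InputSource(new FileInputStream(file)))` would return the text of the first `data` child of a `root` element, or an empty string if nothing matches.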

answered Aug 24 '12 at 20:08

Would XPath keep stuff in memory as it drills down the path? - Miserable Variable

@MiserableVariable Java's built-in XPath implementation accepts both DOM elements (like Document, Node, etc.) as well as InputSource objects (which are backed by SAX). So it depends on how you use it. - Rajesh J Advani

@Alex Considering that there are no changes to any of the linked classes since Java 5, does it matter what version of the documentation is linked? - Rajesh J Advani

StAX is another option based on streaming the data, like SAX, but it benefits from a friendlier approach (IMO) to processing the data: you "pull" what you want rather than having it "pushed" to you.

answered Aug 25 '12 at 21:08
