Random Forest classifier segmentation fault

I have been trying to run the RF classifier on a data set of ~50,000 entries with 20 or so labels, which I thought should be fine, but I keep hitting the following when trying to fit:

Exception MemoryError: MemoryError() in 'sklearn.tree._tree.Tree._resize' ignored
Segmentation fault (core dumped)

The data set has been passed through the TfidfVectorizer and then TruncatedSVD with n=100 for dimensionality reduction. RandomForestClassifier is running with n_jobs=1 and n_estimators=10 in an attempt to find the minimum settings at which it will work. The system has 4GB of RAM, and RF has worked in the past on a similar data set with a much higher number of estimators. scikit-learn is at the current version, 0.14.1.

Any advice?

Many thanks

asked Nov 27 '13 at 01:11

Try running it on a smaller amount of data or with a lower n, and see if the same error comes up. I also have 4GB of RAM and I have used random forests on some pretty large data sets (not this large) and have never gotten this error. Please tell me if you still get the same error after lowering those.

You should report this on the scikit-learn bug tracker if it is reproducible.

I would try reducing the size of the data set, as suggested by @RyanSaxe. I have run RF on data sets much larger than this, but I had access to a very large amount of RAM.

I agree with IanSR that this is a bug. However, the memory required by a random forest depends not just on the number of samples and the dimensionality of the data but also on the complexity of the data. If each decision tree has more nodes, the forest becomes larger. The extreme case of a very small forest is data that can be perfectly separated by a single decision node.
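To illustrate how label complexity alone drives tree size, here is a minimal sketch on synthetic data (not the asker's data set). It uses a single `DecisionTreeClassifier` rather than a full forest so the node counts are easy to read, and compares labels set by one threshold with purely random labels on the same features:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X = rng.randn(1000, 5)  # synthetic features, same for both cases

# Case 1: labels depend on a single threshold on one feature.
y_simple = (X[:, 0] > 0).astype(int)
# Case 2: labels are pure noise, so the tree has to memorise samples.
y_noisy = rng.randint(0, 2, size=1000)

simple_tree = DecisionTreeClassifier(random_state=0).fit(X, y_simple)
noisy_tree = DecisionTreeClassifier(random_state=0).fit(X, y_noisy)

# tree_.node_count is the number of nodes (splits plus leaves) in a fitted tree.
print("separable labels:", simple_tree.tree_.node_count, "nodes")
print("random labels:   ", noisy_tree.tree_.node_count, "nodes")
```

The separable case needs only a root split and two leaves, while the noisy case keeps splitting until it has memorised the training set. A forest multiplies whichever size applies by roughly `n_estimators`.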

2 Answers

Segfaults are always bugs. If a malloc fails inside RandomForest, the failure should be caught; my best guess is that an uncaught allocation failure is what you are hitting. As a commenter already said, you should report this to the scikit-learn bug tracker. But the malloc is probably failing because of an out-of-memory condition, so reduce your dimensionality, reduce your training data set size, or run on a system with more memory.
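As a rough way to try the "reduce your training data set size" suggestion, the sketch below fits the same 10-tree forest on progressively larger random subsets and prints the total node count, which serves as a proxy for the memory the forest needs. The data is synthetic and `fit_on_subsample` is a hypothetical helper, not part of scikit-learn:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def fit_on_subsample(X, y, n_rows, seed=0):
    """Fit a small forest on a random subset of rows (hypothetical helper)."""
    rng = np.random.RandomState(seed)
    n_rows = min(n_rows, X.shape[0])
    idx = rng.choice(X.shape[0], size=n_rows, replace=False)
    clf = RandomForestClassifier(n_estimators=10, n_jobs=1, random_state=seed)
    clf.fit(X[idx], y[idx])
    # Sum of node counts over all trees: a rough proxy for forest size.
    total_nodes = sum(est.tree_.node_count for est in clf.estimators_)
    return clf, total_nodes


# Synthetic stand-in for SVD-reduced features with 20 labels (an assumption).
data_rng = np.random.RandomState(0)
X = data_rng.rand(2000, 100)
y = data_rng.randint(0, 20, size=2000)

for n_rows in (250, 500, 1000, 2000):
    _, total_nodes = fit_on_subsample(X, y, n_rows)
    print("%5d rows -> %d nodes" % (n_rows, total_nodes))
```

If the fit succeeds at a small subset size and fails as the subset grows, that points to memory pressure rather than to a problem with the data itself.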

answered Jan 23 '14 at 21:01

Try using the 'psutil' library. With it, you can monitor the amount of available memory on your system with the following function:
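A minimal sketch, assuming the function meant is `psutil.virtual_memory()`; the helper name `available_memory_mib` is made up for illustration:

```python
import psutil


def available_memory_mib():
    """Return the memory available for new allocations, in MiB (hypothetical helper)."""
    return psutil.virtual_memory().available / (1024.0 * 1024.0)


mem = psutil.virtual_memory()
print("total:     %.0f MiB" % (mem.total / (1024.0 * 1024.0)))
print("available: %.0f MiB" % available_memory_mib())
print("used:      %.1f %%" % mem.percent)
```

Calling it just before `fit` and again from a second terminal while the fit is running gives a rough picture of how close the process gets to the 4GB limit.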


This will help you tell whether your system is running out of memory or whether the problem is in your code.

answered Sep 7 '16 at 14:09
