Internal string representation in PHP

I'm writing a simple website parser in PHP 5.2.10.
When using the default internal encoding (which is ISO-8859-1), I always get an error at the same function call:

$start = mb_strpos($index, '<a name=gr1>');

Fatal error: Allowed memory size of 50331648 bytes exhausted (tried to allocate 11924760 bytes)

The string $index was 2981190 bytes long in this case, exactly one quarter of what PHP tried to allocate.
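As a quick sanity check of that ratio, here is a minimal sketch. The variable names are made up for illustration; the numbers are copied from the error message and the measured string length.

```php
<?php
// Figures taken from the error message and the measured string length.
$memoryLimit = 50331648;   // allowed memory size, in bytes
$attempted   = 11924760;   // size of the allocation that failed, in bytes
$stringBytes = 2981190;    // length of $index, in bytes

// Both operands are integers and the division is exact, so PHP returns an int.
$ratio     = $attempted / $stringBytes;
$remainder = $attempted % $stringBytes;

// The limit expressed in MiB, for readability.
$limitMiB = $memoryLimit / (1024 * 1024);

var_dump($ratio, $remainder, $limitMiB);
```

A remainder of zero confirms that the failed allocation is an exact multiple of the string length, not just roughly proportional to it.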

Now, if I use

mb_internal_encoding('UTF-8')

the error disappears. Does that mean that PHP uses more memory for single-byte strings than for multibyte ones? How is that possible? Any ideas?

UPD: Memory usage doesn't seem to depend on the encoding: the average memory_get_usage() is almost the same with UTF-8 and ISO-8859-1. I think the problem might be in mb_strpos. In fact, the string $index is in Windows-1251 (Cyrillic), so it contains byte sequences that are not valid UTF-8. This may cause mb_strpos to try to convert the string, or simply to use additional memory for some other purpose. I will try to find the answer in the mb_strpos sources.
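To illustrate why Windows-1251 text is not valid UTF-8, here is a small sketch using mb_check_encoding(). The sample bytes are a made-up stand-in (the word "Привет" in Windows-1251), not the actual page content.

```php
<?php
// Hypothetical sample: "Привет" encoded as Windows-1251, one byte per letter.
$cp1251Bytes = "\xCF\xF0\xE8\xE2\xE5\xF2";

// 0xCF would start a two-byte UTF-8 sequence, but 0xF0 is not a valid
// continuation byte, so the string fails UTF-8 validation.
$validAsUtf8 = mb_check_encoding($cp1251Bytes, 'UTF-8');

// The same bytes are perfectly valid when read as Windows-1251.
$validAsCp1251 = mb_check_encoding($cp1251Bytes, 'Windows-1251');

var_dump($validAsUtf8, $validAsCp1251);
```

A check like this on the real $index would confirm whether the page contains sequences that the UTF-8 code path rejects.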

asked Aug 25 '12 at 21:08

Have you considered upgrading your PHP? Firstly because 5.2 is no longer supported, and secondly because both the 5.3 and 5.4 releases had significant memory usage improvements (particularly 5.3). Not sure if those improvements include mb_strpos(), but it's worth the upgrade in any case.

I think your update is on the right track. A number of things might influence this: mb_detect_order, use of 'auto' or 'pass', to name a few. Using iconv can be a good way to make sure your strings are "sane" and match the detected/set encoding. I would like to profile it and see what it's doing with those 1252 control codes. Oh evil m-dash.

I have updated to 5.3, but the problem did not go away.

I have temporarily solved the problem by using iconv to convert the string to UTF-8 and setting UTF-8 as the internal encoding. I will profile the PHP sources a bit later.
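A minimal sketch of that workaround might look like the following. The sample input is hypothetical (a short Windows-1251 string followed by the anchor tag); the real $index would come from the downloaded page.

```php
<?php
// Hypothetical stand-in for the downloaded page:
// "Привет" in Windows-1251, followed by the anchor being searched for.
$index = "\xCF\xF0\xE8\xE2\xE5\xF2<a name=gr1>";

// Convert the raw bytes to UTF-8 once, right after fetching.
$utf8 = iconv('Windows-1251', 'UTF-8', $index);
if ($utf8 === false) {
    throw new RuntimeException('iconv could not convert the input');
}

// Make the mb_* functions treat strings as UTF-8 by default.
mb_internal_encoding('UTF-8');

// The offset is now counted in characters, not bytes.
$start = mb_strpos($utf8, '<a name=gr1>');

var_dump($start);
```

After the conversion the haystack and the declared encoding agree, so mb_strpos() no longer has to deal with byte sequences that are invalid for the encoding it was told to use.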

1 Answer

Sorry if you've already thought of these potential issues.

The multibyte string functions check UTF-8 input for errors and, if there are invalid characters, return an empty string or false (as in the case of mb_strpos()): http://www.serverphorums.com/read.php?7,552099

Are you checking the result with the === operator to make sure that you're not receiving false instead of 0?
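To show the difference, here is a short sketch with made-up sample data in which the needle sits at the very start of the haystack, so mb_strpos() legitimately returns 0.

```php
<?php
// Hypothetical sample data, not the asker's real page.
$index = '<a name=gr1>first anchor</a>';

// The needle is at the very beginning, so the result is the integer 0.
$start = mb_strpos($index, '<a name=gr1>', 0, 'UTF-8');

// Loose comparison: 0 == false is true, so a match at offset 0 looks like a miss.
$looksMissing = ($start == false);

// Strict comparison: only a real boolean false means "not found".
$reallyMissing = ($start === false);

var_dump($start, $looksMissing, $reallyMissing);
```

With == the two outcomes are indistinguishable, which is why the strict check matters whenever offset 0 is a possible result.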

The mb_strpos() function uses mbfl_strpos(), which makes copies of the strings (needle, haystack) when it has to perform conversions (leading to increases in memory, as you observed): https://github.com/php/php-src/blob/master/ext/mbstring/libmbfl/mbfl/mbfilter.c#L811

So I'm wondering whether the default internal encoding (ISO-8859-1) let everything through, so that the memory limit was hit, whereas the UTF-8 encoding short-circuited on the illegal characters and returned false (which, if you were testing with ==, would make it appear that the function merely didn't find a match).

Worth a shot :)

answered Feb 3 '17 at 18:02

A nice shot! To check whether the result is false or 0, I've written a function similar to assert(), and the check is strict (===). But now I understand why PHP needs 4 times strlen in memory: in fact, it converts both arguments to UTF-8 (and not to mb_internal_encoding()). Thanks for your research and the sources attached! ;) - Dmitry
