I'm writing a simple website parser on PHP 5.2.10.
When using the default internal encoding (which is ISO-8859-1), I always get an error at the same function call:
$start = mb_strpos($index, '<a name=gr1>');
Fatal error: Allowed memory size of 50331648 bytes exhausted (tried to allocate 11924760 bytes)
In this case the string $index was 2981190 bytes long, exactly one quarter of what PHP tried to allocate.
Now, if I use UTF-8 as the internal encoding,
the error disappears. Does that mean that PHP uses more memory for single-byte strings than for multibyte ones? How is that possible? Any ideas?
UPD: Memory usage doesn't seem to depend on encoding: average memory_get_usage() is almost the same using UTF-8 and ISO-8859-1. I think that the problem might be in mb_strpos. In fact, the string $index has Windows-1251 encoding (cyrillic), so it contains symbols that are not valid for UTF-8. This may cause mb_strpos to somehow try to convert or just use the additional memory for some needs. Will try to find the answer in the sources of mb_strpos.
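If the data really is Windows-1251, one way to sidestep the invalid-UTF-8 problem is to either tell mb_strpos the actual encoding of the haystack or normalize it once up front. A minimal sketch (the sample string here is illustrative, not taken from the site being parsed):

```php
<?php
// Illustrative haystack; in the question, $index holds the fetched
// page in Windows-1251 (cyrillic) encoding.
$index = "...<a name=gr1>...";

// Option 1: pass the real encoding, so mbstring interprets the
// bytes as Windows-1251 instead of as (invalid) UTF-8.
$start = mb_strpos($index, '<a name=gr1>', 0, 'Windows-1251');

// Option 2: convert once, then work in UTF-8 throughout.
$utf8  = mb_convert_encoding($index, 'UTF-8', 'Windows-1251');
$start = mb_strpos($utf8, '<a name=gr1>', 0, 'UTF-8');
```

Either way, mbstring is no longer asked to treat Windows-1251 bytes as some other encoding, which is what appears to trigger the extra work.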
asked Aug 25, 2012 at 21:08
Sorry if you've already thought of these potential issues.
The multibyte string functions check UTF-8 input for errors and, if there are invalid characters, return an empty string or false (as in the case of mb_strpos()): http://www.serverphorums.com/read.php?7,552099
Are you checking the result you're getting with the
=== operator, to ensure that you're not receiving
false instead of a match at position 0?
mb_strpos() uses the function
mbfl_strpos(), which makes copies of the strings (needle, haystack) when it has to perform conversions, leading to the increased memory use you observed.
So I'm wondering if the default internal encoding (ISO-8859-1) let everything through and the memory limit was hit, whereas the UTF-8 encoding short-circuited on the illegal characters and returned false (which, if you were testing with
==, would make it appear that the function merely didn't find a match).
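A quick sketch of the strict check (the variable values are illustrative):

```php
<?php
$haystack = 'some page text with <a name=gr1> in it';
$needle   = '<a name=gr1>';

$pos = mb_strpos($haystack, $needle);
if ($pos === false) {
    // Not found, or mbstring rejected the input as invalid.
    echo "no match\n";
} else {
    // A genuine match; note that $pos may legitimately be 0,
    // which a loose `== false` test would misreport as "not found".
    echo "match at $pos\n";
}
```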
Worth a shot :)