Hi, I am parsing HTML with PHP's DOMDocument and I am getting some special characters in my result. How do I filter them out?

foreach ($fdats as $fdat) {
    foreach ($fdat->getElementsByTagName('a') as $mdat) {
        $comb[] = trim($mdat->nodeValue);
    }
}


And the HTML is something like this:

<div class="content1" id="user" style="width: 47%; margin-right: 20px;">
<div class="ad  first_row">
<p class="ad" style="width: 70%;">
<a href="/es/site/users"><img class="dynamic-icon">&nbsp; James</a>

The output is Â James. How do I get rid of the Â?

They are called HTML entities. You can convert them into their true form using the following function:
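
One PHP function that decodes entities is html_entity_decode(). The sketch below assumes that is the function meant here, and uses a sample string taken from the question. It prints the decoded bytes in hex so the non-breaking space is visible.

```php
<?php
// Assumption: html_entity_decode() is the function referred to above.
// The input string is a sample based on the question's HTML.
$raw = '&nbsp; James';

// Decode entities into UTF-8 characters.
$decoded = html_entity_decode($raw, ENT_QUOTES | ENT_HTML5, 'UTF-8');

// &nbsp; becomes U+00A0, which is the two bytes C2 A0 in UTF-8.
echo bin2hex($decoded), "\n"; // c2a0204a616d6573
```

Note that DOMDocument already decodes entities when it parses the markup, so nodeValue contains the U+00A0 character rather than the literal text "&nbsp;".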


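Also, &nbsp; converts to character code 160 (U+00A0), which is not ASCII and is encoded as two bytes (0xC2 0xA0) in UTF-8. This is why it shows up as a weird character when the output is displayed in a single-byte encoding. You may need to use the iconv() function if you want to convert the text to a single-byte encoding and drop the characters that cannot be represented in it.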

$text = iconv("UTF-8", "ISO-8859-1//IGNORE", $text);
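
Here is a small sketch of what that iconv() call does to the string from the question. The sample input is an assumption. Note that the non-breaking space is not dropped, because ISO-8859-1 has its own non-breaking space at byte 0xA0, so it still needs to be replaced before trim() can remove it.

```php
<?php
// Sample input (assumed): U+00A0 in UTF-8, a space, then "James".
$text = "\xC2\xA0 James";

// Convert UTF-8 to ISO-8859-1; the two-byte C2 A0 becomes the single byte A0.
$latin1 = iconv("UTF-8", "ISO-8859-1//IGNORE", $text);
echo bin2hex($latin1), "\n"; // a0204a616d6573

// Replace the Latin-1 non-breaking space with a regular space, then trim.
$clean = trim(str_replace("\xA0", ' ', $latin1));
echo $clean, "\n"; // James
```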


I think the Â is a UTF-8 artifact. The &nbsp; becomes the Unicode character U+00A0 when extracted via DOM methods.
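
To see this for yourself, you can dump the bytes of nodeValue. This is a minimal sketch using a trimmed-down version of the HTML from the question (the markup passed to loadHTML() is an assumption). It also shows why trim() alone does not help: by default it only strips ASCII whitespace.

```php
<?php
// Suppress libxml warnings about the HTML fragment.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML('<p class="ad"><a href="/es/site/users"><img class="dynamic-icon">&nbsp; James</a></p>');

// Text content of the first <a> element, always returned as UTF-8.
$value = $doc->getElementsByTagName('a')->item(0)->nodeValue;

// The string starts with C2 A0, the UTF-8 encoding of U+00A0.
echo bin2hex($value), "\n"; // c2a0204a616d6573

// trim() leaves it alone because C2 A0 is not ASCII whitespace.
var_dump(trim($value) === $value); // bool(true)
```

Shown in a page or terminal that assumes ISO-8859-1, the byte C2 is rendered as Â and A0 as a non-breaking space.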

You can probably use utf8_decode() before trim() to get rid of that. That should convert it into a regular space. Hmm, maybe not: Latin-1 has its own nbsp at 0xA0. So better use a regex; /\s/u (lowercase u, the UTF-8 modifier) might cover it.
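
A sketch of the regex approach, in case it helps. The helper name trim_unicode() is made up for this example. It lists U+00A0 explicitly next to \s so that it does not depend on whether \s matches the non-breaking space in your PCRE build. It also avoids utf8_decode(), which is deprecated as of PHP 8.2.

```php
<?php
// Hypothetical helper: trims ASCII whitespace and U+00A0 from both ends.
function trim_unicode(string $text): string
{
    // The /u modifier makes PCRE treat pattern and subject as UTF-8.
    $result = preg_replace('/^[\s\x{00A0}]+|[\s\x{00A0}]+$/u', '', $text);

    // preg_replace() returns null on error (e.g. invalid UTF-8 input);
    // fall back to the original string in that case.
    return $result ?? $text;
}

echo trim_unicode("\xC2\xA0 James \xC2\xA0"), "\n"; // James
```

In the loop from the question, this would replace the trim() call: $comb[] = trim_unicode($mdat->nodeValue);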

