perl: convert a string to UTF-8 to decode JSON
Viewed 17,250 times
8
I'm crawling a website and collecting information from its JSON. The results are saved in a hash. But some of the pages give me a "malformed UTF-8 character in JSON string" error. I noticed that the last letter in "cafe" produces the error. I think it is because of a mix of character encodings. So now I'm looking for a way to convert all kinds of characters to UTF-8 (I hope there is a perfect way like that). I tried utf8::all, but it just doesn't work (maybe I didn't do it right). I'm a noob. Please help, thanks.
UPDATE
Well, after reading the article "Know the difference between character strings and UTF-8 strings" posted by brian d foy, I solved the problem with this code:
use utf8;
use Encode qw(encode_utf8);
use JSON;
my $json_data = qq( { "cat" : "Büster" } );
$json_data = encode_utf8( $json_data );
my $perl_hash = decode_json( $json_data );
Hope this helps someone else.
1 Answer
24
decode_json expects the JSON to have been encoded using UTF-8.
While your source file is encoded using UTF-8, you have Perl decode it by using use utf8; (as you should). This means your string contains Unicode characters, not the UTF-8 bytes that represent those characters.
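To make the distinction between characters and UTF-8 bytes concrete, here is a minimal sketch that compares the length of the same string before and after encoding. The variable names are illustrative only and do not come from the question.

```perl
use strict;
use warnings;
use utf8;                          # source literals are decoded into characters
use Encode qw( encode_utf8 );

my $chars = "Büster";              # 6 characters; "ü" is the single code point U+00FC
my $bytes = encode_utf8($chars);   # UTF-8 bytes; "ü" becomes the two bytes 0xC3 0xBC

my $char_count = length $chars;
my $byte_count = length $bytes;

print "characters: $char_count\n";   # characters: 6
print "bytes:      $byte_count\n";   # bytes:      7
```

The two strings look the same when written in an editor, but only the second one is what decode_json is prepared to read.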
As you've shown, you could encode the string before passing it to decode_json.
use utf8;
use Encode qw( encode_utf8 );
use JSON qw( decode_json );
my $data_json = qq( { "cat" : "Büster" } );
my $data = JSON->new->utf8(1)->decode(encode_utf8($data_json));
-or-
my $data = JSON->new->utf8->decode(encode_utf8($data_json));
-or-
my $data = decode_json(encode_utf8($data_json));
But you could simply tell JSON that the string is already decoded.
use utf8;
use JSON qw( from_json );
my $data_json = qq( { "cat" : "Büster" } );
my $data = JSON->new->utf8(0)->decode($data_json);
-or-
my $data = JSON->new->decode($data_json);
-or-
my $data = from_json($data_json);
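The two approaches above can be put side by side in one sketch, together with the mismatch that triggers the error from the question. This version uses the core JSON::PP module so that it runs on a stock Perl (an assumption made for portability; the JSON module exposes the same utf8 and decode methods), and the variable names are illustrative.

```perl
use strict;
use warnings;
use utf8;
use Encode qw( encode_utf8 );
use JSON::PP;

my $chars = qq( { "cat" : "Büster" } );   # decoded text (Unicode characters)
my $bytes = encode_utf8($chars);          # the same text as UTF-8 bytes

# utf8(1): the decoder expects UTF-8 bytes
my $from_bytes = JSON::PP->new->utf8(1)->decode($bytes);

# utf8(0): the decoder expects already-decoded characters
my $from_chars = JSON::PP->new->utf8(0)->decode($chars);

# Both routes produce the same Perl string
my $same = $from_bytes->{cat} eq $from_chars->{cat};
print "same value: ", ( $same ? "yes" : "no" ), "\n";

# Mismatch: characters handed to a decoder that expects UTF-8 bytes
my $mismatch_ok = eval { JSON::PP->new->utf8(1)->decode($chars); 1 };
print "utf8(1) rejected the character string\n" unless $mismatch_ok;
```

Whichever mode is chosen, the point is to pick one and feed it the matching kind of input.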
Answered Sep 18 '21 at 14:09
"But you could simply tell JSON that the string is already decoded." Do you mean that the input of the decode function is already encoded to utf-8? - ivan wang
The question makes no sense. There's no "is", there is only "must be". Whether the input to $json->decode must be UTF-8 encoded or must not be encoded depends on whether you are using JSON->new->utf8(1)->decode (also known as decode_json) (input must be UTF-8) or JSON->new->utf8(0)->decode (input must be Unicode chars). - Ikegami
Also, you might look at whatever your web user-agent is doing and tell it not to decode the body. That should give you the raw octets so you don't have to encode what it decoded. - brian d foy
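A rough sketch of that suggestion follows, assuming the crawler uses LWP::UserAgent (the question does not say which client it uses). With LWP, a request returns an HTTP::Response object whose content method gives the raw octets, while decoded_content may apply charset decoding. To keep the sketch runnable without a network connection, the response is built by hand here; the body and header values are made up for illustration, and JSON::PP stands in for JSON.

```perl
use strict;
use warnings;
use utf8;
use Encode qw( encode_utf8 );
use HTTP::Response;
use JSON::PP qw( decode_json );

# Hand-built stand-in for the object a real request would return (hypothetical data)
my $body_octets = encode_utf8( qq( { "cat" : "Büster" } ) );
my $response = HTTP::Response->new(
    200, 'OK',
    [ 'Content-Type' => 'application/json; charset=UTF-8' ],
    $body_octets,
);

# content() returns the body as raw octets, which is what decode_json expects
my $data = decode_json( $response->content );

my $name_length = length $data->{cat};
print "name length: $name_length\n";   # name length: 6
```

Reading the raw body this way avoids the round trip of decoding the response and then encoding it again just for decode_json.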