Byte array to Unicode UTF-8 conversion

I have a file saved as UTF-8, and I'm reading it like this:

ReadFile(hFile, pContents, pFile->nFileSize, &dwRead, NULL);

(pContents is a BYTE* of size nFileSize)

It's just a small file of 100 bytes or so. It contains text that I want to read into memory in wchar_t* format, so I can set the text of edit and static controls with the Unicode text.

How can I convert the bytes to UTF-8?

Edit: I don't want to use fstream or wfstream.

asked Jan 8, 2011 at 17:01

The bytes are already UTF-8 if you are reading UTF-8 encoded text. Neither C++ nor C cares about the encoding; they just see an array of bytes. What exactly are you trying to do? -

I thought UTF-8 was multibyte, meaning it sometimes needs 2 bytes to complete a character; mine is just being read into a byte array. -

Right. So 2, 3, or 4 bytes from the array may together determine a character. This is UTF-8. You can't "convert" that into UTF-8 because it's already UTF-8. You could convert it to UTF-32 for processing characters, but this is rarely useful in practice unless you're doing high-level text processing. Just leave it as UTF-8 unless you know a reason that won't work. -

There are multiple Unicode encodings. UTF-8 uses anything from 8 bits to 32 bits per code point, UTF-16 uses one or two 16-bit "code units" per code point, and UTF-32 uses 32 bits for every code point. The only way you could be certain that you will not run into "unfinished" characters would be to convert your data to UTF-32 and store each character using 4 bytes. -
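The conversion to UTF-32 that these comments mention could be sketched roughly as follows. The name decode_utf8 is a hypothetical helper, not something from the thread or from any library, and the sketch assumes that replacing malformed or truncated sequences with U+FFFD is acceptable. It does not reject overlong forms or encoded surrogates, so treat it as an illustration rather than a validating decoder.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not from the thread): decodes UTF-8 bytes into
// UTF-32 code points. Malformed or truncated sequences become U+FFFD.
// Overlong forms and encoded surrogates are NOT rejected in this sketch.
std::u32string decode_utf8(const std::string& bytes) {
    std::u32string out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        unsigned char lead = static_cast<unsigned char>(bytes[i]);
        std::size_t len = 0;
        char32_t cp = 0;
        // The lead byte tells us how many bytes the sequence occupies.
        if (lead < 0x80)                { len = 1; cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; }
        else { out.push_back(0xFFFD); ++i; continue; }  // stray or invalid byte

        bool ok = (i + len <= bytes.size());
        for (std::size_t k = 1; ok && k < len; ++k) {
            unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80) ok = false;      // not a continuation byte
            else cp = (cp << 6) | (c & 0x3F);        // append 6 payload bits
        }
        if (!ok) { out.push_back(0xFFFD); ++i; continue; }
        out.push_back(cp);
        i += len;
    }
    return out;
}
```

With this, every element of the result is one whole code point, at the cost of 4 bytes per element.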

You only need the Unicode code points to look up character properties like casing. If you have a program that takes action on ASCII characters while passing non-ASCII bytes around as-is (e.g., writing a CSV parser where only ',', '"', and '\n' have syntactic significance), then you can just leave your strings as UTF-8. That ASCII-compatibility was why UTF-8 was invented in the first place. -
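To illustrate the ASCII-compatibility point: every byte of a multi-byte UTF-8 sequence is 0x80 or higher, so a byte-level search for an ASCII delimiter cannot land inside a character. A minimal sketch, assuming a hypothetical helper named split_fields that splits on commas only and ignores quoting and newlines (a real CSV parser would need both):

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not from the thread): splits one line on ASCII ','.
// It works on raw bytes; UTF-8 text inside the fields passes through
// untouched because no byte of a multi-byte sequence equals ','.
// Quoted fields are NOT handled in this sketch.
std::vector<std::string> split_fields(const std::string& line) {
    std::vector<std::string> fields;
    std::size_t start = 0;
    while (true) {
        std::size_t comma = line.find(',', start);
        if (comma == std::string::npos) {
            fields.push_back(line.substr(start));  // last field, may be empty
            break;
        }
        fields.push_back(line.substr(start, comma - start));
        start = comma + 1;
    }
    return fields;
}
```

The fields come back as UTF-8 byte strings, exactly as they appeared in the input.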

3 Answers

answered Jan 8, 2011 at 20:01

If the file is in UTF-8 and you read it into an array, then it is still in UTF-8 format and you don't need to do anything.

answered Jan 9, 2011 at 02:01

While this is correct in the technical sense of the word and doesn't deserve the downvote, it is nonetheless a tongue-in-cheek answer. Yes, those bytes still represent a UTF-8 string, but they cannot be manipulated as such. You cannot even ask the question "How many characters do I have?", much less ask it to "Remove the last character". - v010dya

@Volodya: It's not meant as tongue in cheek, and it is the only correct answer provided. The other two answers are incorrect, as they convert a UTF-8 array into a UTF-16 array (the OP specifically requested a UTF-8 array; see the question). - Martin York

In your comment you point out a characteristic weakness of variable-width character formats that has nothing to do with this question. Just like the above conversion functions, there are equivalent functions to find the string length of an MBC string. You will also note that you cannot find the string length directly for UTF-16 either, as it is also a multi-byte character format (you need to know where the surrogate pairs are and count them differently). - Martin York
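The length point can be made concrete with a small sketch. Both function names below are hypothetical (not from the thread), and both assume well-formed input. They count code points, which is not always the same as what a user perceives as one character (combining marks are counted separately).

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts code points in well-formed UTF-8 by
// skipping continuation bytes, which always look like 10xxxxxx.
std::size_t utf8_length(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) ++n;
    }
    return n;
}

// Hypothetical helper: counts code points in well-formed UTF-16 by
// skipping low surrogates (0xDC00..0xDFFF), so each surrogate pair
// is counted once, via its high surrogate.
std::size_t utf16_length(const std::u16string& s) {
    std::size_t n = 0;
    for (char16_t u : s) {
        if (u < 0xDC00 || u > 0xDFFF) ++n;
    }
    return n;
}
```

In both encodings the number of storage units (bytes or 16-bit units) can exceed the number of code points, so neither size() call gives the character count directly.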

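// wchar_t (UTF-16) -> UTF-8. With -1 the input must be null-terminated; the sixth
// argument is the capacity of multiByteBuf in bytes, not the input length.
int res2 = WideCharToMultiByte(CP_UTF8, 0, tempBuf.c_str(), -1,
                               multiByteBuf, lengthOfInputString, NULL, NULL);
// UTF-8 -> wchar_t (UTF-16). The last argument is the capacity of wcharBuf in wchar_t units.
int res = MultiByteToWideChar(CP_UTF8, 0, buf, -1, wcharBuf, lengthOfInputString);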

answered Aug 17, 2012 at 03:08
