Initializing strange characters in Java

I'm trying to use some fancy characters in my Java code.

    Character c = new Character('

asked Jan 08 '11, 17:01

I'm not sure if it will affect what you're doing, but note that Java 6 was written to Unicode 4. Some methods will treat U+1F000 as an undefined code point. -

2 Answers

Non-BMP (basic multilingual plane) characters can't be represented as a Java char (or thus a Character), because a char is only a 16-bit unsigned integer. Non-BMP characters are represented using surrogate pairs in Java.
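To make the surrogate-pair point concrete, here is a minimal sketch using U+1F000, the code point mentioned in the comments on the question. The class name `SurrogatePairDemo` and the helper `toUtf16` are made up for illustration; the `Character` methods themselves are standard and have been available since Java 5.

```java
class SurrogatePairDemo {
    // Code point discussed in this thread; it lies outside the BMP.
    static final int CODE_POINT = 0x1F000;

    // Converts a code point to its UTF-16 form (one or two chars).
    static char[] toUtf16(int codePoint) {
        return Character.toChars(codePoint);
    }

    public static void main(String[] args) {
        char[] units = toUtf16(CODE_POINT);
        // A supplementary code point needs two 16-bit chars.
        System.out.println(Character.isSupplementaryCodePoint(CODE_POINT)); // true
        System.out.println(Character.charCount(CODE_POINT));                // 2
        // High surrogate followed by low surrogate.
        System.out.printf("%04x %04x%n", (int) units[0], (int) units[1]);   // d83c dc00
        // Combining the pair gives back the original code point.
        System.out.println(Character.toCodePoint(units[0], units[1]) == CODE_POINT); // true
    }
}
```

So a single `char` can only ever hold one half of the pair, which is why `Character` cannot hold this value.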

You'll need to use a string... but even then I suspect you'll need to provide the surrogate pair of characters explicitly. C# has a \U escape sequence which is the equivalent of \u but for 32-bit values, but Java doesn't have anything like that :(

Here's an alternative approach which lets you use the Unicode value directly in your code:

String x = new String(new int[] { 0x1f000 }, 0, 1);

It's ugly, but it works...

answered Jan 08 '11, 20:01

@Jon - you are correct, Java doesn't support \U: java.sun.com/docs/books/jls/third_edition/html/lexical.html#3.1 - McDowell

@McDowell: Thanks. Will remove the element of doubt from the answer :) - Jon Skeet

Ehm. Java String literals do allow for \u escaping, so this character would simply be String s = "\u1f000"; What I believe is really going on here is that, since that particular character is in the Unicode 6.0 code page, it's not supported by Java, since Unicode 6.0 support will be added in JDK7. - Esko

@Esko: no, it has nothing to do with Unicode 6.0. Java started showing its shortsighted 16-bit char SNAFU as soon as Unicode 3.1 came out. The \u construct was as poorly thought out as char: it needs exactly four hex digits following '\u' and that's it. Try to encode a musical note from outside the BMP using \uxxxx: they have existed since Unicode 3.1 and you cannot encode them using \uxxxx. To me it's not at all related to Unicode 6.0. The issue is much older than that. - SyntaxT3rr0r

@Esko - some languages use the upper case escape \U sequences for code points outside the BMP - see here for examples: illegalargumentexception.blogspot.com/2010/04/… Java only supports the lower case \u followed by four hexadecimal digits. - McDowell
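For anyone who wants to check the four-hex-digit rule discussed in these comments, a small sketch follows. The class name `UnicodeEscapeDemo` and the constant `MISPARSED` are hypothetical names chosen for this example. The literal compiles, but it is read as U+1F00 followed by the ordinary character '0', not as U+1F000.

```java
class UnicodeEscapeDemo {
    // The compiler consumes exactly four hex digits after the backslash-u,
    // so the trailing '0' is left over as an ordinary character.
    static final String MISPARSED = "\u1f000";

    public static void main(String[] args) {
        System.out.println(MISPARSED.length());                       // 2
        System.out.println(Integer.toHexString(MISPARSED.charAt(0))); // 1f00
        System.out.println(MISPARSED.charAt(1));                      // 0
        // The string does not contain the intended code point.
        System.out.println(MISPARSED.codePointAt(0) == 0x1F000);      // false
    }
}
```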

Just an alternative, but you can also use:

String str = new String(Character.toChars(0x1F000) );

answered Jan 08 '11, 20:01

you could also use String str = "\ud83c\udc00";, but this obfuscates the code point. - McDowell

@mcd You're right of course, but I'd prefer to let the Character class do the heavy lifting of translating into surrogate pairs :) - robert_x44
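As a quick sanity check, the three ways of building the string mentioned in this thread can be compared side by side. This is only a sketch; the class and method names (`SupplementaryStringDemo`, `fromCodePointArray`, `fromToChars`, `fromEscapes`) are invented for the example, while the constructors and `Character.toChars` are the ones quoted in the answers and comments.

```java
class SupplementaryStringDemo {
    // Code point constructor: String(int[] codePoints, int offset, int count).
    static String fromCodePointArray() {
        return new String(new int[] { 0x1F000 }, 0, 1);
    }

    // Lets Character compute the surrogate pair.
    static String fromToChars() {
        return new String(Character.toChars(0x1F000));
    }

    // Surrogate pair written out by hand; hides the code point value.
    static String fromEscapes() {
        return "\ud83c\udc00";
    }

    public static void main(String[] args) {
        String a = fromCodePointArray();
        System.out.println(a.equals(fromToChars()) && a.equals(fromEscapes())); // true
        System.out.println(a.length());                            // 2 (UTF-16 units)
        System.out.println(a.codePointCount(0, a.length()));       // 1 (code points)
        System.out.println(Integer.toHexString(a.codePointAt(0))); // 1f000
    }
}
```

Note that `length()` counts chars, not code points, so use `codePointCount` when you need the number of actual characters.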
