¿Cuál es el algoritmo correcto para determinar el número de caracteres percibidos por el usuario?

I have the task of counting the number of perceived characters in an input. The input is a Grupo of ints (we can think of it as an int[]) which represents Unicode code points.

java.text.BreakIterator.getCharacterInstance() is not allowed. (I mean their formula is allowed and is what I wanted, but weaving through their source code and state tables got me nowhere >.<)

I was wondering what's the correct algorithm to count the number of grapheme-clusters given some code points?

Inicialmente, I'd thought that all I have to do is to combine all occurences of:

  1. U+0300 – U+036F (combining diacritical marks)

  2. U+1DC0 – U+1DFF (combining diacritical marks supplement)

  3. U+20D0 – U+20FF (combining diacritical marks for symbols)

  4. U+FE20 - U+FE2F (combining half marks)

into the previous non-diacritic-mark.

However I've dio cuenta that prior to that operation, I have to first remove all non-characters as well.

Esto incluye:

  1. U+FDD0 - U+FDEF

  2. The last two code points of every plane

But there seems to be more things to do. Unicode.org states we need to include U+200C (zero-width non joiner) and U+200D (zero width joiner) as part of the set of continuing characters (Fuente).

Besides that, it talks about a couple more things but the entire topic is treated in an abstract way. For example, what are the code point ranges for spacing combining marks, hangul jamo characters that forms hangul syllables?

Does anyone know the correct algorithm to count the number of grapheme-clusters given an int[] of code points?

preguntado el 01 de febrero de 12 a las 14:02

In esta respuesta I gave a reference to the Unicode algorithm for splitting text into grapheme clusters, and demonstrated how to implement some of it in Python. -

@GarethRees I think that there's many important steps missing in that answer to do a correct user-perceived-character check (correct me if I'm wrong). -

Can you use the ICU BreakIterator? icu-project.org/apiref/icu4j/com/ibm/icu/text/… -

@JoniSalonen No nivel más alto functions which masks the implementation details, that would defeat the purpose, unless of course they could show their algorithm-breakdown. -

Don't forget the "Regional Indicator Symbols", U+1F1E6 to U+1F1FF, those are supposed to be combined in pairs to make flag emojis (e.g. "R" + "U" for a Russian flag), but java.text.BreakIterator.getCharacterInstance() at least returns the two code points separately. -

1 Respuestas

There's not a single canonical method appropriate to all uses, but a good starting point is the Unicode Grapheme Cluster Boundary algorithm on the Unicode.org page you link to. Basically, Unicode provides a database of each code point's grapheme break property, and then describes an algorithm to decide if a grapheme break is allowed between two code points based on their assigned grapheme break properties.

Here's part of an implementation (in C++) I played around with a while ago:

bool BoundaryAllowed(char32_t cp, char32_t cp2) {
  // lbp: left break property; rbp: right break property
  auto lbp = get_property_for_codepoint(cp),
       rbp = get_property_for_codepoint(cp2);

  // Do not break between a CR and LF. Otherwise, break before and after
  // controls.
  if ((CR == lbp && LF == rbp)) {
    // The Unicode grapheme boundary algorithm does not handle LFCR new lines
    return false;

  if (Control == lbp || CR == lbp || LF == lbp || Control == rbp || CR == rbp ||
      LF == rbp) {
    return true;

  // Do not break Hangul syllable sequences.
  if ((L == lbp && (L == rbp || V == rbp || LV == rbp || LVT == rbp)) ||
      ((LV == lbp || V == lbp) && (V == rbp || T == rbp)) ||
      ((LVT == lbp || T == lbp) && (T == rbp))) {
    return false;

  // Do not break before extending characters.
  if (Extend == rbp) {
    return false;

  // Do not break before SpacingMarks, or after Prepend characters.
  if (Prepend == lbp || SpacingMark == rbp) {
    return false;

  return true; // Otherwise, break everywhere.

In order to obtain the ranges for the different types of codepoints you'll just have to look at the Unicode Character Database. The file with the grapheme break properties, which describes them in terms of the ranges, is about 1200 lines long: http://www.unicode.org/Public/6.1.0/ucd/auxiliary/

I'm not really sure how much value there is in ignoring non-character code points, but if your use requires it then you'll add that in to your implementation.

Respondido 13 ago 13, 20:08

Do you understand why is it that U+0599 is listed in Mark/NonSpace Combining category however isn't one of the combining characters? Basically when determining grapheme clusters, should we use the data from combining character or for break properties, or both? - Pacerier

I think the combining character ranges you listed don't include all combining characters for all languages. It looks like they're only ones that are pretty general and apply across languages while U+0599 is specific to Hebrew. I would use the grapheme break properties since it should be a comprehensive list, while if you only look at Unicode blocks whose names includes "COMBINING" it looks like you'll miss ~90% of them. - bames53

No es la respuesta que estás buscando? Examinar otras preguntas etiquetadas or haz tu propia pregunta.