La optimización del código Java conduce a inexactitudes y errores numéricos

I'm trying to implement a version of the Fuzzy C-Means algorithm in Java and I'm trying to do some optimization by computing just once everything that can be computed just once.

This is an iterative algorithm and regarding the updating of a matrix, the pixels x clusters membership matrix U (the sum of the values in a row must be 1.0), this is the update rule I want to optimize:

texto alternativo

where the x are the element of a matrix X (pixels x features) and v belongs to the matrix V (clusters x features). Y m is a parameter that ranges from 1.1 a infinity y c is the number of clusters. The distance used is the euclidean norm.

If I had to implement this formula in a banal way I'd do:

    for(int i = 0; i < X.length; i++)
    {
        int count = 0;
        for(int j = 0; j < V.length; j++)
        {
                double num = D[i][j];
                double sumTerms = 0;
                for(int k = 0; k < V.length; k++)
                {
                     double thisDistance = D[i][k];
                     sumTerms += Math.pow(num / thisDistance, (1.0 / (m - 1.0)));
                }
                U[i][j] = (float) (1f / sumTerms);
         }
     }

In this way some optimization is already done, I precomputed all the possible squared distances between X y V and stored them in a matrix D but that is not enough, since I'm cycling througn the elements of V two times resulting in two nested loops. Looking at the formula the numerator of the fraction is independent of the sum so I can compute numerator and denominator independently and the denominator can be computed just once for each pixel. So I came to a solution like this:

    int nClusters = V.length;
    double exp = (1.0 / (m - 1.0));
    for(int i = 0; i < X.length; i++)
    {
        int count = 0;
        for(int j = 0; j < nClusters; j++)
        {
             double distance = D[i][j];
             double denominator = D[i][nClusters];
             double numerator = Math.pow(distance, exp);
             U[i][j] = (float) (1f / (numerator * denominator));
        }
     }

Where I precomputed the denominator into an additional column of the matrix D while I was computing the distances:

    for (int i = 0; i < X.length; i++)
    {
        for (int j = 0; j < V.length; j++)
        {
            double sum = 0;
            for (int k = 0; k < nDims; k++)
            {
                final double d = X[i][k] - V[j][k];
                sum += d * d;
            }

            D[i][j] = sum;
            D[i][B.length] += Math.pow(1 / D[i][j], exp);
        }
    }

By doing so I encounter numerical differences between the 'banal' computation and the second one that leads to different numerical value in U (not in the first iterates but soon enough). I guess that the problem is that exponentiate very small numbers to high values (the elements of U can range from 0.0 to 1.0 and exp , Para m = 1.1, es 10) leads to very small values, whereas by dividing the numerator and the denominator and ENTONCES exponentiating the result seems to be better numerically. The problem is it involves much more operations.

ACTUALIZACIÓN

Some values I get at ITERATION 0:

This is the first row of the matrix D not optimized:

384.6632 44482.727 17379.088 1245.4205

This is the first row of the matrix D the optimized way (note that the last value is the precomputed denominator):

384.6657 44482.7215 17379.0847 1245.4225 1.4098E-26

This is the first row of the U not optimized:

0.99999213 2.3382613E-21 2.8218658E-17 7.900302E-6

This is the first row of the U optimized:

0.9999921 2.338395E-21 2.822035E-17 7.900674E-6

ITERATION 1:

This is the first row of the matrix D not optimized:

414.3861 44469.39 17300.092 1197.7633

This is the first row of the matrix D the optimized way (note that the last value is the precomputed denominator):

414.3880 44469.38 17300.090 1197.7657 2.0796E-26

This is the first row of the U not optimized:

0.99997544 4.9366603E-21 6.216704E-17 2.4565863E-5

This is the first row of the U optimized:

0.3220644 1.5900239E-21 2.0023086E-17 7.912171E-6

The last set of values shows that they are very different due to a propagated error (I still hope I'm doing some mistake) and even the constraint that the sum of those values must be 1.0 is violated.

Am I doing something wrong? Is there a possible solution to get both the code optimized and numerically stable? Any suggestion or criticism will be appreciated.

preguntado el 09 de enero de 11 a las 09:01

Quick comment before looking at this, your formula has xj-vi and xj-vk, but your banal algorithm uses D[i][j] and D[i][k] instead. Am I missing something or should the second be D[j][k]? -

You are correct, they are different but it does not influence the result. I will use a different formula to be more precise, thank you for pointing it out -

Not trying to be annoying, just trying to get make sure I understand; but along the previous comment's line. - The above formula is squaring the sums; the banal code does not seem to be doing so; the optimized code seems to square the denominator not the numerator. The optimized code is using V[j][k] instead of V[i][k]. I'm guessing c in your formula represents the size of V; might also be good to range from k=0 instead of k=1 just for clarity. What is B and B.length and why are you using +=? -

Numerical differences? do you have example values? Reordering floating point operations almost always causes "small" differences. -

I'm seeing the same things that aepryus is seing, but ultimately your question is about the math, not about the algorithm. My first thought is that your running up against the ISO Decimal Math problem. Check it with java.util.BigDecimal -- it'll kill your performance but you should see numeric stability -- and if it does you're going to have to start weighing tradeoffs. -

1 Respuestas

This problem has nothing to do with floating point stability.

You are getting wrong denominator values on the second and following iterations because you forget to clear its cell before accumulating the sum.

The right denominator for iteration 1 is 6.697905e-27, that is almost 2.0796E-26 - 1.4098E-26.

Respondido el 09 de enero de 11 a las 21:01

I feel so dumb and stupid now. I should have arrived at your conclusion. I thank you for spending your time on my stupidity's work - rano

No es la respuesta que estás buscando? Examinar otras preguntas etiquetadas or haz tu propia pregunta.