Looking for aberrations in time-based data [closed]

Looking at IO latency data for a storage array. When a drive is about to fail, one indication is an increase in the IO operations 'time to complete'. The array kindly provides this data in this format:

     Time                           Disk Channels
    seconds   A      B      C      D      E      F      G      H      P      S
      0.017 5e98bd 64008b 62a559 68eb45 676ccf 5d3a86 46242b 62dd2e 6976c9 6da51f
      0.033 1e821c 1be769 1c372a 185134 19a2c2 21802c 2fa2ba 1d91c4 17b3ca 14cea6
      0.050  6638e  3a93b  4b19f  258aa  28b64  4d3ae  d92dc  32899  26a5b  1290d
      0.067   2df3   1c17   1f1b   180f   1291   1f05   5201   15f4   1856   10d8
      0.083    365    293    2b9    296    269    291    3c4    26f    2ae    25d
      0.100     ce     ae     94     aa     92     86     ce     81     9f     91
    ...

(time iterations go up to 2.00 seconds, counts are in hex).

The left column is the time the IO completes in, and the other columns are counts of IOs against a given spindle that completed in under that time.
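
For reference, a minimal sketch of how that table could be parsed, assuming exactly the layout shown above (a `seconds` column followed by one hex count per channel); `parse_histogram` and the returned `times`/`counts` names are just illustrative:

    def parse_histogram(text):
        lines = [ln for ln in text.strip().splitlines() if ln.strip()]
        channels = lines[1].split()[1:]              # header row: 'seconds A B ... S'
        times, counts = [], {ch: [] for ch in channels}
        for row in lines[2:]:
            fields = row.split()
            if len(fields) != len(channels) + 1:
                continue                             # skip '...' or other noise
            times.append(float(fields[0]))           # bucket time in seconds
            for ch, value in zip(channels, fields[1:]):
                counts[ch].append(int(value, 16))    # counts are given in hex
        return times, counts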

When a drive is nearing failure, the 'tail' for that drive gets noticeably 'wider'... where most drives will have only a small handful of IOs taking longer than 0.2 seconds, a failing drive can have lots of IOs over 0.2 seconds. Example:

    Time Disk channels
    seconds    A      B      C      D      E      F      G      H      P      S
    ...
    0.200      4    52d      2      7      3      2      1      6      1      8
    0.217      2    2a6      0      1      0      0      1      4      0      1
    0.233      0    1a1      0      1      0      0      0      1      1      0
    0.250      0     cb      0      1      0      0      1      1      0      1
    0.267      0     73      0      0      0      0      0      0      0      0
    0.283      0     44      0      0      0      0      0      0      0      0
    0.300      0     2d      0      0      0      0      0      0      0      0
    ...

I could just look for more than 10 IOs over 0.2 seconds, but I was looking for a mathematical model that could identify the failures more precisely. My first thought was to calculate the variance of each column; any set of drives with too broad a range of variances would flag the offender. However, this falsely flags drives that are behaving normally:

    min variance is 0.0000437, max is 0.0001250.  <== a good set of drives
    min variance is 0.0000758, max is 0.0000939.  <== a set with one bad drive.
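
(For concreteness, this is the column-variance calculation I mean, treating each column as a frequency distribution of completion times; the helper names and the `times`/`counts` structure from the parsing sketch above are my own.)

    def weighted_variance(times, freq):
        # Variance of completion time, weighting each time bucket by its IO count.
        total = sum(freq)
        if total == 0:
            return 0.0
        mean = sum(t * f for t, f in zip(times, freq)) / total
        return sum(f * (t - mean) ** 2 for t, f in zip(times, freq)) / total

    def column_variances(times, counts):
        return {ch: weighted_variance(times, freq) for ch, freq in counts.items()}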

Any other ideas?

(And should this be on math.stackexchange.com rather than stackoverflow?)

asked 3 May '12 at 16:05

This seems like a homework question, albeit an interesting one. +1 -

With regards to the footnote, this might actually be more appropriate for stats.stackexchange.com -

It's a work question, rather than a homework question. ;-) All the code works... it just doesn't do what I want. -

Stack Ethics question: can I just copy/paste the question to stats? or is that uncool? -

1 Answer

This is just a suggestion, without any mathematical rigour, but could you not cut your dataset down to just the > 0.2 second operations? Then, using that dataset, calculate the total of the > 0.2 second counts for each drive and work out each drive's proportion of the overall total. If you then compare those proportions to each other (and to the total) you should be able to identify failing drives. For example, if just one drive is failing, its ratio to the other drives should be very high, and its value would be just slightly less than 100% (based on your sample data above). Similarly, if two drives were failing, the other drives' proportions should be very small and the two failing drives would each have proportions of just under 50%. Comparing the ratios of these proportions with each other, and each proportion to the whole, should give you a 'dirty' but simple way to identify failure candidates. A sketch of this follows.
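
Something like the following, assuming a 0.2 second cutoff and the `times`/`counts` structure from the parsing sketch in the question (all names here are illustrative):

    def slow_io_proportions(times, counts, cutoff=0.2):
        # Per-drive count of IOs in buckets at or above the cutoff...
        slow = {ch: sum(f for t, f in zip(times, freq) if t >= cutoff)
                for ch, freq in counts.items()}
        total = sum(slow.values())
        # ...expressed as each drive's share of all the slow IOs.
        return {ch: (n / total if total else 0.0) for ch, n in slow.items()}

A drive whose share dwarfs the others (like B in your sample, at nearly 100% of the slow IOs) would be the failure candidate.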

If you are looking for more statistical rigour, it might be worth taking a look at Kruskal-Wallis http://en.wikipedia.org/wiki/Kruskal%E2%80%93Wallis_one-way_analysis_of_variance which is used to test whether a number of samples originate from the same statistical distribution. The reason I mention this is that each disk sample is clearly not normally distributed, and normality is not required for Kruskal-Wallis. It may not be suitable, but it could be a useful starting point for researching the correct statistical test for your data, or until you find a statistics expert.
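
A rough sketch of how that test could be applied with SciPy, assuming each column's histogram is expanded back into individual completion-time samples (one value per counted IO); for the very large counts in the low buckets you would probably want to subsample first:

    from scipy.stats import kruskal

    def kruskal_on_histograms(times, counts):
        samples = []
        for ch, freq in counts.items():
            # Repeat each bucket's time once per counted IO for that channel.
            expanded = [t for t, f in zip(times, freq) for _ in range(f)]
            samples.append(expanded)
        statistic, p_value = kruskal(*samples)
        return statistic, p_value  # a small p-value suggests the drives differ

Note the bucket width (about 17 ms) limits the resolution of the expanded samples.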

answered 10 Nov at 12:10
