Reemplazar el valor faltante por hora con el promedio anual de esa hora del día

I have a hourly dataset from 1996 - 2010 in the following format:

             date         value  
1- - -1996-01-01 00:00:00- - -  NA  
2- - -1996-01-01 01:00:00- - -  38  
3 - - -1996-01-01 02:00:00- - - 44  
4- - -1996-01-01 03:00:00- - -  48  
5- - -1996-01-01 04:00:00- - -  42  
6- - -1996-01-01 05:00:00- - -  44  
7- - - 1996-01-01 06:00:00- - - 38  
8- - - 1996-01-01 07:00:00- - - 42  
9- - -1996-01-01 08:00:00- - -  44  
10- - -1996-01-01 09:00:00- - - 44  

I have lot of missing hours data and I am tring to calculate the missing hour values by taking the average of previous and next hour, and if many hours missing I am trying to calculate by taking the average of that hour for every day that year, just wondering if this is possible?

I have tried the following but this gives me average of the complete dataset:

a = c(NA, 1, 2, 3, 10)   
a[which(is.na(a)==TRUE)] = mean(a,na.rm = T)

I would really appreciate any advice about how I should proceed with this calculation.

preguntado el 09 de marzo de 12 a las 14:03

2 Respuestas

na.aggregate in zoo does that. Its only one line of code to fill in the missing values:

# read in the data

Lines <- "1996-01-01 00:00:00 NA  
1996-01-01 01:00:00 38  
1996-01-01 02:00:00 43
1997-01-01 00:00:00 44  
1997-01-01 01:00:00 45"

library(zoo)
library(chron)
z <- read.zoo(text = Lines, index = 1:2, FUN = paste, FUN2 = as.chron)

# fill in the missing values

na.aggregate(z, hours, FUN = mean)

respondido 10 mar '12, 00:03

Thanks for your reply too, I will try this monday and will let you know how it went, thanks - Achak

You could probably do this using some handy function from the zoo package. For instance, na.approx, con maxgap = 1 should linearly interpolate all the gaps of length one. Then you'd probably want to use na.aggregate, splitting by year and hour, to fill longer gaps with that periods' mean.

Here's a simple example to give you a sense of how these functions work:

set.seed(124)
tt <- as.POSIXct("2000-01-01 10:00:00") + 3600*c(1:100,10000:10100)
dd <- runif(201)

aa <- data.frame(x1 = tt,x2 = dd)
aa$x2[sample(201,30)] <- NA
aa$x3 <- na.approx(aa$x2,maxgap = 1)
aa$x4 <- na.aggregate(aa$x3,by = format(aa$x1,"%Y-%H"))

Note that if your series has leading or trailing NAs, you might get errors, since the "linear interpolation" piece doesn't make much sense in that case. So you'd have to fill those in some other way.

respondido 09 mar '12, 15:03

Thanks for your advise, I will try this monday and will let you know how it goes! Many thanks - Achak

No es la respuesta que estás buscando? Examinar otras preguntas etiquetadas or haz tu propia pregunta.