Data Processing and Statistical Analysis

School of Economics and Management
Beihang University
http://yanfei.site

In this lecture we are going to learn…

Some basic forecasting methods
Forecasting time series with complex seasonality
Dealing with missing values and outliers

Examples of time series

library(fpp2)
library(ggplot2)
library(forecast)

autoplot(beer, main = 'Australian quarterly beer production')

autoplot(pigs, main = 'Number of pigs slaughtered in Victoria')

autoplot(dowjones, main = 'Dow−Jones index')

Forecasting

Forecasting is estimating how the sequence of observations will continue into the future.
We usually think probabilistically about future sample paths
What range of values covers the possible sample paths with 80% probability?

beer <-  window(ausbeer, start = 1992, end = c(2005,4))
beer.arima <- auto.arima(beer)
autoplot(forecast(beer.arima))

Some basic forecasting methods

Very simple forecasting methods

Average method: forecast of all future values is equal to mean of historical data.
Naïve method: forecasts equal to last observed value.
Seasonal naïve method: forecasts equal to last value from same season.

## average method
beerfit.mean <- meanf(beer, h = 11)

## naive method
beerfit.naive <- naive(beer, h = 11)

## seasonal naive method
beerfit.snaive <- snaive(beer, h = 11)

cols <- c("mean" = "#0000ee","naive" = "#ee0000","snaive" = "green")

## plot
autoplot(beerfit.mean, PI = FALSE, main ='Forecasts for quarterly beer production') +
  geom_line(aes(x=time(beerfit.mean$mean),y=beerfit.mean$mean,colour='mean')) + 
  geom_line(aes(x=time(beerfit.naive$mean),y=beerfit.naive$mean,colour='naive')) + 
  geom_line(aes(x=time(beerfit.snaive$mean),y=beerfit.snaive$mean,colour='snaive')) +
  guides(fill=FALSE) +
  scale_colour_manual(name="Forecasts",values=cols)

Check residuals

## check whether the residuals are random
checkresiduals(beerfit.mean)

## 
##  Ljung-Box test
## 
## data:  Residuals from Mean
## Q* = 148.53, df = 7, p-value < 2.2e-16
## 
## Model df: 1.   Total lags used: 8

checkresiduals(beerfit.naive)

## 
##  Ljung-Box test
## 
## data:  Residuals from Naive method
## Q* = 130.18, df = 8, p-value < 2.2e-16
## 
## Model df: 0.   Total lags used: 8

checkresiduals(beerfit.snaive)

## 
##  Ljung-Box test
## 
## data:  Residuals from Seasonal naive method
## Q* = 32.302, df = 8, p-value = 8.224e-05
## 
## Model df: 0.   Total lags used: 8

Measures of forecast accuracy

## check accuracy of forecasting
beer2 <- window(ausbeer, start=2006)
accuracy(beerfit.mean, beer2)

##                         ME     RMSE      MAE        MPE     MAPE     MASE
## Training set  8.121418e-15 44.17630 35.91135 -0.9510944 7.995509 2.444228
## Test set     -1.718344e+01 38.01454 33.77760 -4.7345524 8.169955 2.298999
##                     ACF1 Theil's U
## Training set -0.12566970        NA
## Test set     -0.08286364 0.7901651

accuracy(beerfit.naive, beer2)

##                       ME     RMSE      MAE         MPE     MAPE     MASE
## Training set   0.7090909 66.60207 55.43636  -0.8987351 12.26632 3.773156
## Test set     -62.2727273 70.90647 63.90909 -15.5431822 15.87645 4.349833
##                     ACF1 Theil's U
## Training set -0.25475212        NA
## Test set     -0.08286364  1.428524

accuracy(beerfit.snaive, beer2)

##                     ME     RMSE      MAE        MPE     MAPE      MASE
## Training set -1.846154 17.24261 14.69231 -0.4803931 3.401224 1.0000000
## Test set     -2.545455 12.96849 11.27273 -0.7530978 2.729847 0.7672537
##                    ACF1 Theil's U
## Training set -0.3408329        NA
## Test set     -0.1786912   0.22573

Some basic forecasting methods

Forecasting based on time series decomposition

Time series can be decomposed into:

Trend: pattern exists when there is a long-term increase or decrease in the data.
Seasonal: pattern exists when a series is influenced by seasonal factors (e.g., the quarter of the year, the month, or day of the week).
Cyclic: pattern exists when data exhibit rises and falls that are not of fixed period (duration usually of at least 2 years).
Remainder.

Time series decomposition

fit <- stl(elecequip, t.window=15, s.window="periodic", robust=TRUE)
autoplot(fit)

fcast <- forecast(fit, method="naive", h=24)
autoplot(fcast) + ylab("New orders index")

Some basic forecasting methods

ARIMA models

ARMA models: \(y_t = c + \phi_1 y_{t-1} + \cdots + \phi_p y_{t-p} + \theta_1 e_{t-1} + \cdots + \theta_1 e_{t-q}\).
ARIMA models: \((1-B)^dy_t\) follows an ARMA model.

ARIMA forecasting

Nonseasonal

fit <- auto.arima(internet)
autoplot(forecast(fit))

checkresiduals(fit)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(1,1,1)
## Q* = 7.8338, df = 8, p-value = 0.4499
## 
## Model df: 2.   Total lags used: 10

Seasonal

fit <- auto.arima(h02)
autoplot(forecast(fit))

checkresiduals(fit)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(4,1,1)(0,1,2)[12]
## Q* = 26.081, df = 17, p-value = 0.07299
## 
## Model df: 7.   Total lags used: 24

Some basic forecasting methods

Time series regression models

The general form of a multiple regression model is \[y_t = \beta_0 + \beta_1 x_{1,t} + \beta_2 x_{2,t} + \cdots + \beta_k x_{k,t} + \epsilon_t.\]

Assumptions:

The relationship between the forecast variable and the predictor variables satisfies this linear equation.
Assumptions on the error term \(\epsilon_t\).

Time series regression models

Growth rates of personal consumption and personal income in the USA.

autoplot(uschange, facet=TRUE)

Build a regression model.

fit.consMR <- tslm(
  Consumption ~ Income + Production + Unemployment + Savings,
  data=uschange)
summary(fit.consMR)

## 
## Call:
## tslm(formula = Consumption ~ Income + Production + Unemployment + 
##     Savings, data = uschange)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.88296 -0.17638 -0.03679  0.15251  1.20553 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.26729    0.03721   7.184 1.68e-11 ***
## Income        0.71449    0.04219  16.934  < 2e-16 ***
## Production    0.04589    0.02588   1.773   0.0778 .  
## Unemployment -0.20477    0.10550  -1.941   0.0538 .  
## Savings      -0.04527    0.00278 -16.287  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3286 on 182 degrees of freedom
## Multiple R-squared:  0.754,  Adjusted R-squared:  0.7486 
## F-statistic: 139.5 on 4 and 182 DF,  p-value: < 2.2e-16

Fitted values.

autoplot(uschange[,'Consumption'], series="Data") +
  autolayer(fitted(fit.consMR), series="Fitted") +
  xlab("Year") + ylab("") +
  ggtitle("Percent change in US consumption expenditure") +
  guides(colour=guide_legend(title=" "))

Checking with the model.

checkresiduals(fit.consMR)

## 
##  Breusch-Godfrey test for serial correlation of order up to 8
## 
## data:  Residuals from Linear regression model
## LM test = 14.874, df = 8, p-value = 0.06163

Complex seasonality

We have considered simple seasonal patterns such as quarterly and monthly data.
Higher frequency time series often exhibit much more complicated seasonal patterns.
- Daily data may have a weekly pattern as well as an annual pattern.
- Hourly data usually has three types of seasonality: a daily pattern, a weekly pattern, and an annual pattern.
- Even weekly data can be challenging to forecast as it typically has an annual pattern with seasonal period of 365.25/7 ≈ 52.179 on average.

Example

p1 <- autoplot(calls) +
  ylab("Call volume") + xlab("Weeks") +
  scale_x_continuous(breaks=seq(1,33,by=2))
p2 <- autoplot(window(calls, end=4)) +
  ylab("Call volume") + xlab("Weeks") +
  scale_x_continuous(minor_breaks = seq(1,4,by=0.2))
gridExtra::grid.arrange(p1,p2)

Five-minute call volume handled on weekdays between 7:00am and 9:05pm in a large North American commercial bank. Top panel shows data from 3 March 2003 to 23 May 2003. Bottom panel shows only the first three weeks.

Complex seasonality

Time series decomposition

calls %>% mstl() %>%
  autoplot() + xlab("Week")

Multiple STL for the call volume data.

Forecasting based on decompostion

calls %>%  stlf() %>%
  autoplot() + xlab("Week")

Forecasting on the call volume data based on multiple stl decomposition.

Complex seasonality

Dynamic harmonic regression with multiple seasonal periods

In our case, the seasonal periods are 169 and 845, so the Fourier terms are of the form

\[sin(\frac{2\pi kt}{169}), ~cos(\frac{2\pi kt}{169}), ~sin(\frac{2\pi kt}{845}), ~cos(\frac{2\pi kt}{845}).\] for \(k = 1, 2, \dots\).

calls <- subset(calls, start=length(calls)-2000) 
fit <- auto.arima(calls, seasonal=FALSE, lambda=0,
         xreg=fourier(calls, K=c(10,10)))
fit %>%
  forecast(xreg=fourier(calls, K=c(10,10), h=2*169)) %>%
  autoplot(include=5*169) +
    ylab("Call volume") + xlab("Weeks")

Dealing with missing values and outliers

Dealing with missing values

We can replace the missing values with estimates.

gold2 <- na.interp(gold)
autoplot(gold2, series="Interpolated") +
  autolayer(gold, series="Original") +
  scale_color_manual(
    values=c(`Interpolated`="red",`Original`="gray"))

Daily morning gold prices for 1108 consecutive trading days beginning on 1 January 1985 and ending on 31 March 1989.

Outliers

The tsoutliers() function is designed to identify outliers, and to suggest potential replacement values.

tsoutliers(gold)

## $index
## [1] 770
## 
## $replacements
## [1] 494.9

Another useful function is tsclean() which identifies and replaces outliers, and also replaces missing values.

gold %>%
  tsclean() %>%
  auto.arima() %>%
  forecast(h=50) %>%
  autoplot()

In this lecture we are going to learn…

Examples of time series

Forecasting

Some basic forecasting methods

Some basic forecasting methods

Very simple forecasting methods

Check residuals

Measures of forecast accuracy

Some basic forecasting methods

Forecasting based on time series decomposition

Time series decomposition

Some basic forecasting methods

ARIMA models

ARIMA forecasting

Nonseasonal

Seasonal

Some basic forecasting methods

Time series regression models

Time series regression models

Growth rates of personal consumption and personal income in the USA.

Build a regression model.

Fitted values.

Checking with the model.

Complex seasonality

Complex seasonality

Example

Complex seasonality

Time series decomposition

Forecasting based on decompostion

Complex seasonality

Dynamic harmonic regression with multiple seasonal periods

Dealing with missing values and outliers

Dealing with missing values

Outliers

References