Throughout this project, we will build a model to forecast monthly electricity consumption by American Electric Power (AEP) customers. The model will be based solely on time series data.
We will mainly be using the following packages:
library(tidyverse) # includes ggplot2 and various data manipulation packages
library(lubridate) # To manipulate date and time data
library(ggfortify) # Allows ggplot2 to decompose and plot time series objects
library(forecast) # Contains forecasting methods
library(tseries) # For time series creation and analysis (including the ADF test)
library(nnfor) # To build neural network models with time series
library(gridExtra) # To visualize plots in grid format
First, we need to import our dataset. The dataset we will be using can be found online at: https://www.kaggle.com/robikscube/hourly-energy-consumption#AEP_hourly.csv
It was built from public data made available by PJM Interconnection and contains hourly electricity consumption in MW (including consumption by AEP clients) between October 2004 and August 2018, for a total of 121,273 observations.
library(readr)
AEP_hourly <- read_csv("hourly-energy-consumption/AEP_hourly.csv")
head(AEP_hourly)
Our dataset is quite extensive and starts in 2004. In order to concentrate on the last eleven full calendar years, we can filter it to keep only the data from January 1, 2007 to December 31, 2017.
AEP_hourly <- AEP_hourly %>% filter(Datetime >= ymd_hms("2007-01-01 00:00:00") &
                                    Datetime < ymd_hms("2018-01-01 00:00:00"))
Creating new variables
We can now clean our existing variables and create new variables that may later be used for exploration.
names(AEP_hourly)[2] <- "MW"
AEP_hourly$TW <- AEP_hourly$MW/1000000 # convert MW to TW so that monthly sums read as TWh
AEP_hourly$date <- as.Date(AEP_hourly$Datetime)
AEP_hourly$year <- as.numeric(format(AEP_hourly$Datetime, "%Y"))
AEP_hourly$month <- as.numeric(format(AEP_hourly$Datetime, "%m"))
AEP_hourly$wday <- wday(AEP_hourly$date) # lubridate default: 1 = Sunday, 7 = Saturday
AEP_hourly$day <- as.numeric(format(AEP_hourly$Datetime, "%d"))
AEP_hourly$hour <- as.numeric(format(AEP_hourly$Datetime, "%H"))
AEP_hourly$hour[AEP_hourly$hour == 0] <- 24 # recode hour 0 (midnight) as hour 24
head(AEP_hourly)
Duplicates and Missing Values
This is a good time to check for duplicates and missing values:
anyDuplicated(AEP_hourly)
## [1] 0
sum(is.na(AEP_hourly))
## [1] 0
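Note that anyDuplicated() only flags fully duplicated rows. With hourly data it is also worth checking for duplicated or missing timestamps (daylight-saving transitions are a typical cause); a quick sketch:
sum(duplicated(AEP_hourly$Datetime)) # repeated hours (e.g., the fall DST transition)
# Missing hours: compare the row count to a complete hourly sequence
length(seq(min(AEP_hourly$Datetime), max(AEP_hourly$Datetime), by = "hour")) - nrow(AEP_hourly)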
Building a monthly dataset
Finally, we can build a second dataset that aggregates consumption by month and year.
AEP_monthly <- aggregate(TW ~ month + year, AEP_hourly, sum)
head(AEP_monthly)
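Since the tidyverse is already loaded, the same aggregation can also be written with dplyr. An equivalent sketch (assuming dplyr >= 1.0 for the .groups argument; the aggregate() result above is what we use below):
AEP_monthly2 <- AEP_hourly %>%
  group_by(year, month) %>%
  summarise(TW = sum(TW), .groups = "drop") %>%
  select(month, year, TW) # match the column order produced by aggregate()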
We can now start exploring our dataset. This step is crucial to gain a better understanding of the patterns and regularities that may be present in our data.
First, we will plot all our data:
annual <- ggplot(AEP_hourly, aes(x = date, y = MW)) +
geom_point(alpha = 0.018, color = "#00AFBB") +
geom_hline(yintercept=mean(AEP_hourly$MW),
color="#FC4E07",
linetype = 1) +
theme_minimal() +
labs(title = "AEP Electricity Consumption (All Time Average in Red)", x = "Time", y = "MWh")
annual
With this volume of data, we can’t see much beyond a possible seasonal fluctuation within each year.
Let’s take a look at total annual consumption:
annual_sum <- aggregate(TW ~ year, AEP_hourly, sum)
ggplot(annual_sum, aes(x = year, y = TW)) +
geom_line(alpha = 1, color = "#00AFBB") +
geom_hline(aes(yintercept = mean(annual_sum$TW)),
color = "#FC4E07") +
scale_x_continuous(breaks = annual_sum$year) +
theme_minimal() +
labs(title = "Average Annual Electricity Consumption (All Time Average in Red)",
x = "Year",
y = "TWh")
Total annual consumption appears to have declined over the years.
We can also take a look at monthly average consumption:
monthly_sum <- aggregate(TW ~ month + year, AEP_hourly, sum)
monthly_mean <- aggregate(TW ~ month, monthly_sum, mean)
ggplot(monthly_sum, aes(x = year, y = TW)) +
geom_point(alpha = 0.8, color = "#00AFBB") +
geom_hline(aes(yintercept = TW),
monthly_mean,
color = "#FC4E07", size = 1) +
theme_minimal() +
labs(title = "Monthly Average Electricity Consumption",
x = "Each month (1-12) combines 11 years (2007-2017)",
y = "TWh") +
theme(axis.text.x=element_blank()) +
facet_grid(cols = vars(month))
Without calculating trend or seasonality, we can already see some variation between seasons: on average, consumption is higher in summer (July and August) and in winter (January, February and December).
What about hourly variation?
We can calculate average hourly consumption:
hourly_sum <- aggregate(MW ~ hour + year, AEP_hourly, mean)
hourly_mean <- aggregate(MW ~ hour, hourly_sum, mean)
ggplot(hourly_sum, aes(x = year, y = MW)) +
geom_point(alpha = 0.6, color = "#00AFBB") +
geom_hline(aes(yintercept = MW),
hourly_mean,
color = "#FC4E07", size = 1) +
theme_minimal() +
labs(title = "Average Hourly Electricity Consumption",
x = "Each hour of the day (1-24) combines 11 annual averages (2007-2017)",
y = "MWh") +
theme(axis.text.x=element_blank()) +
facet_grid(cols = vars(hour))
We can also take a look at all data points (individual hours of consumption) for each hour of the day:
ggplot(AEP_hourly, aes(x = year, y = MW)) +
geom_point(alpha = 0.02, color = "#00AFBB") +
geom_hline(aes(yintercept = MW),
aggregate(MW ~ hour, AEP_hourly, mean),
color = "#FC4E07", size = 1) +
theme_minimal() +
labs(title = "All Data Points and Hourly Average Consumption, by Hour of the Day",
x = "Consumption by hour of the day (1-24) for all years (2007-2017)",
y = "Hourly MW Consumption") +
theme(axis.text.x=element_blank()) +
facet_grid(cols = vars(hour))
Without taking into account trend or seasonality, we can see variation between day and night. Consumption starts dropping around 22:00 and starts rising again around 6:00.
We can also see variation based on the days of the week:
wday_sum <- aggregate(TW ~ wday + year, AEP_hourly, sum)
wday_mean <- aggregate(TW ~ wday, wday_sum, mean)
ggplot(wday_sum, aes(x = year, y = TW)) +
geom_point(alpha = 0.7, color = "#00AFBB") +
geom_hline(aes(yintercept = TW),
wday_mean,
color = "orangered") +
theme_minimal() +
labs(title = "Average Daily Electricity Consumption",
x = "Each day of the week (1-7) combines 11 years (2007-2017)",
y = "TWh") +
theme(axis.text.x=element_blank()) +
facet_grid(cols = vars(wday))
ggplot(AEP_hourly, aes(x = year, y = MW)) +
geom_point(alpha = 0.03, color = "#00AFBB") +
geom_hline(aes(yintercept = MW),
aggregate(MW ~ wday, AEP_hourly, mean),
color = "#FC4E07", size = 1) +
theme_minimal() +
labs(title = "All Data Points and Hourly Average Consumption, by Day of the Week",
x = "Consumption by day of the week (1 = Monday, 7 = Sunday) for all years (2007-2017)",
y = "MWh Consumption") +
theme(axis.text.x=element_blank()) +
facet_grid(cols = vars(wday))
Again, we see variation between the days in the middle of the week and the weekend days at each end of the week (days 1 and 7).
Before we start building our forecasting models, we first need to split our dataset into training and testing sets. Since we are working with time series data, we will use our first 10 years of data (2007-2016) to train the models and the 11th year (2017) to validate them.
train <- AEP_monthly %>% filter(year >= 2007 & year < 2017)
test <- AEP_monthly %>% filter(year == 2017)
We will also transform our datasets to time series (ts) format:
# For the training dataset
train_ts <- train
train_ts$year = NULL
train_ts$month = NULL
train_ts <- ts(train_ts, frequency = 12, start = c(2007, 1))
# For the validation dataset
test_ts <- test
test_ts$year = NULL
test_ts$month = NULL
test_ts <- ts(test_ts, frequency = 12, start = c(2017, 1))
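As an aside, an equivalent and more compact split can be obtained by building a single ts object and cutting it with window(); this also yields univariate series rather than one-column matrices. A sketch, not used below:
full_ts   <- ts(AEP_monthly$TW, frequency = 12, start = c(2007, 1))
train_ts2 <- window(full_ts, end = c(2016, 12))  # 2007-2016
test_ts2  <- window(full_ts, start = c(2017, 1)) # 2017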
We can now analyse our training dataset for trend and seasonality using the decompose function.
decomposed_train <- decompose(train_ts)
autoplot(decomposed_train)
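The decompose function performs a classical additive decomposition by default. A common alternative worth comparing is stl(), which estimates the components with loess and allows the seasonal pattern to evolve over time. A minimal sketch (stl expects a univariate series, so we rebuild one from the train data frame):
train_uts <- ts(train$TW, frequency = 12, start = c(2007, 1)) # univariate version of train_ts
stl_train <- stl(train_uts, s.window = "periodic")
autoplot(stl_train)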
Seasonality
We can see from the decomposed data that our training dataset has clear seasonality.
We can plot seasonality by year:
ggseasonplot(train_ts, year.labels=TRUE, year.labels.left=TRUE) +
ylab("TWh") +
ggtitle("Seasonal plot: Electricity Consumption by Year") +
theme_minimal()
We can also plot our de-seasonalized data:
deseason_AEP <- seasadj(decompose(train_ts))
autoplot(deseason_AEP, colour = "#00AFBB") + theme_minimal() + labs(x = NULL, y = "TWh")
Stationarity
Stationarity is harder to detect by eye. We can test for it using the augmented Dickey–Fuller (ADF) test and the Kwiatkowski–Phillips–Schmidt–Shin (KPSS) test.
adf.test(train_ts)
##
## Augmented Dickey-Fuller Test
##
## data: train_ts
## Dickey-Fuller = -3.7401, Lag order = 4, p-value = 0.02426
## alternative hypothesis: stationary
kpss.test(train_ts)
##
## KPSS Test for Level Stationarity
##
## data: train_ts
## KPSS Level = 1.3123, Truncation lag parameter = 4, p-value = 0.01
The ADF test rejects its unit-root null hypothesis (p ≈ 0.024), which points towards stationarity, but the KPSS test rejects its null hypothesis of level stationarity (p ≤ 0.01), which points the other way. Taken together, the tests suggest the series is not clearly stationary, consistent with the trend and seasonality visible in the decomposition; the models we fit below handle this through differencing (note the seasonal difference in the ARIMA specification).
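Reusing the univariate series train_uts built in the stl sketch above, we can also let the forecast package estimate how much differencing the series needs; ndiffs() and nsdiffs() wrap unit-root and seasonal-strength tests respectively (a quick sketch):
ndiffs(train_uts)  # suggested number of regular differences
nsdiffs(train_uts) # suggested number of seasonal differences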
Because we are forecasting a seasonal time series, we will build and test the following four models: a Holt-Winters model, an ARIMA model, an ELM neural network and an MLP neural network.
First, we build our Holt-Winters model:
HW_model <- HoltWinters(train_ts)
Secondly, we use our model to forecast for the last year:
HW_forecast <- forecast(HW_model, h = 12)
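Before going further, we can optionally sanity-check the fit with the residual diagnostics from the forecast package (a sketch):
checkresiduals(HW_forecast) # residual plots plus a Ljung-Box test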
We can now add our predictions to our validation dataset:
test$HW_model <- HW_forecast$mean %>% as.vector()
test
Finally, we can visualize our predictions and compare them to our actual values:
HW_plot <- ggplot(test, aes(x = month)) +
theme_minimal() +
geom_line(aes(y = TW, colour = "Actual")) +
geom_line(aes(y = HW_model, colour = "Prediction"), linetype = 2) +
scale_color_manual(values=c("#00AFBB", "#FC4E07")) +
theme(legend.title = element_blank(), legend.position = c(0.8, 0.8)) +
labs(x = "Month", y = "TWh", color = "", title = "Model 1: Holt-Winters Model") +
scale_x_continuous(breaks=c(1:12), labels=c(1:12),limits=c(1,12))
HW_plot
We now follow the same approach for all the other models.
We can build our ARIMA model using the auto.arima function (manual specification is also possible with the Arima function):
ARIMA_model <- auto.arima(train_ts)
We can take a look at the specifications and performance of our model (based on the training dataset).
summary(ARIMA_model)
## Series: train_ts
## ARIMA(1,0,0)(2,1,0)[12] with drift
##
## Coefficients:
## ar1 sar1 sar2 drift
## 0.4590 -0.4922 -0.2591 -0.0111
## s.e. 0.0895 0.1036 0.1134 0.0048
##
## sigma^2 estimated as 0.288: log likelihood=-85.94
## AIC=181.88 AICc=182.46 BIC=195.29
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -0.009057844 0.4996233 0.3778669 -0.2151784 3.319678 0.730864
## ACF1
## Training set -0.02219158
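For reference, the specification selected above could likely be reproduced manually with the Arima function (a sketch mirroring the printed model):
ARIMA_manual <- Arima(train_ts, order = c(1, 0, 0),
                      seasonal = c(2, 1, 0), include.drift = TRUE)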
We then build our 12-month forecast, add it to our test dataset and visualize the result.
ARIMA_forecast <- forecast(ARIMA_model, h = 12)
test$ARIMA_model <- ARIMA_forecast$mean %>% as.vector()
ARIMA_plot <- ggplot(test, aes(x = month)) +
geom_line(aes(y = TW, colour = "Actual")) +
geom_line(aes(y = ARIMA_model, colour = "Prediction"), linetype = 2) +
scale_color_manual(values=c("#00AFBB", "#FC4E07")) +
theme_minimal() +
theme(legend.title = element_blank(), legend.position = c(0.8, 0.8)) +
labs(x = "Month", y = "TWh", color = "", title = "Model 2: ARIMA Model") +
scale_x_continuous(breaks=c(1:12), labels=c(1:12),limits=c(1,12))
ARIMA_plot
We can build our ELM Neural Network model using the elm function from the nnfor package:
set.seed(9)
ELM_model <- elm(train_ts)
We can take a look at the structure of our ELM Neural Network:
print(ELM_model)
## ELM fit with 100 hidden nodes and 20 repetitions.
## Series modelled in differences: D1.
## Univariate lags: (1,2,3,8,9,12)
## Deterministic seasonal dummies included.
## Forecast combined using the median operator.
## Output weight estimation using: lasso.
## MSE: 0.2111.
plot(ELM_model)
The elm function creates a heavy-looking neural network composed of 100 nodes in a single hidden layer. Most of the connections are then eliminated by a lasso (least squares shrinkage) estimator: only the connections shown in black are used by the forecast. The function also sets the number of input nodes based on seasonality (light red nodes) and autoregressive lags (light grey nodes) (Source: Nikolaos Kourentzes).
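For reference, a call that makes these defaults explicit might look as follows (a sketch: argument names are from the nnfor documentation, values are read off the printed fit, and ELM_explicit is an illustrative name):
set.seed(9)
ELM_explicit <- elm(train_ts, hd = 100, reps = 20,
                    comb = "median", type = "lasso")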
We then forecast the next 12 months, add our data to the test dataset and visualize our result.
set.seed(9)
ELM_forecast <- forecast(ELM_model, h = 12)
test$ELM_model <- ELM_forecast$mean %>% as.vector()
ELM_plot <- ggplot(test, aes(x = month)) +
theme_minimal() +
geom_line(aes(y = TW, colour = "Actual")) +
geom_line(aes(y = ELM_model, colour = "Prediction"), linetype = 2) +
scale_color_manual(values=c("#00AFBB", "#FC4E07")) +
theme(legend.title = element_blank(), legend.position = c(0.8, 0.8)) +
labs(x = "Month", y = "TWh", color = "", title = "Model 3: ELM Neural Network Model") +
scale_x_continuous(breaks=c(1:12), labels=c(1:12),limits=c(1,12))
ELM_plot
We can build our MLP Neural Network Model using the mlp function from the nnfor package.
set.seed(8)
MLP_model <- mlp(train_ts)
We can take a look at the structure of our model:
print(MLP_model)
## MLP fit with 5 hidden nodes and 20 repetitions.
## Series modelled in differences: D1.
## Univariate lags: (1,2,3,8,9,12)
## Deterministic seasonal dummies included.
## Forecast combined using the median operator.
## MSE: 0.0127.
plot(MLP_model)
The mlp function sets the number of input nodes based on seasonality (light red nodes) and autoregressive lags (light gray nodes), combines the forecasts using the median, and can select the size of the hidden layer automatically. It then creates an ensemble model combining 20 repetitions (Source: Nikolaos Kourentzes).
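By default mlp uses a preset number of hidden nodes (five here); to have that number selected automatically, a call along these lines can be used (a sketch based on the nnfor documentation; MLP_tuned is an illustrative name and is not used below):
set.seed(8)
MLP_tuned <- mlp(train_ts, hd.auto.type = "valid") # slower, selects hidden nodes on a validation sample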
We can then build a forecast using our model, add the forecast to the test dataset and visualize our result.
set.seed(8)
MLP_forecast <- forecast(MLP_model, h = 12)
test$MLP_model <- MLP_forecast$mean %>% as.vector()
test
MLP_plot <- ggplot(test, aes(x = month)) +
theme_minimal() +
geom_line(aes(y = TW, colour = "Actual")) +
geom_line(aes(y = MLP_model, colour = "Prediction"), linetype = 2) +
scale_color_manual(values=c("#00AFBB", "#FC4E07")) +
theme(legend.title = element_blank(), legend.position = c(0.8, 0.8)) +
labs(x = "Month", y = "TWh", color = "", title = "Model 4: MLP Neural Network Model") +
scale_x_continuous(breaks=c(1:12), labels=c(1:12),limits=c(1,12))
MLP_plot
We now have four forecasting models:
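The original grid call is not shown, but the four plots can be displayed side by side with the gridExtra package loaded at the start; likely something like:
grid.arrange(HW_plot, ARIMA_plot, ELM_plot, MLP_plot, nrow = 2)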
What about January and February?
We can see from this view that none of our models (except the MLP model, for February) were able to adequately predict the months of January and February. Looking back at our Monthly Average Electricity Consumption plot in the exploration section, we can see that the 2017 values for these months are unusually low.
Based on our historical data alone, it would therefore have been very difficult for our models to predict those months correctly.
We now also have a test dataset including all our predictions:
test
A first approach to evaluating performance
We can use this information to calculate the performance of our models using the Root Mean Square Error (RMSE), the Mean Absolute Error (MAE) and the Mean Absolute Percentage Error (MAPE).
We can build a performance function for these indicators:
performance <- function(data, model_name, prediction, actual) {
data %>% summarise(
Model = model_name,
RMSE = sqrt(mean((prediction - actual)^2)),
MAE = mean(abs(prediction - actual)),
MAPE = mean(abs((prediction - actual)/actual)*100)
)
}
We then use this function to build a performance table comparing our four models.
p1 <- performance(test, "Holt Winters", test$HW_model, test$TW)
p2 <- performance(test, "ARIMA", test$ARIMA_model, test$TW)
p3 <- performance(test, "ELM nn", test$ELM_model, test$TW)
p4 <- performance(test, "MLP nn", test$MLP_model, test$TW)
model_performance <- rbind(p1, p2, p3, p4)
model_performance
A second (faster) approach to evaluating performance
Another approach for this step is simply to use the accuracy function from the forecast package. In addition to the indicators presented above, this function also produces the Mean Error (ME) and the Mean Percentage Error (MPE).
p1.2 <- (accuracy(test$HW_model, test$TW))
p2.2 <- (accuracy(test$ARIMA_model, test$TW))
p3.2 <- (accuracy(test$ELM_model, test$TW))
p4.2 <- (accuracy(test$MLP_model, test$TW))
model_performance_2 <- rbind(p1.2, p2.2, p3.2, p4.2)
rownames(model_performance_2) <- c("Holt Winters", "ARIMA", "ELM nn", "MLP nn")
model_performance_2
## ME RMSE MAE MPE MAPE
## Holt Winters -0.2773796 0.7158364 0.4942192 -2.683157 4.685011
## ARIMA -0.1450301 0.7157607 0.5067510 -1.420228 4.722823
## ELM nn -0.3167866 0.6331524 0.4451059 -3.018101 4.169479
## MLP nn 0.3199681 0.7500514 0.6542487 3.100272 6.137138
Choosing our Final Model
Based on the results of our performance analysis, we can conclude that our ELM Neural Network (which has the lowest RMSE, MAE and MAPE) gives us the best predictions.
Finally, we can use our model to produce a forecast for the year 2018. But first, we need to rebuild our model using all available data (train and test datasets combined).
monthly_ts <- AEP_monthly %>% filter(year >= 2007 & year <= 2017)
monthly_ts <- monthly_ts$TW
monthly_ts <- ts(monthly_ts, frequency = 12, start = c(2007, 1))
str(monthly_ts)
## Time-Series [1:132] from 2007 to 2018: 12.9 12.9 11.9 10.9 11.3 ...
set.seed(9)
final_model <- elm(monthly_ts)
Then, we can produce our forecast and create a dataset which combines that forecast and our actual values.
# Build forecast
forecast_2018 <- forecast(final_model, h = 12)
# Integrate data in a dataframe
df_07_16 <- data.frame(train[1:3])
df_17 <- data.frame(test[1:3])
df_07_17 <- rbind(df_07_16, df_17)
df_07_17 <- data.frame(df_07_17,
forecast = as.numeric(NA),
date = as.Date(with(df_07_17, paste(year, month, 1,sep="-")), "%Y-%m-%d"))
df_18 <- data.frame(month = as.numeric(c(1:12)),
year = 2018,
TW = as.numeric(NA))
df_18$date = as.Date(with(df_18, paste(year, month, 1,sep="-")), "%Y-%m-%d")
df_18$forecast <- forecast_2018$mean %>% as.numeric()
# Repeat the last actual value as the first forecast point so the plotted lines connect
df_plot_18 <- rbind(df_07_17[132,], df_18)
df_plot_18$forecast[1] <- df_plot_18$TW[1]
df_07_18 <- rbind(df_07_17, df_18)
tail(df_07_18, 24)
We now have a dataframe containing our historical values and our final 2018 forecast, and we can plot the result. To facilitate visualization, we will only plot the 2015-2018 period.
df_15_18 <- df_07_18[97:144,]       # 2015-2018
df_07_17_short <- df_07_17[97:132,] # actual values for 2015-2017
ggplot(data = df_15_18, aes(x = date)) +
theme_minimal() +
geom_line(data = df_07_17_short, aes(x = date, y = TW, group = 1, colour = "Actual"),
size = 1) +
geom_line(data = df_plot_18, aes(x = date, y = forecast, group = 1, colour = "Forecast"),
linetype = 2, size = 1) +
geom_point(data = df_07_17_short, aes(x = date, y = TW, group = 1, colour = "Actual"),
shape = 1) +
geom_point(data = df_plot_18, aes(x = date, y = forecast, group = 1, colour = "Forecast"),
shape = 1) +
scale_color_manual(values=c("#00AFBB", "#FC4E07")) +
theme(legend.position = c(0.9, 0.9)) +
labs(x = NULL, y = "TWh", color = "", title = "Our Final Electricity Consumption Forecast")