Summary

This operator calculates a regression model for a time series. The regression model tries to explain observations with the help of trends, seasonal variations and other influencing factors, and creates a forecast for future dates.

A time series is a sequence of data points, measured typically at successive times spaced at uniform time intervals [1], e.g., temperature measured at every hour, or weekly sales figures.

The seasonal regression model also makes it possible to analyze whether the observed data depend on other influencing factors (e.g., training, marketing campaigns, number of employees).

A detailed description of regression analysis methods can be found at [2].

Requirements for data to be analyzed

  • Input data must be scaled, i.e., the time stamps in columns 'Date + time (from)' and 'Date + time (to)' represent uniform time intervals of the same duration. (see Scaling 8.0)
  • Input data must be sorted by time stamps in the columns 'Date + time (from)' and 'Date + time (to)'.
  • The columns 'Date + time (from)' and 'Date + time (to)' must not contain missing entries.
  • The underlying data must comprise at least one complete season.
  • If one or several observations are missing, the corresponding time stamps will be inserted into the data. The inserted entries are treated as missing observations.
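The requirements above can be checked before the data is passed to the operator. The following is a minimal sketch, not part of TIS: the function name `check_scaled_input` and the representation of rows as `(from, to)` tuples are assumptions made for illustration, and the duration check only makes sense for fixed-length intervals such as hours, days or weeks.

```python
from datetime import datetime, timedelta

def check_scaled_input(rows):
    """Return a list of problems found in (from, to) time stamp pairs.

    rows: list of (start, end) tuples; None marks a missing entry.
    """
    problems = []
    # Time stamp columns must not contain missing entries.
    if any(start is None or end is None for start, end in rows):
        problems.append("missing time stamp")
        return problems
    # Scaled data: every interval has the same duration.
    durations = {end - start for start, end in rows}
    if len(durations) > 1:
        problems.append("intervals differ in duration")
    # Rows must be sorted by time stamp.
    starts = [start for start, _ in rows]
    if starts != sorted(starts):
        problems.append("rows are not sorted by time stamp")
    return problems

week = timedelta(weeks=1)
first = datetime(2010, 1, 4)
rows = [(first + i * week, first + (i + 1) * week) for i in range(4)]
print(check_scaled_input(rows))        # uniform weekly data
print(check_scaled_input(rows[::-1]))  # reversed rows violate the sort order
```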

Configuration

Input settings of existing table

Name

Value

Opt.

Description

Example

Identifier

System.Object

opt.

Columns by whose content the data is to be grouped. A separate regression analysis is performed for each group.

-

Date + Time (from)

System.DateTime

-

Column containing the start times (date + time) of the observations.

-

Date + Time (to)

System.DateTime

-

Column containing the end times (date + time) of the observations.

-

Distinction column

System.String

opt.

This column distinguishes observations from history from observations in the forecast period. Observations from history are marked with "H", observations in the forecast period with "F". This column only needs to be specified if the regression model contains an independent variable.

-

Influencing factors

System.Object

opt.

These columns represent the factors that influence the measured observations. The regression analysis tries to establish a (linear) relationship between these factors and the observations. At most one of these columns may be non-numerical, and that column must contain precisely two different values. These two values are 0-1 coded.

(warning) Note: more than one influencing factor is possible. Values can be of type double, e.g. to represent the strength of a factor.

(warning) Note: the influencing factors, or any subset of them, must not add up to 1 for every observation. E.g. 3 categories should be modeled with 2 factors.

(warning) The following data properties can cause numerical problems:

  • a*F = 1 or a*F ~ 1
  • sum(aj * Fj) = 1 or sum(aj * Fj) ~ 1

-
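The note on modeling 3 categories with 2 factors can be illustrated as follows. This is a minimal sketch with hypothetical data: the helper `dummy_code` and the column `campaign_type` are assumptions, not part of the operator. One category is chosen as the reference and gets no column of its own, so the remaining 0-1 columns never add up to 1 in every row.

```python
def dummy_code(values, reference):
    """0-1 code a categorical column, dropping the reference category."""
    categories = sorted(set(values) - {reference})
    return {cat: [1.0 if v == cat else 0.0 for v in values] for cat in categories}

# Hypothetical categorical factor with three values.
campaign_type = ["none", "tv", "radio", "tv", "none"]

# "none" is the reference category, so only two factor columns are created.
factors = dummy_code(campaign_type, reference="none")
for name, column in factors.items():
    print(name, column)
```

If all three categories were coded as separate columns, their sum would be 1 for every observation, which is exactly the situation the note warns about.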

Observations

System.Double

-

This (numeric) column contains the values that have been observed in each interval of the time series on which the model is based.

-

Settings

Name

Value

Opt.

Description

Example

Duration of the season

System.String

  • 1 year
  • 52 weeks
  • 1 week
  • 3 months
  • 1 month
  • 1 day
  • 12 hours
  • 8 hours
  • 6 hours
  • 1 hour

-

Please enter the duration of the season on which your data is based.

52 weeks

Box-Cox transformation

System.String

  • Automatic
  • 1 (identity)
  • 0.5 (root)
  • 0 (logarithm)
  • -1 (reciprocal value)

-

The Box-Cox transformation converts the observations into a form that can be used for the regression analysis. If you select 'Automatic', a suitable transformation is determined from the data on which the analysis is based. Select another type of transformation only if you have sound knowledge of the data to be analysed:

  • Select '1 (identity)' if the seasonal variations remain constant in absolute terms, e.g. the December value is always 1000 units higher than the annual average.
  • Select '0 (logarithm)' if the seasonal variations are a constant percentage, e.g. the December value is always 20% higher than the annual average.
  • Select '0.5 (root)' if the seasonal variations are a percentage that decreases slightly over time, e.g. the first December value is 20% higher than the monthly average, the second December value is 19% higher, etc.
  • Select '-1 (reciprocal value)' for non-negative data with a falling trend, e.g. sales data from a book shop, where high values are observed during the first few months after publication and then fall over time.

Automatic
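To make the four fixed options more concrete, the following sketch implements the textbook Box-Cox transform for a parameter `lam` (1, 0.5, 0 or -1) together with its inverse. It is an assumption that the operator uses this standard form; how the operator handles zero values and how 'Automatic' chooses the parameter is not documented here.

```python
import math

def box_cox(y, lam):
    """Standard Box-Cox transform of a positive observation y."""
    if y <= 0:
        raise ValueError("the standard Box-Cox transform requires y > 0")
    if lam == 0:
        return math.log(y)            # '0 (logarithm)'
    return (y ** lam - 1.0) / lam     # 1: identity (shifted), 0.5: root, -1: reciprocal

def inverse_box_cox(z, lam):
    """Map a transformed value back to the original scale."""
    if lam == 0:
        return math.exp(z)
    return (lam * z + 1.0) ** (1.0 / lam)

for lam in (1, 0.5, 0, -1):
    z = box_cox(100.0, lam)
    print(lam, z, inverse_box_cox(z, lam))
```

A forecast computed on the transformed scale has to be mapped back with the inverse transform before it can be compared with the original observations.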

Forecast period

System.Int32

-

Please enter the period, e.g. the next 6 weeks, for which a forecast is to be calculated.

26

Time Unit

System.String

  • Year(s)
  • Month(s)
  • Week(s)
  • Day(s)
  • Hour(s)

-

Please select the time unit in which the forecast period is measured, e.g. 'Week(s)' for a forecast of the next 6 weeks.

Week(s)

Deliver as result

System.String

  • Forecast
  • Forecast + History
  • Forecast + hist. forecast
  • Forecast + history + hist. forecast + confidence interval
  • Only parameter estimation
  • Statistics only

-

Please select which data should be displayed in the results.

Forecast + History

Validate model over

System.String

  • Last 1/4 forecast period
  • Last 1/2 forecast period
  • Last forecast period
  • Last 2 forecast periods
  • Last 3 forecast periods
  • Each forecast period

-

The observations are divided into a validation period and a training period, e.g. validation period = the last 52 weeks of the data on which the model is based, training period = all other data. A regression model is created on the basis of the training period, and this model is used to estimate the data of the training and validation period. The percentage and average errors between the estimated and actual observations are determined for the training and validation period. If the errors for the training and validation period are very different, your regression model probably contains too many variables and returns inaccurate forecasts (overfitting). In this case, please try to simplify your model by removing one or more variables.

Each forecast period

Adjusted R²

System.Double

-

The adjusted coefficient of determination, adj. R², provides information about the quality of a regression model, i.e. how well the regression model explains the data on which it is based. A value adj. R² = 0.95 means that about 95% of the variation in the data can be explained by the regression. Please enter in this field a minimum limit for the adj. R² of a regression model. If the adj. R² of a model falls below this limit, the regression model is discarded and the forecast is created using the mean value of the data.

0.6
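The following sketch shows how the adjusted R² is commonly computed and how a minimum limit such as 0.6 could be applied. It uses the textbook formula for a model with intercept; the function names and the parameter `n_predictors` (number of model terms without the intercept) are assumptions, and the operator's internal computation may differ in detail.

```python
def adjusted_r2(observed, fitted, n_predictors):
    """Adjusted R² for a model with intercept and n_predictors terms."""
    n = len(observed)
    mean = sum(observed) / n
    ss_res = sum((o - f) ** 2 for o, f in zip(observed, fitted))
    ss_tot = sum((o - mean) ** 2 for o in observed)
    r2 = 1.0 - ss_res / ss_tot
    # Penalize the number of predictors relative to the number of observations.
    return 1.0 - (1.0 - r2) * (n - 1) / (n - n_predictors - 1)

def keep_model(observed, fitted, n_predictors, min_adj_r2=0.6):
    """True if the model passes the limit; otherwise fall back to the mean."""
    return adjusted_r2(observed, fitted, n_predictors) >= min_adj_r2

observed = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
print(keep_model(observed, observed, n_predictors=1))    # perfect fit
print(keep_model(observed, [3.5] * 6, n_predictors=1))   # no better than the mean
```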

ANOVA p-value

System.Double

-

Apart from the adjusted R², the p value of the ANOVA is another indicator of the quality of a regression model. High p values (e.g. larger than 0.1) indicate that the regression explains the data on which it is based poorly. Please enter an upper limit for the p value of the ANOVA. If this value is exceeded, the model is discarded and the forecasts are estimated using the mean value of the data.

0.1

p values (influencing factors)

System.Double

-

The p value of an independent variable provides information about whether the observations depend on this variable or not. Low values (e.g. less than 0.05) indicate a relationship between the independent and the dependent variable; high values, on the other hand, indicate that no relationship could be established. In this field, please enter a limit for the p value of a single independent variable. If the p value exceeds this limit, this variable will automatically be excluded from the regression model, provided you have activated 'Exclude insignificant influencing factors'.

0.05

Include seasonal variations

System.String

  • none
  • little
  • middle
  • strong

-

The regression model tries to identify seasonal variations and to take these into account for the forecast.

(warning) Note: selecting 'strong' carries a risk of overfitting.

middle

Include intercept

System.Boolean

-

Please do not disable the intercept unless you are certain that it does not play any role in your observations.

True

Linear trend

System.Boolean

-

The regression model contains a linear trend, i.e. the observed values rise/fall linearly along the time axis.

(warning) Note: extreme values at the start or end of the period can produce a misleading linear trend, so try the model with and without a linear trend.

True

Include quadratic trend

System.Boolean

-

The regression model contains a quadratic trend, i.e. the observed values rise/fall quadratically along the time axis.

False

Include logarithmic trend

System.Boolean

-

The regression model contains a logarithmic trend, i.e. the observed values rise/fall logarithmically along the time axis.

False
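The three trend options can be thought of as additional columns of the regression model that depend only on the position of an observation on the time axis. The sketch below builds such columns. The function name `trend_columns` and the choice of a time index starting at 1 are assumptions for illustration; the operator's internal encoding is not documented here.

```python
import math

def trend_columns(n, linear=True, quadratic=False, logarithmic=False):
    """Build trend regressors for n consecutive time intervals."""
    t = range(1, n + 1)  # time index 1..n (assumed; log(0) would be undefined)
    columns = {}
    if linear:
        columns["linear"] = [float(i) for i in t]
    if quadratic:
        columns["quadratic"] = [float(i * i) for i in t]
    if logarithmic:
        columns["logarithmic"] = [math.log(i) for i in t]
    return columns

print(trend_columns(4, linear=True, quadratic=True, logarithmic=True))
```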

Include day in year

System.Boolean

-

The location of a date within the year is taken into account in the regression model, e.g. 3.1.2010 is the third day of the year.

False

Include day in month

System.Boolean

-

The location of a date within the month is taken into account in the regression model, e.g. 3.1.2010 is the third day in January.

False

Include day in week

System.Boolean

-

The effect of different week days is taken into account in the regression model.

False

Include quarters

System.Boolean

-

Quarterly periods are taken into account in the regression model.

False
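The calendar options (day in year, day in month, day in week, quarters) all derive a value from the date of an observation. The following sketch computes these values with the Python standard library for the date used in the descriptions, 3.1.2010 (3 January 2010). The function and key names are assumptions; whether the operator uses these values numerically or as 0-1 coded categories is not stated here.

```python
from datetime import date

def calendar_features(d):
    """Derive calendar positions of a date."""
    return {
        "day_in_year": d.timetuple().tm_yday,
        "day_in_month": d.day,
        "day_in_week": d.isoweekday(),        # 1 = Monday ... 7 = Sunday
        "quarter": (d.month - 1) // 3 + 1,    # 1..4
    }

print(calendar_features(date(2010, 1, 3)))
```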

Summer time

System.Boolean

-

Summer time is taken into account in the regression model.

False

Include previous time interval

System.Boolean

-

The regression analysis examines whether observations are affected by the observations in the previous time interval of the scale.

False

Include previous day

System.Boolean

-

The regression analysis examines whether observations are affected by the observations on the previous day.

False

Previous week

System.Boolean

-

The regression analysis examines whether observations are affected by the observations in the previous week.

(warning) Note: this may easily lead to exponential growth, so disable it if in doubt.

False

Previous year

System.Boolean

-

The regression analysis examines whether observations are affected by the observations in the previous year.

False
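The options for the previous time interval, day, week and year all relate an observation to an earlier observation of the same series. A minimal sketch of such lagged columns for daily data is shown below. The helper `lagged` and the sample values are hypothetical, and entries without a predecessor are marked with `None`, i.e. treated as missing.

```python
def lagged(values, lag):
    """Shift a series by `lag` intervals; entries without predecessor are None."""
    return ([None] * lag + list(values))[:len(values)]

# Hypothetical daily observations (one value per day).
daily = [20.0, 22.0, 21.0, 25.0, 30.0, 12.0, 10.0, 21.0, 23.0]

previous_interval = lagged(daily, 1)   # for daily data this equals the previous day
previous_week = lagged(daily, 7)       # same weekday one week earlier

for row in zip(daily, previous_interval, previous_week):
    print(row)
```

Because each forecast value then depends on earlier forecast values, errors can build up over the forecast period, which is consistent with the warning about exponential growth given for the previous week.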

Exclude insignificant influencing factors

System.Boolean

-

Influencing factors whose p values exceed the limit given in the 'p values (influencing factors)' field will be excluded from the regression model.

(warning) Note: the procedure starts from the full model and iteratively excludes the factor with the highest p value above the threshold.

True
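The iterative exclusion described in the note corresponds to what is usually called backward elimination. The sketch below shows only the control flow. The callback `fit_p_values` is hypothetical: it stands for refitting the model with the given factors and returning their p values, and the stub `fake_fit` with fixed p values exists only to make the example runnable.

```python
def backward_eliminate(factors, fit_p_values, threshold=0.05):
    """Drop the least significant factor until no p value exceeds the threshold.

    fit_p_values: hypothetical callback that refits the model for a list of
    factor names and returns a dict {name: p_value}.
    """
    remaining = list(factors)
    while remaining:
        p_values = fit_p_values(remaining)
        worst = max(remaining, key=lambda name: p_values[name])
        if p_values[worst] <= threshold:
            break                     # all remaining factors are significant
        remaining.remove(worst)       # exclude one factor, then refit
    return remaining

# Stub standing in for a real model fit (fixed p values, illustration only).
def fake_fit(names):
    table = {"campaign": 0.01, "staff": 0.30, "training": 0.08}
    return {name: table[name] for name in names}

print(backward_eliminate(["campaign", "staff", "training"], fake_fit))
```

In a real fit the p values of the remaining factors change after each exclusion, which is why the model is refitted in every iteration instead of removing all factors above the threshold at once.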

Ignore 0 values

System.Boolean

-

Rows with 0 values are ignored and are treated like missing data.

False

Output the validation result in data nodes

System.Boolean

-

If selected, the validation results are displayed in a separate data node.

False

Output error messages/warnings in data nodes

System.Boolean

-

If selected, error messages and warnings are displayed in a separate data node.

False

Remarks

  • Additional information, such as training and validation errors, and warnings, are reported in the description field of the operation within the TIS-GUI.
  • If the column of an influencing factor contains the same constant value for each observation, no regression analysis can be carried out. Thus, constant factors are excluded automatically.


Want to learn more?

Examples

Example: Regression model with trend and season

Situation

We consider the number of incoming phone calls at a call center over a time period. We are given the number of incoming calls for each week during the last four years.

  • We see that in general the number of calls has been rising during the last years. However, the number of incoming calls varies strongly over an entire year. Usually, the number of incoming calls is greater during the first half year than during the second half year.
  • With the regression analysis operation we want to build a model explaining the number of incoming calls over the last four years. Furthermore, we want to estimate and forecast the number of incoming calls for future periods of time.

Settings

  • Add the operation 'Regression analysis 6.0' to the current data node.
  • A single season seems to last an entire year. Since we are dealing with weekly data, we must select '52 weeks' for the Duration of the season.
  • As a result of the operation we would like to obtain the historical number of phone calls, the forecast for the next season (= 52 weeks), and the historical forecasts, i.e., the estimated number of incoming phone calls for the last four years. For that purpose we select the option 'Forecast + history + hist. forecast + confidence interval' as the result to be delivered by the operation.
  • In this step we want to start with a simple regression model. Thus, our model shall only include a linear trend and season.
  • All other settings keep their default values.

Result

The result table shows the forecast (G), and the lower and upper limits of the confidence intervals (E and F).

After adding the chart operation Chart: Histogram Time Pattern, the visualization shows that the historical forecasts (orange line) estimated with the obtained regression model follow a similar trend and seasonal behaviour as the historical observations (blue line). The red line shows the estimated number of phone calls for the next 52 weeks.

Project-File

Confluence Op Regression seasonal 6.0.gzip


Example: Regression model with trend, season and influencing factors

Situation

  • In the previous example many high values within the history could not be explained by the regression model with linear trend and season.
  • After careful investigation of the data, we suspect that the high historical values occur in weeks in which the company runs advertising campaigns for its products. Thus, we extended our original input table for the regression analysis by an additional column, which contains a '1' whenever a campaign took place in the respective week and a '0' otherwise. Moreover, we also recorded the campaigns planned for the next 52 weeks and inserted a distinction column in order to separate historical observations (marked with an 'H') from future entries (marked with an 'F').

On the basis of this modified input we can now build an extended regression model. In the settings of the regression operation, we specify the advertising campaigns as an additional influencing factor and select the distinction column.
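The shape of the extended input table can be sketched as follows. All names and numbers are hypothetical and chosen only for illustration: historical weeks carry an observation and are marked with 'H', future weeks carry only the planned campaign flag and are marked with 'F'.

```python
from datetime import date, timedelta

def build_rows(start, calls, campaign_flags):
    """Combine history and planned campaigns into one input table.

    calls: historical weekly observations.
    campaign_flags: 0/1 per week, covering history plus forecast period.
    """
    rows = []
    for i, flag in enumerate(campaign_flags):
        week_start = start + timedelta(weeks=i)
        is_history = i < len(calls)
        rows.append({
            "from": week_start,
            "to": week_start + timedelta(weeks=1),
            "calls": calls[i] if is_history else None,  # unknown in the future
            "campaign": flag,
            "distinction": "H" if is_history else "F",
        })
    return rows

rows = build_rows(date(2010, 1, 4), calls=[120, 180, 130], campaign_flags=[0, 1, 0, 1, 0])
for row in rows:
    print(row["from"], row["calls"], row["campaign"], row["distinction"])
```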

Settings

Result

The resulting data node shows the forecast (G), and the lower and upper CI limits (E, F).

In the histogram visualizing the results (added with the operation Chart: Histogram Time Pattern) we see that campaigns could indeed explain an increased number of incoming calls in the past (compare orange and blue line). Also the forecast for the next 52 weeks (red line) contains high values whenever an advertising campaign will be run.

Example: Comparison of different regression models

Situation

To be able to compare two or more regression models with each other, the operation 'Regression analysis 6.0' validates each model in the following manner.

  • The historical observations are divided into a training set and a validation set.
  • A regression model is built on the basis of the data within the training set.
  • With the obtained model we estimate the observations within the validation set and compute the errors between the estimated values and the actual observations.
  • From these errors we compute the error percentage as well as the average error of a model. These errors are reported within a separate data node if 'Output the validation result in data nodes' has been selected within the operator settings.
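The validation steps listed above can be sketched as follows. The exact error definitions used by the operator are not given here, so the mean absolute error and a percentage error relative to the sum of the actual values are assumptions. The "model" in the example is simply the mean of the training set, used only as a stand-in to keep the sketch runnable.

```python
def split_history(observations, validation_length):
    """Split the history into a training set and a trailing validation set."""
    return observations[:-validation_length], observations[-validation_length:]

def average_error(actual, estimated):
    """Mean absolute difference between actual and estimated values."""
    return sum(abs(a - e) for a, e in zip(actual, estimated)) / len(actual)

def percentage_error(actual, estimated):
    """Total absolute error as a percentage of the total actual values."""
    total_error = sum(abs(a - e) for a, e in zip(actual, estimated))
    return 100.0 * total_error / sum(abs(a) for a in actual)

history = [100.0, 110.0, 105.0, 120.0, 115.0, 130.0]
training, validation = split_history(history, validation_length=2)

# Stand-in model: predict the training mean for every validation interval.
estimate = [sum(training) / len(training)] * len(validation)

print(average_error(validation, estimate))
print(percentage_error(validation, estimate))
```

Computing the same two errors on the training set and comparing them with the validation errors gives the overfitting check described for the 'Validate model over' setting.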

Settings

Result

If we consider and compare the validation error associated with the two models from example 1 and example 2 we see that the validation error could be reduced by including advertising campaigns within the model for example 2.

-

In general, one regression model should be preferred over another only if it has a significantly lower validation error.

Troubleshooting

Problem

Frequent Cause

Solutions

Error message: "Object reference not set to an instance of an object."

In TIS 5.8.2, this error message occurs when the box "Output the validation result in data nodes" is checked. This is a bug, see ticket BugFLEX-505 - Regressionsanalyse 6.0 - Fehler Validierung AUCH in Regression (English: "Regression analysis 6.0 - validation error ALSO in Regression"), status: Open.

-

Forecast for excluded days, e.g., holiday.

The operator does not consider if a day in the future period is a day to be excluded or not.

In our TIS Forecast solution, no values of the additional influencing factors are provided for days to be excluded. If a regression model is based on at least one of these influencing factors, the forecast value for that day will be null.

If a regression model does not use any additional influencing factors, e.g. because they were all eliminated due to high p values, then a forecast value is produced for an excluded day.

In the TIS forecasting solution, the forecast rows representing excluded days are eliminated after the regression forecast (merge data, rows without a common key in data node 2).

Related topics