olsmt
Purpose
Computes a least squares regression.
Format
- out = olsmt(data, formula[, ctl])
- out = olsmt(data, depvar, indvars[, ctl])
- Parameters:
data (string or dataframe) – name of dataset, dataframe in memory, or an empty string. If data is an empty string, "", the procedure assumes that the actual data has been passed in the next two arguments.
formula (string) –
formula string of the model. For example:
- "y ~ X1 + X2": y is the name of the dependent variable; X1 and X2 are the names of the independent variables.
- "y ~ .": the . includes all variables except the dependent variable y.
- "y ~ -1 + X1 + X2": the -1 specifies a model with no intercept.
depvar –
If data contains a string, then depvar can be a:
- string: name of dependent variable.
- scalar: index of dependent variable. If scalar 0, the last column of the dataset will be used.
If data is an empty string, name of dataframe, or 0:
- Nx1 vector: the dependent variable.
indvars (Kx1 vector or NxK matrix) –
If data contains a string:
- Kx1 character vector: names of independent variables.
- Kx1 numeric vector: indices of independent variables. These can be any size subset of the variables in the data and can be in any order. If a scalar 0 is passed, all columns of the data will be used except for the one used for the dependent variable.
If data is an empty string, dataframe, or 0:
- NxK matrix: the independent variables.
ctl (struct) –
Optional input. An instance of an olsmtControl structure containing the following members:
ctl.altnam
string array, default "".
This can be a \((K+1) \times 1\) or \((K+2) \times 1\) string array of alternate variable names for the output. If ctl.con is 1, and ctl.altnam has \((K+2)\) elements, then the first element will control the name displayed for the constant term. The name of the dependent variable is the last element.
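The naming rule above can be sketched as follows. This is a minimal illustration that reuses the small matrices from the Examples section; the alternate names themselves are hypothetical.

```gauss
// Example data: 5 observations, K = 3 independent variables
y = { 2, 3, 1, 7, 5 };
x = { 1 3 2,
      2 3 1,
      7 1 7,
      5 3 1,
      3 5 5 };

// Declare the control structure and fill it with default settings
struct olsmtControl ctl;
ctl = olsmtControlCreate();

// ctl.con is 1 by default, so K+2 = 5 names can be supplied:
// the constant first, then the K regressors, and the
// dependent variable last. The names are hypothetical.
ctl.altnam = "Intercept" $| "Rain" $| "Temp" $| "Wind" $| "Yield";

// The printed report should use the alternate names
call olsmt("", y, x, ctl);
```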
ctl.con
scalar, default 1.
- 1:
a constant term will be added, \(D = K+1\).
- 0:
no constant term will be added, \(D = K\).
A constant term will always be used in constructing the moment matrix m.
ctl.cov
string, sets the covariance type. Default = "iid".
- "iid": Error terms are assumed to be independent and identically distributed.
- "robust": Huber/White/sandwich estimator.
- "cluster": Clustered sandwich estimator. A cluster variable identifier must be specified.
- "hac": HAC robust sandwich estimator.
ctl.clusterID
Matrix, a vector containing the categorical group variable used for computing cluster-robust standard errors.
ctl.clusterVar
String, name of cluster group variable. Only valid if a dataset and formula are specified.
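As a minimal sketch of selecting a covariance type, the following requests the Huber/White estimator for the detroit.dta model used in the Examples section. It relies only on the members described above; no particular output is claimed.

```gauss
// Declare the control structure and fill it with default settings
struct olsmtControl ctl;
ctl = olsmtControlCreate();

// Request Huber/White/sandwich standard errors
// instead of the default "iid"
ctl.cov = "robust";

// Dataset and formula taken from the Examples section
dataset = getGAUSSHome("examples/detroit.dta");
formula = "homicide ~ unemployment + hourly_earn";

// Point estimates do not depend on ctl.cov;
// only the reported standard errors and the
// statistics derived from them are affected
call olsmt(dataset, formula, ctl);
```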
ctl.miss
scalar, default 0.
- 0:
there are no missing values (fastest).
- 1:
listwise deletion, drop any cases in which missings occur.
- 2:
pairwise deletion, this is equivalent to setting missings to 0 when calculating m. The number of cases computed is equal to the total number of cases in the dataset.
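A small sketch of listwise deletion with matrix inputs follows. The data values are made up for illustration, and the `.` entry marks a missing value in the matrix literal.

```gauss
// Hypothetical data: 6 observations, K = 2 independent variables
y = { 2, 3, 1, 7, 5, 4 };

// The fourth row contains a missing value
x = { 1 3,
      2 3,
      7 1,
      5 .,
      3 5,
      4 2 };

// Declare the control structure and fill it with default settings
struct olsmtControl ctl;
ctl = olsmtControlCreate();

// Drop any case that contains a missing value
ctl.miss = 1;

// The regression should use the 5 complete rows
call olsmt("", y, x, ctl);
```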
ctl.row
scalar, the number of rows to read per iteration of the read loop. Default 0.
If 0, the number of rows will be calculated internally. If you get an Insufficient memory error message while executing olsmt(), you can supply a value for ctl.row that works on your system.
The answers may vary slightly due to rounding error differences when a different number of rows is read per iteration. You can use ctl.row to control this if you want to get exactly the same rounding effects between several runs.
ctl.vpad
scalar, default 1.
If 0, internally created variable names are not padded to the same length (e.g., X1, X2,..., X10). If 1, they are padded with zeros to the same length (e.g., X01, X02,..., X10).
ctl.output
scalar, default 1.
- 1:
print the statistics.
- 0:
do not print statistics.
ctl.res
scalar, default 0.
- 1:
compute residuals (out.resid) and Durbin-Watson statistic (out.dwstat).
- 0:
out.resid = 0, out.dwstat = 0.
ctl.rnam
string, default "_olsmtres".
If the data is taken from a dataset, a new dataset will be created for the residuals, using the name in ctl.rnam.
ctl.maxvec
scalar, default 20000.
The largest number of elements allowed in any one matrix.
ctl.fcmptol
scalar, default 1e-12.
Tolerance used to fuzz the comparison operations to allow for round off error.
ctl.alg
string, default "cholup".
Selects the algorithm used for computing the parameter estimates. The default Cholesky update method is more computationally efficient. However, accuracy can suffer for poorly conditioned data. For higher accuracy, set ctl.alg to either "qr" or "svd".
- "qr": Solves for the parameter estimates using a QR decomposition.
- "svd": Solves for the parameter estimates using a singular value decomposition.
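One way to check whether the choice of algorithm matters for a given problem is to fit the same model twice and compare the estimates. The sketch below does this with the small matrices from the Examples section; it is only an illustration and makes no claim about the size of the difference.

```gauss
// Example data from the "Basic usage with matrices" example
y = { 2, 3, 1, 7, 5 };
x = { 1 3 2,
      2 3 1,
      7 1 7,
      5 3 1,
      3 5 5 };

// Declare the control structure and fill it with default settings
struct olsmtControl ctl;
ctl = olsmtControlCreate();

// Suppress the printed report
ctl.output = 0;

// Fit with the default Cholesky update ("cholup")
struct olsmtOut out_chol;
out_chol = olsmt("", y, x, ctl);

// Fit again with the QR decomposition
ctl.alg = "qr";
struct olsmtOut out_qr;
out_qr = olsmt("", y, x, ctl);

// Largest absolute difference between the two sets of estimates
print "Max abs difference in estimates:";
print maxc(abs(out_chol.b - out_qr.b));
```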
ctl.weights
Nx1 Vector, if defined, specifies weights to be used in the weighted least squares. If not defined, ordinary least squares will be computed.
ctl.weightsVar
String, name of the variable used for weighting. Only valid if a dataset and formula are specified. Will override any weights in ctl.weights.
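For reference, the textbook weighted least squares estimator is given below. It is assumed, not confirmed by this page, that this is the criterion olsmt minimizes when weights are supplied. Here \(w_i\) is the weight for observation \(i\), and \(x_i\) is the \(i\)-th row of the regressor matrix \(X\), including the constant when ctl.con is 1.

\[
\hat{b} = \arg\min_{b} \sum_{i=1}^{N} w_i \left( y_i - x_i' b \right)^2 = (X'WX)^{-1} X'Wy, \qquad W = \operatorname{diag}(w_1, \ldots, w_N)
\]

Setting every \(w_i = 1\) reduces this to ordinary least squares.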
- Returns:
out (struct) –
An instance of an olsmtOut structure containing the following members:
out.vnam
\((K+2) \times 1\) or \((K+1) \times 1\) character vector, the variable names used in the regression. If a constant term is used, this vector will be \((K+2) \times 1\), and the first name will be CONSTANT. The last name will be the name of the dependent variable.
out.m
MxM matrix, where \(M = K+2\), the moment matrix constructed by calculating X'X, where X is a matrix containing all usable observations and having columns in the order:
- 1.0 (constant)
- indvars (independent variables)
- depvar (dependent variable)
A constant term is always used in computing m.
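The construction of m described above can be sketched by hand for matrix inputs. The example reuses the small matrices from the Examples section and compares the hand-built matrix with out.m. The variable names xx and m are illustrative only.

```gauss
// Example data from the "Basic usage with matrices" example
y = { 2, 3, 1, 7, 5 };
x = { 1 3 2,
      2 3 1,
      7 1 7,
      5 3 1,
      3 5 5 };

// Stack the columns in the documented order:
// constant, independent variables, dependent variable
xx = ones(rows(x), 1) ~ x ~ y;

// (K+2)x(K+2) moment matrix, here 5x5
m = xx' * xx;

// Fit the model with the printed report suppressed
struct olsmtControl ctl;
ctl = olsmtControlCreate();
ctl.output = 0;

struct olsmtOut out;
out = olsmt("", y, x, ctl);

// Largest absolute difference between the hand-built
// moment matrix and the one returned by olsmt
print "Max abs difference from out.m:";
print maxc(maxc(abs(m - out.m)));
```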
out.b
Dx1 vector, the least squares estimates of parameters.
Error handling is controlled by the low order bit of the trap flag.
- trap 0:
terminate with error message
- trap 1:
return scalar error code in b
The error codes are:
- 30: system singular
- 31: system underdetermined
- 32: same number of columns as rows
- 33: too many missings
- 34: file not found
- 35: no variance in an independent variable
The system can become underdetermined if you use listwise deletion and have missing values. In that case, it is possible to skip so many cases that there are fewer usable rows than columns in the dataset.
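The trap behavior described above might be used as in the following sketch. It assumes that scalerr() can be applied directly to out.b to recover the code, and it uses a deliberately rank-deficient design with made-up values. Which code, if any, is returned for this data is not claimed here.

```gauss
// Hypothetical data: the second column of x is exactly
// twice the first, so the design is rank deficient
y = { 2, 3, 1, 7, 5 };
x = { 1 2,
      2 4,
      3 6,
      4 8,
      5 10 };

// Declare the control structure and suppress the printed report
struct olsmtControl ctl;
ctl = olsmtControlCreate();
ctl.output = 0;

struct olsmtOut out;

// Return a scalar error code in out.b instead of terminating
trap 1;
out = olsmt("", y, x, ctl);

// Restore the default behavior
trap 0;

// scalerr returns 0 if its argument is not a scalar error code
err = scalerr(out.b);

if err;
    print "olsmt returned error code:" err;
else;
    print "Estimates:";
    print out.b;
endif;
```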
out.stb
Kx1 vector, the standardized coefficients.
out.vc
DxD matrix, the variance-covariance matrix of estimates.
out.stderr
Dx1 vector, the standard errors of the estimated parameters.
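Since standard errors are the square roots of the diagonal of the variance-covariance matrix, out.stderr can be cross-checked against out.vc. The following is a minimal sketch that reuses the matrices from the Examples section.

```gauss
// Example data from the "Basic usage with matrices" example
y = { 2, 3, 1, 7, 5 };
x = { 1 3 2,
      2 3 1,
      7 1 7,
      5 3 1,
      3 5 5 };

// Fit the model with the printed report suppressed
struct olsmtControl ctl;
ctl = olsmtControlCreate();
ctl.output = 0;

struct olsmtOut out;
out = olsmt("", y, x, ctl);

// Standard errors recomputed from the diagonal of out.vc
se = sqrt(diag(out.vc));

print "Max abs difference from out.stderr:";
print maxc(abs(se - out.stderr));

// t-values are the estimates divided by their standard errors
print "t-values:";
print out.b ./ out.stderr;
```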
out.sigma
scalar, standard deviation of residual.
out.cx
\((K+1) \times (K+1)\) matrix, correlation matrix of variables with the dependent variable as the last column.
out.rsq
scalar, R square, coefficient of determination.
out.resid
residuals, \(out.resid = y - x * out.b\).
If ctl.res = 1, the residuals will be computed.
If the data is taken from a dataset, a new dataset will be created for the residuals, using the name in ctl.rnam. The residuals will be saved in this dataset as an Nx1 column. The out.resid return value will be a string containing the name of the new dataset containing the residuals. If the data is passed in as a matrix, the out.resid return value will be the Nx1 vector of residuals.
out.dwstat
scalar, Durbin-Watson statistic.
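For matrix inputs, the residuals and the Durbin-Watson statistic can be recomputed by hand as a check. The sketch below assumes that the constant is the first element of out.b (consistent with out.vnam) and that out.dwstat follows the standard definition, the sum of squared first differences of the residuals divided by the sum of squared residuals.

```gauss
// Example data from the "Basic usage with matrices" example
y = { 2, 3, 1, 7, 5 };
x = { 1 3 2,
      2 3 1,
      7 1 7,
      5 3 1,
      3 5 5 };

// Request residuals and the Durbin-Watson statistic,
// and suppress the printed report
struct olsmtControl ctl;
ctl = olsmtControlCreate();
ctl.res = 1;
ctl.output = 0;

struct olsmtOut out;
out = olsmt("", y, x, ctl);

// Residuals by hand, with a column of ones for the constant
n = rows(y);
e = y - (ones(n, 1) ~ x) * out.b;

// Standard Durbin-Watson formula (assumed to match olsmt)
dw = sumc((e[2:n, 1] - e[1:n-1, 1]).^2) / sumc(e.^2);

print "Max abs difference from out.resid:";
print maxc(abs(e - out.resid));
print "Hand-computed Durbin-Watson:" dw;
print "out.dwstat:" out.dwstat;
```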
Examples
Basic usage with matrices
// Set y matrix
y = { 2,
3,
1,
7,
5 };
// Set x matrix
x = { 1 3 2,
2 3 1,
7 1 7,
5 3 1,
3 5 5 };
// Perform least squares regression and print report to the screen
// The empty string, "" indicates that no dataset is used
call olsmt("", y, x);
Ordinary Least Squares
====================================================================================
Valid cases: 5 Dependent variable: Y
Missing cases: 0 Deletion method: None
Total SS: 23.2 Degrees of freedom: 1
R-squared: 0.982 Rbar-squared: 0.928
Residual SS: 0.417 Std. err of est: 0.646
F(3,1): 18.2 Probability of F: 0.17
====================================================================================
Standard Prob Lower Upper
Variable Estimate Error t-value >|t| Bound Bound
------------------------------------------------------------------------------------
CONSTANT -3.1489 1.3221 -2.3818 0.25306 -5.7401 -0.55765
X1 1.1045 0.20065 5.5048 0.1144 0.71127 1.4978
X2 1.7278 0.28449 6.0734 0.10389 1.1702 2.2854
X3 -0.75342 0.15111 -4.9859 0.12601 -1.0496 -0.45725
====================================================================================
Basic usage with a dataset and a formula string
// Create string with the name and full file path of the dataset
dataset = getGAUSSHome("examples/detroit.dta");
// Create formula string specifying dependent and independent variables
formula = "homicide ~ unemployment + hourly_earn";
// Perform estimation
call olsmt(dataset, formula);
In this example, the dataset detroit.dta is used to compute a regression. The dependent variable is homicide. The independent variables are: unemployment and hourly_earn. The output is:
Ordinary Least Squares
=====================================================================================
Valid cases: 13 Dependent variable: homicide
Missing cases: 0 Deletion method: None
Total SS: 3.22e+03 Degrees of freedom: 10
R-squared: 0.834 Rbar-squared: 0.801
Residual SS: 534 Std. err of est: 7.31
F(2,10): 25.2 Probability of F: 0.000125
=====================================================================================
Standard Prob Lower Upper
Variable Estimate Error t-value >|t| Bound Bound
-------------------------------------------------------------------------------------
CONSTANT -35.983 9.4372 -3.8128 0.0034133 -54.48 -17.486
unemployment -0.0049983 0.91882 -0.0054399 0.99577 -1.8059 1.7959
hourly_earn 15.487 2.2427 6.9057 4.1653e-05 11.092 19.883
=====================================================================================
Basic usage with a dataframe and categorical variable
// Load data
fname = getGAUSSHome("examples/auto2.dta");
auto2 = loadd(fname);
// Include the `rep78` categorical variable
call olsmt(auto2, "price ~ mpg + rep78");
In this example, the dependent variable price is regressed on mpg and rep78. The categorical variable rep78 will automatically be included in the OLS regression as a set of dummy variables, with the base case excluded from the regression. The coefficients for the categories Fair, Average, Good, and Excellent are included in the printed output table. The Poor category is excluded from the regression, as it is the base case.
Ordinary Least Squares
=========================================================================================
Valid cases: 69 Dependent variable: price
Missing cases: 5 Deletion method: Listwise
Total SS: 5.77e+08 Degrees of freedom: 63
R-squared: 0.258 Rbar-squared: 0.199
Residual SS: 4.28e+08 Std. err of est: 2.61e+03
F(5,63): 4.39 Probability of F: 0.00172
=========================================================================================
Standard Prob Lower Upper
Variable Estimate Error t-value >|t| Bound Bound
-----------------------------------------------------------------------------------------
CONSTANT 10450 2251 4.6423 1.7973e-05 6038 14862
mpg -280.26 61.577 -4.5514 2.4902e-05 -400.95 -159.57
rep78: Fair 877.63 2063.3 0.42536 0.67203 -3166.4 4921.7
rep78: Average 1425.7 1905.4 0.7482 0.45712 -2309 5160.3
rep78: Good 1693.8 1942.7 0.87191 0.38657 -2113.8 5501.5
rep78: Excellent 3132 2041 1.5345 0.12991 -868.47 7132.4
=========================================================================================
Estimate a linear model for each subset of a categorical variable
In this example, we will regress price on mpg and weight for the case where foreign equals "Domestic" and for the case where foreign equals "Foreign", using the by keyword.
// Load specific variables
fname = getGAUSSHome("examples/auto2.dta");
auto2 = loadd(fname, "price + mpg + weight + foreign");
// Estimate a linear model for each value of `foreign`
call olsmt(auto2, "price ~ mpg + weight + by(foreign)");
Ordinary Least Squares
====================================================================================
foreign: Domestic
====================================================================================
Valid cases: 52 Dependent variable: price
Missing cases: 0 Deletion method: None
Total SS: 4.89e+08 Degrees of freedom: 49
R-squared: 0.483 Rbar-squared: 0.462
Residual SS: 2.53e+08 Std. err of est: 2.27e+03
F(2,49): 22.9 Probability of F: 9.58e-08
====================================================================================
Standard Prob Lower Upper
Variable Estimate Error t-value >|t| Bound Bound
------------------------------------------------------------------------------------
CONSTANT -13285 5726 -2.3202 0.024538 -24508 -2062.4
mpg 237.69 139.03 1.7096 0.093667 -34.815 510.2
weight 4.415 0.94839 4.6553 2.4949e-05 2.5562 6.2739
====================================================================================
Ordinary Least Squares
====================================================================================
foreign: Foreign
====================================================================================
Valid cases: 22 Dependent variable: price
Missing cases: 0 Deletion method: None
Total SS: 1.44e+08 Degrees of freedom: 19
R-squared: 0.785 Rbar-squared: 0.763
Residual SS: 3.1e+07 Std. err of est: 1.28e+03
F(2,19): 34.8 Probability of F: 4.45e-07
====================================================================================
Standard Prob Lower Upper
Variable Estimate Error t-value >|t| Bound Bound
------------------------------------------------------------------------------------
CONSTANT -5065.8 3202.5 -1.5818 0.13019 -11343 1211.1
mpg -19.777 57.681 -0.34287 0.73546 -132.83 93.278
weight 5.1558 0.88069 5.8543 1.2249e-05 3.4297 6.882
====================================================================================
Use a dataset, a list of variable names, plus a control and output structure
new;
// Declare 'ols_ctl' to be an olsmtControl structure
// and fill with default settings
struct olsmtControl ols_ctl;
ols_ctl = olsmtControlCreate();
// Set the 'res' member of the olsmtControl structure
// so that 'olsmt' will compute residuals and the Durbin-Watson statistic
ols_ctl.res = 1;
// Declare 'ols_out' to be an olsmtOut structure
// to hold the results of the computations
struct olsmtOut ols_out;
// Create string with the name and full file path of the dataset
data = getGAUSSHome("examples/credit.dat");
// Create a string with the name of the dependent variable
depvar = "Limit";
// Create 3x1 string array, containing the independent variable names
indvars = "Balance" $| "Income" $| "Age";
// Perform estimation, using settings in the 'ols_ctl'
// control structure and store the results in 'ols_out'
ols_out = olsmt(data, depvar, indvars, ols_ctl);
In this example, the dataset credit.dat is used to compute a regression. The dependent variable is Limit. The independent variables are: Balance, Income, and Age. The residuals and Durbin-Watson statistic will be computed.
Ordinary Least Squares
====================================================================================
Valid cases: 400 Dependent variable: Limit
Missing cases: 0 Deletion method: None
Total SS: 2.13e+09 Degrees of freedom: 396
R-squared: 0.939 Rbar-squared: 0.939
Residual SS: 1.3e+08 Std. err of est: 572
F(3,396): 2.03e+03 Probability of F: 5.24e-240
Durbin-Watson: 1.95
====================================================================================
Standard Prob Lower Upper
Variable Estimate Error t-value >|t| Bound Bound
------------------------------------------------------------------------------------
CONSTANT 1521.9 102.23 14.887 4.0076e-40 1321.5 1722.3
Balance 3.1685 0.070635 44.857 2.5832e-157 3.03 3.3069
Income 32.567 0.93593 34.797 1.6866e-122 30.733 34.401
Age 1.6779 1.6943 0.9903 0.32263 -1.6429 4.9987
====================================================================================
Use a dataset and variable indices
// Set dataset name
dataset = getGAUSSHome("examples/credit.dat");
// Set the third variable in 'credit.dat', 'Rating'
// to be the dependent variable
depvar = 3;
// Set the first, second and fifth variables in 'credit.dat'
// to be the independent variables
indepvar = { 1, 2, 5 };
call olsmt(dataset, depvar, indepvar);
The above code will produce the following output:
Ordinary Least Squares
====================================================================================
Valid cases: 400 Dependent variable: Rating
Missing cases: 0 Deletion method: None
Total SS: 9.55e+06 Degrees of freedom: 396
R-squared: 0.994 Rbar-squared: 0.994
Residual SS: 5.94e+04 Std. err of est: 12.2
F(3,396): 2.11e+04 Probability of F: 0
====================================================================================
Standard Prob Lower Upper
Variable Estimate Error t-value >|t| Bound Bound
------------------------------------------------------------------------------------
CONSTANT 37.676 2.4157 15.596 4.3018e-43 32.941 42.41
Income 0.018253 0.028857 0.63254 0.5274 -0.038306 0.074812
Limit 0.066587 0.00043602 152.72 0 0.065733 0.067442
Age 0.019892 0.036174 0.5499 0.5827 -0.051009 0.090792
====================================================================================
Basic usage with weights
new;
// Define data
parent = { 0.21, 0.2, 0.19, 0.18, 0.17, 0.16, 0.15 };
progeny = { 0.1726, 0.1707, 0.1637, 0.164, 0.1613, 0.1617, 0.1598 };
sd = { 0.01988, 0.01938, 0.01896, 0.02037, 0.01654, 0.01594, 0.01763 };
// Calculate weights
weights = 1 ./ sd.^2;
// Set up olsmtControl structure
struct olsmtControl ctl;
ctl = olsmtControlCreate();
ctl.weights = weights;
call olsmt("", progeny, parent, ctl);
The above code will produce the following output:
Ordinary Least Squares
====================================================================================
Valid cases: 7 Dependent variable: Y
Missing cases: 0 Deletion method: None
Total SS: 572 Degrees of freedom: 5
R-squared: 0.852 Rbar-squared: 0.823
Residual SS: 0.0605 Std. err of est: 0.11
F(1,5): 28.8 Probability of F: 0.0018
====================================================================================
Standard Prob Lower Upper
Variable Estimate Error t-value >|t| Bound Bound
------------------------------------------------------------------------------------
CONSTANT 0.12796 0.0068112 18.787 7.8687e-06 0.11461 0.14131
X1 0.2048 0.038155 5.3676 0.0030205 0.13002 0.27958
====================================================================================
Remarks
For poorly conditioned data, the default setting for ctl.alg, using the Cholesky update, may produce only four or five digits of accuracy for the parameter estimates and standard errors. For greater accuracy, use either the QR or singular value decomposition algorithm by setting ctl.alg to "qr" or "svd". If you are unsure of the condition of your data, set ctl.alg to "qr".
No output file is modified, opened, or closed by this procedure. If you want output to be placed in a file, you need to open an output file before calling olsmt().
The supported dataset types are CSV, XLS, XLSX, HDF5, FMT, and DAT.
For an HDF5 file, the dataset must include the file schema, and both the file name and dataset name must be provided, e.g., olsmt("h5://C:/gauss/examples/testdata.h5/mydata", formula).
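The remark about output files can be sketched as follows. The file name olsmt_results.txt is hypothetical; the dataset and formula are taken from the Examples section.

```gauss
// Open an output file before calling olsmt.
// 'reset' overwrites the file if it already exists.
// The file name is hypothetical.
output file = olsmt_results.txt reset;

// Dataset and formula from the Examples section
dataset = getGAUSSHome("examples/detroit.dta");
formula = "homicide ~ unemployment + hourly_earn";

// The printed report is sent to the screen and to the file
call olsmt(dataset, formula);

// Close the output file
output off;
```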
Source
olsmt.src
See also
Functions glm(), gmmFitIV(), olsmtControlCreate(), olsqrmt(), Formula string, clusterSE(), robustSE()