Formula Strings#
Formula strings allow you to represent a model or a collection of variables in a compact, readable way, using the variable names in the dataset. They are used throughout GAUSS for loading data, specifying regression models, computing descriptive statistics, and controlling plot layouts.
Prerequisites
This page assumes you can load data with loadd(). If you are new
to GAUSS, see GAUSS Quickstart first.
// Load two variables by name
x = loadd(getGAUSSHome("examples/cancer.dat"), "stage + count");
// Specify a regression model (call discards the return value
// and prints the report directly)
call olsmt(getGAUSSHome("examples/credit.dat"), "Limit ~ Income + Rating");
// Compute descriptive statistics for selected variables
call dstatmt(getGAUSSHome("examples/auto2.dta"), "mpg + weight + price");
Formula strings work with dataframes — matrices that carry column names
and types, returned by loadd(). Because each column has a name,
formula strings can reference those names directly for subsetting,
estimation, and plotting.
Basic Syntax#
A formula string is a quoted string containing variable names and operators.
Selecting variables for loading or statistics:
// Load all variables
df = loadd(fname, ".");
// Load specific variables
df = loadd(fname, "Income + Rating + Cards");
// Load all variables except one
df = loadd(fname, ". - Cards");
Specifying a model with dependent and independent variables:
The tilde ~ separates the dependent (response) variable on the left
from the independent (explanatory) variables on the right:
// Limit = alpha + beta_1 * Income + beta_2 * Rating + epsilon
call olsmt(fname, "Limit ~ Income + Rating");
// Use all remaining variables as predictors
call olsmt(fname, "Limit ~ .");
Operators#
The following operators are available in formula strings. The examples assume a dataset containing three variables: Limit, Income, and Rating.
Operator |
Description |
|---|---|
|
Include all variables. When used after // These two are equivalent
"Limit ~ Income + Rating"
"Limit ~ ."
|
|
Add a variable. "Income + Rating"
|
|
Remove a variable. // These two are equivalent
"Limit ~ Rating"
"Limit ~ . - Income"
|
|
Create a pure interaction term between two variables (no main effects). "Limit ~ Income:Rating"
|
|
Include both main effects and the interaction term. // These two are equivalent
"Limit ~ Income * Rating"
"Limit ~ Income + Rating + Income:Rating"
|
|
Represents the intercept. Estimation functions such as // Remove the intercept
"Limit ~ -1 + Income + Rating"
|
|
Indicates that a variable should be loaded as a string. Typically
used with the // Load 'order_time' as a string and interpret as a date
"date($order_time)"
|
Keywords#
factor#
The factor keyword tells GAUSS that a numeric variable represents
categories. In estimation functions such as olsmt() and glm(),
GAUSS will automatically create dummy variables for each level, using
the first level as the base case.
// Load data
fname = getGAUSSHome("examples/auto2.dta");
// Create dummy variables from 'rep78'
call olsmt(fname, "mpg ~ weight + factor(rep78)");
The above code prints a regression report with separate coefficient
estimates for each non-base-case level of rep78:
Ordinary Least Squares
=========================================================================================
Valid cases: 69 Dependent variable: mpg
Missing cases: 5 Deletion method: Listwise
Total SS: 2.34e+03 Degrees of freedom: 63
R-squared: 0.672 Rbar-squared: 0.646
Residual SS: 768 Std. err of est: 3.49
F(5,63): 25.8 Probability of F: 4.6e-14
=========================================================================================
Standard Prob Lower Upper
Variable Estimate Error t-value >|t| Bound Bound
-----------------------------------------------------------------------------------------
CONSTANT 38.059 3.0934 12.304 2.0913e-18 31.996 44.122
weight -0.005503 0.000601 -9.1564 3.4748e-13 -0.006681 -0.0043251
rep78: Fair -0.4786 2.765 -0.17309 0.86313 -5.8981 4.9409
rep78: Average -0.47156 2.5531 -0.1847 0.85406 -5.4757 4.5326
rep78: Good -0.59903 2.6066 -0.22981 0.81898 -5.708 4.5099
rep78: Excellent 2.0863 2.7248 0.76566 0.44674 -3.2544 7.4269
=========================================================================================
cat#
The cat keyword tells GAUSS that a string variable in a file (such as
CSV or XLSX) should be reclassified to integer categories. This is useful
for file types that do not store variable type information.
cat can be combined with factor to load string data, convert it to
integer categories, and create dummy variables in a single step. Read the
nesting from the inside out: cat(load) converts the strings to integers
first, then factor(...) creates dummy variables from those integers:
fname = getGAUSSHome("examples/yarn.xlsx");
// cat(load): 'high, med, low' → integers
// factor(cat(load)): integers → dummy variables for OLS
call olsmt(fname, "cycles ~ factor(cat(load))");
The output:
Ordinary Least Squares
====================================================================================
Valid cases: 27 Dependent variable: cycles
Missing cases: 0 Deletion method: None
Total SS: 2.02e+07 Degrees of freedom: 24
R-squared: 0.0866 Rbar-squared: 0.0105
Residual SS: 1.85e+07 Std. err of est: 877
F(2,24): 1.14 Probability of F: 0.337
====================================================================================
Standard Prob Lower Upper
Variable Estimate Error t-value >|t| Bound Bound
------------------------------------------------------------------------------------
CONSTANT 534.44 292.47 1.8273 0.080113 -38.806 1107.7
load: low 621.56 413.62 1.5027 0.14596 -189.14 1432.3
load: med 359.11 413.62 0.86821 0.39388 -451.59 1169.8
====================================================================================
You can specify a base case explicitly when loading with cat:
fname = getGAUSSHome("examples/yarn.xlsx");
// Load with 'med' as the base case
df = loadd(fname, "cycles + cat(load, 'med')");
// Now 'med' is the reference level in the regression
call olsmt(df, "cycles ~ factor(load)");
date#
The date keyword tells GAUSS that a column contains date information.
GAUSS will try to match one of approximately 30 recognized date patterns
and convert the values to POSIX date/time format (seconds since
January 1, 1970).
Note
loadd() auto-detects dates for most common formats. The date
keyword is an override for cases where auto-detection fails or you need
to force a specific interpretation.
fname = getGAUSSHome("examples/yellowstone.csv");
// The $ tells GAUSS that 'Date' is stored as a string.
// The date() keyword tells GAUSS to interpret it as a date.
dates = loadd(fname, "date($Date)");
// Preview the first 4 dates
print dates[1:4];
This prints:
Date
2016/01/01
2015/01/01
2014/01/01
2013/01/01
Dates are stored internally as POSIX date/time values (seconds since
January 1, 1970), but GAUSS displays them in human-readable format
automatically. Use dtYear(), dtMonth(), and
dtDayofWeek() to extract date components.
You can also specify a date format explicitly when GAUSS cannot auto-detect the pattern:
// Specify the format directly
dates = loadd(fname, "date($Date, '%Y-%m-%d')");
by#
The by keyword groups data by the values of a variable. It is
supported by descriptive statistics functions such as dstatmt()
and plotting functions such as plotScatter() and plotXY().
// Load data as a dataframe
df = loadd(getGAUSSHome("examples/auto2.dta"));
// Compute descriptive statistics for 'mpg', grouped by 'foreign'
call dstatmt(df, "mpg + by(foreign)");
In plotting functions, by creates separate series for each group:
// Load the data
df = loadd(getGAUSSHome("examples/auto2.dta"));
// Scatter plot of mpg vs weight, colored by 'foreign'
plotScatter(df, "mpg ~ weight + by(foreign)");
Data Transformations#
You can apply any GAUSS function to a variable inside a formula string. The function must accept a single column vector as input and return a column vector of the same length.
// Use the natural log of height
"weight ~ ln(height)"
// Square root transformation
"y ~ sqrt(x1) + x2"
// Lag a variable (for time series)
"y ~ lag(x)"
You may use any built-in or user-defined GAUSS procedure. Because the
formula string is not parsed until runtime, procedures that are not
referenced elsewhere in the code may not be compiled. If you get the error
"Undefined proc", add an external proc declaration:
// Declare the procedure so the compiler includes it
external proc myTransform;
call olsmt(fname, "y ~ myTransform(x1) + x2");
Model Specification#
Multiple variable regression#
For the following examples, assume a dataset containing: admit, gre, gpa, and rank.
Model |
Formula string |
|---|---|
\(\text{admit} = \alpha + \beta_1 \text{gre} + \beta_2 \text{gpa} + \beta_3 \text{rank} + \varepsilon\) |
|
\(\text{admit} = \alpha + \beta_1 \text{gre} + \beta_2 \text{gpa} + \varepsilon\) |
|
Keywords and transformations#
Model |
Formula string |
|---|---|
\(\text{admit} = \alpha + \beta_1 \ln(\text{gre}) + \beta_2 D_{\text{rank}=2} + \beta_3 D_{\text{rank}=3} + \beta_4 D_{\text{rank}=4} + \varepsilon\) Log-transform |
|
\(\text{admit} = \alpha + \beta_1 \text{gre} + \beta_2 \text{gpa} + \beta_3 (\text{gre} \times \text{gpa}) + \varepsilon\) Include main effects and their interaction. |
|
Controlling the intercept#
Estimation functions include an intercept by default. Remove it with
-1:
Model |
Formula string |
|---|---|
\(\text{admit} = \beta_1 \text{gre} + \beta_2 \text{gpa} + \beta_3 \text{rank} + \varepsilon\) |
|
\(\text{admit} = \beta_1 \text{gre} + \beta_2 \text{gpa} + \varepsilon\) |
|
Multiple dependent variables#
Some functions use additional tildes to separate groups of variables.
The first tilde always separates the primary dependent variable from the
independent variables, and a second tilde introduces a secondary
dependent variable or grouping variable. Functions that accept this
syntax include quantileFit() (endogenous variables) and
clusterSE() (cluster identifier):
// Primary dependent ~ secondary dependent ~ regressors
"y1 ~ y2 ~ x1 + x2"
Formulas without a dependent variable#
Some operations, such as computing descriptive statistics or loading data, do not have a dependent variable. In these cases, omit the tilde:
Variables to include |
Formula string |
|---|---|
All variables |
|
gre and rank only |
|
Complete Examples#
The following examples show complete, runnable code combining formula strings with common GAUSS workflows. Each uses a built-in dataset.
Example 1: OLS with formula strings#
// Load data as a dataframe
fname = getGAUSSHome("examples/credit.dat");
df = loadd(fname);
// Preview the data
head(df);
// Estimate: Limit = alpha + beta_1 * Income + beta_2 * Rating + epsilon
call olsmt(df, "Limit ~ Income + Rating");
Example 2: GLM with categorical variables#
// Load data and remove missing values
df = loadd(getGAUSSHome("examples/auto2.dta"), "mpg + weight + rep78");
df = packr(df);
// Fit a GLM: mpg depends on weight and categorical rep78
struct glmOut out;
out = glm(df, "mpg ~ weight + factor(rep78)", "normal");
Example 3: Loading and transforming data#
fname = getGAUSSHome("examples/credit.dat");
// Load selected variables with a transformation
df = loadd(fname, "Limit + ln(Income) + Rating");
// View the first few rows
head(df);
Example 4: Descriptive statistics by group#
// Load data as a dataframe
df = loadd(getGAUSSHome("examples/auto2.dta"));
// Descriptive statistics for mpg and weight, grouped by foreign
call dstatmt(df, "mpg + weight + by(foreign)");
Example 5: Scatter plot with grouping#
// Load data
df = loadd(getGAUSSHome("examples/auto2.dta"));
// Scatter plot with groups
plotScatter(df, "mpg ~ weight + by(foreign)");
Example 6: Quantile regression#
fname = getGAUSSHome("examples/credit.dat");
// Quantile regression at the median
struct qfitOut out;
out = quantileFit(fname, "Balance ~ Income + Rating", 0.5);
Rules and Tips#
Case sensitivity: Variable names in formula strings are case sensitive.
"Income"and"income"refer to different variables.Transformation requirements: Any procedure used as a data transformation must accept a single column vector and return a column vector of the same length.
Intercept defaults: Estimation functions (
olsmt(),glm(),quantileFit()) add an intercept automatically. Data functions (loadd()) and statistics functions (dstatmt()) do not.Whitespace: Spaces around operators are optional but recommended for readability.
"y~x1+x2"and"y ~ x1 + x2"are equivalent.Dataframe column names: Formula strings use the column names stored in the dataframe. Use
getHeaders()to see available names.Files without variable names: Some data files, such as GAUSS matrix files (
.fmt), do not have column names. Refer to columns by position usingX1,X2,X3, and so on:loadd("mydata.fmt", "X1 + X3").
Tip
Use head() to preview a dataframe after loading. This helps you
confirm the variable names and types before writing a formula string.
What’s Next#
Load your own data with
loadd()and explore the available column names withgetHeaders().Run a regression with
olsmt()and examine the output structure.Create grouped plots with
plotScatter()and thebykeyword.
See also
Functions loadd(), olsmt(), glm(), dstatmt(), quantileFit(), plotScatter(), plotXY(), plotBar()