dstatmt#
Purpose#
Compute descriptive statistics.
Format#
- dout = dstatmt(data[, vars, ctl])#
- Parameters:
data (string or dataframe) – A dataframe or the name of dataset. If data is an empty string or 0, vars will be assumed to be a matrix containing the data.
vars (String or string array) –
Optional, the variables.
If data contains a dataframe or the name of a dataset, vars will be interpreted as either:
A Kx1 character vector containing the names of variables.
A Kx1 numeric vector containing indices of variables.
A
formula string
. e.g."PAY + WT"
or". - sex"
e.g"X1 + by(X2)", "by(X2)"
specifies that the data should be separated into different tables based on the groups defined byX2
.
These can be any size subset of the variables in the dataset and can be in any order. Dafault = all columns of the dataset.
If data is an empty string or 0, vars will be interpreted as an NxK matrix, the data on which to compute the descriptive statistics.
ctl (Struct) –
An optional
dstatmtControl
structure containing the following members:ctl.altnames
Kx1 string array of alternate variable names to be used if a matrix in memory is analyzed (i.e., dataset is a null string or 0). Default = “”.
ctl.maxbytes
Scalar, the maximum number of bytes to be read per iteration of the read loop. Default = 1e9.
ctl.vartype
Scalar, unused in dstatmt.
ctl.miss
Scalar, default 0.
- 0:
there are no missing values (fastest).
- 1:
listwise deletion, drop a row if any missings occur in it.
- 2:
pairwise deletion.
ctl.row
Scalar, the number of rows to read per iteration of the read loop.If 0, (default) the number of rows will be calculated using ctl.maxbytes and maxvec.
ctl.output
Scalar, controls output, default 1.
- 1:
print output table.
- 0:
do not print output.
- Returns:
dout (Struct) –
Instance of
dstatmtOut
structure containing the following members:dout.vnames
Kx1 string array, the names of the variables used in the statistics.
dout.mean
Kx1 vector, means.
dout.var
Kx1 vector, variance.
dout.std
Kx1 vector, standard deviation.
dout.min
Kx1 vector, minima.
dout.max
Kx1 vector, maxima.
dout.valid
Kx1 vector, the number of valid cases.
dout.missing
Kx1 vector, the number of missing cases.
dout.errcode
Scalar, error code, 0 if successful; otherwise, one of the following:
- 2:
Can’t open file.
- 7:
Too many missings - no data left after packing.
- 9:
altnames member of
dstatmtControl
structure wrong size.- 10:
vartype member of
dstatmtControl
structure wrong size.
Examples#
Computing statistics on a GAUSS dataset#
// Create file name with full path
file_name = getGAUSSHome("examples/fueleconomy.dat");
/*
** Compute statistics for all variables in the dataset
** The 'call' keyword disregards return values from the function
*/
call dstatmt(file_name);
The above example will print the following report to the Command window:
----------------------------------------------------------------------------------------
Variable Mean Std Dev Variance Minimum Maximum Valid Missing
----------------------------------------------------------------------------------------
annual_fuel_cost 2.537 0.6533 0.4267 1.05 5.70 978 0
engine_displacement 3.233 1.376 1.892 1.00 8.40 978 0
The code below uses the second input, vars, to compute only the descriptive statistics for the second variable.
// Create file name with full path
file_name = getGAUSSHome("examples/fueleconomy.dat");
// Only calculate statistics on the second variable
vars = 2;
// Compute statistics for only the second variable in the dataset
call dstatmt(file_name, vars);
The following report is printed to the Command window.
----------------------------------------------------------------------------------------
Variable Mean Std Dev Variance Minimum Maximum Valid Missing
----------------------------------------------------------------------------------------
engine_displacement 3.233 1.376 1.892 1 8.4 978 0
Computing statistics on a csv dataset with formula string#
// Create file name with full path
file_name = getGAUSSHome("examples/binary.csv");
// Set up a formula string with variables "gre" and "gpa"
vars = "gre + gpa";
/*
** Compute statistics for all variables in the dataset
** The 'call' keyword disregards return values from the function
*/
call dstatmt(file_name, vars);
The above example will print the following report to the Command window:
--------------------------------------------------------------------------------
Variable Mean Std Dev Variance Minimum Maximum Valid Missing
--------------------------------------------------------------------------------
gre 587.7 115.5 1334e+04 220 800 400 0
gpa 3.39 0.3806 0.1448 2.26 4 400 0
Computing statistics by groups#
The code below uses the "by"
keyword to compute the descriptive statistics for mpg and headroom by the groups defined by foreign.
/*
** Perform import
*/
auto2 = loadd(getGAUSShome("examples/auto2.dta"));
// Specify formula to
// compute descriptive statistics on mpg
// based on domestic/foreign status
formula = "headroom + mpg + by(foreign)";
// Print statistics table
call dstatmt(auto2, formula);
=========================================================================================
foreign: Domestic
-----------------------------------------------------------------------------------------
Variable Mean Std Dev Variance Minimum Maximum Valid Missing
-----------------------------------------------------------------------------------------
headroom 3.154 0.9158 0.8386 1.5 5 52 0
mpg 19.83 4.743 22.5 12 34 52 0
=========================================================================================
foreign: Foreign
-----------------------------------------------------------------------------------------
Variable Mean Std Dev Variance Minimum Maximum Valid Missing
-----------------------------------------------------------------------------------------
headroom 2.614 0.4863 0.2365 1.5 3.5 22 0
mpg 24.77 6.611 43.71 14 41 22 0
Using control and output structures#
// Create file name with full path
file_name = getGAUSSHome("examples/credit.dat");
// Declare control structure and fill in with defaults
struct dstatmtControl dctl;
dctl = dstatmtControlCreate();
// Do not print output to the screen
dctl.output = 0;
// Declare output structure
struct dstatmtOut dout;
// Calculate statistics on the 1st, 3rd and 6th variables
vars = { 1, 3, 6 };
// Calculate statistics, and place output in 'dout'
dout = dstatmt(file_name, vars, dctl);
// Print calculated means and variable names
print dout.mean;
print dout.vnames;
The code above should print the following output:
45.218885
354.94000
13.450000
Income
Rating
Education
Computing statistics on a matrix#
// Set random number seed for repeatable random numbers
rndseed 32452;
// Create a random matrix on which to compute statistics
X = rndn(10, 3);
/*
** The empty string as the second input tells GAUSS to
** compute statistics on a matrix rather than a dataset
*/
call dstatmt("", X);
The code above will print out the following report:
-------------------------------------------------------------------------------
Variable Mean Std Dev Variance Minimum Maximum Valid Missing
-------------------------------------------------------------------------------
X1 0.2348 0.8164 0.6664 -1.0736 1.46 10 0
X2 -0.5062 1.126 1.267 -2.223 1.269 10 0
X3 0.5011 0.7758 0.6018 -0.6119 1.823 10 0
Computing statistics on a matrix, using structures#
// Set random number seed for repeatable random numbers
rndseed 32452;
// Declare control structure and fill with default values
struct dstatmtControl dctl;
dctl = dstatmtControlCreate();
// Variable names for printed output
dctl.altnames = "Alpha"$|"Beta"$|"Gamma";
// Declare structure to hold output values
struct dstatmtOut dout;
// Create a random matrix on which to compute statistics
X = rndn(10, 3);
/*
** The empty string as the second input tells GAUSS to
** compute statistics on a matrix rather than a dataset
*/
dout = dstatmt("", X, dctl);
This time, the following output will be printed to the screen:
------------------------------------------------------------------------------
Variable Mean Std Dev Variance Minimum Maximum Valid Missing
------------------------------------------------------------------------------
Alpha 0.2348 0.8164 0.6664 -1.074 1.46 10 0
Beta -0.5062 1.1256 1.267 -2.223 1.269 10 0
Gamma 0.5011 0.7758 0.6018 -0.6119 1.823 10 0
Remarks#
If pairwise deletion is used, the minima and maxima will be the true values for the valid data. The means and standard deviations will be computed using the correct number of valid observations for each variable.
For backwards compatiblitity, the following format is still supported:
dout = dstatmt(dctl, dataset, vars);
However, all new code should use one of the formats listed at the top of this document.
The supported dataset types are
CSV
,XLS
,XLSX
,HDF5
,FMT
,DAT
,DTA
For
HDF5
files, the dataset must include afile schema
and both file name and dataset name must be provided, e.g.dstatmt("h5://testdata.h5/mydata")
.
Source#
dstatmt.src
See also
Functions dstatmtControlCreate()
, formula string