dstatmt#

Purpose#

Compute descriptive statistics.

Format#

dout = dstatmt(data[, vars, ctl])
Parameters:
  • data (string or dataframe) – A dataframe or the name of a dataset. If data is an empty string or 0, vars will be assumed to be a matrix containing the data.

  • vars (String or string array) –

    Optional, the variables.

    If data contains a dataframe or the name of a dataset, vars will be interpreted as either:

    • A Kx1 character vector containing the names of variables.

    • A Kx1 numeric vector containing indices of variables.

    • A formula string, e.g. "PAY + WT" or ". - sex". A by() term, as in "X1 + by(X2)", specifies that the data should be separated into different tables based on the groups defined by X2.

    These can be any size subset of the variables in the dataset and can be in any order. Default = all columns of the dataset.

    If data is an empty string or 0, vars will be interpreted as an NxK matrix, the data on which to compute the descriptive statistics.
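    As a brief sketch of the name-based form, the snippet below selects two variables by name with a string array. It assumes that the credit.dat example dataset contains variables named Income and Education (the names are taken from the output of the control structure example on this page) and that a string array of names is accepted in the same way as a character vector.

```gauss
// Create file name with full path
file_name = getGAUSSHome("examples/credit.dat");

// Select variables by name with a string array
// (assumes the dataset contains 'Income' and 'Education')
vars = "Income" $| "Education";

// Print descriptive statistics for the two named variables
call dstatmt(file_name, vars);
```

    The names may be listed in any order; the equivalent index form for this dataset would be vars = { 1, 6 }.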

  • ctl (Struct) –

    An optional dstatmtControl structure containing the following members:

    ctl.altnames

    Kx1 string array of alternate variable names to be used if a matrix in memory is analyzed (i.e., data is an empty string or 0). Default = "".

    ctl.maxbytes

    Scalar, the maximum number of bytes to be read per iteration of the read loop. Default = 1e9.

    ctl.vartype

    Scalar, unused in dstatmt.

    ctl.miss

    Scalar, default 0.

    0:

    there are no missing values (fastest).

    1:

    listwise deletion, drop a row if any missings occur in it.

    2:

    pairwise deletion.

    ctl.row

    Scalar, the number of rows to read per iteration of the read loop. If 0 (default), the number of rows will be calculated using ctl.maxbytes and maxvec.

    ctl.output

    Scalar, controls output, default 1.

    1:

    print output table.

    0:

    do not print output.
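    The members above can be combined as in the following minimal sketch, which requests pairwise deletion on a matrix containing a missing value. The data matrix and the sentinel value 999 are hypothetical and chosen only for illustration; miss() is used to convert the sentinel to the GAUSS missing value code.

```gauss
// Hypothetical 5x2 data matrix; 999 marks a missing observation
X = { 1 10,
      2 999,
      3 30,
      4 40,
      5 50 };

// Convert the sentinel value to the GAUSS missing value code
X = miss(X, 999);

// Declare control structure and fill in with defaults
struct dstatmtControl dctl;
dctl = dstatmtControlCreate();

// Use pairwise deletion and suppress the printed table
dctl.miss = 2;
dctl.output = 0;

// Declare output structure and compute the statistics
struct dstatmtOut dout;
dout = dstatmt("", X, dctl);

// Number of valid and missing cases for each column
print dout.valid;
print dout.missing;
```

    Leaving dctl.miss at its default of 0 is only appropriate when the data are known to contain no missing values.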

Returns:

dout (Struct) –

Instance of dstatmtOut structure containing the following members:

dout.vnames

Kx1 string array, the names of the variables used in the statistics.

dout.mean

Kx1 vector, means.

dout.var

Kx1 vector, variance.

dout.std

Kx1 vector, standard deviation.

dout.min

Kx1 vector, minima.

dout.max

Kx1 vector, maxima.

dout.valid

Kx1 vector, the number of valid cases.

dout.missing

Kx1 vector, the number of missing cases.

dout.errcode

Scalar, error code, 0 if successful; otherwise, one of the following:

2:

Can’t open file.

7:

Too many missings - no data left after packing.

9:

altnames member of dstatmtControl structure wrong size.

10:

vartype member of dstatmtControl structure wrong size.
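
A minimal sketch of checking dout.errcode before using the other members is shown below. The random input matrix is only a stand-in for real data, and the message text is illustrative.

```gauss
// Set random number seed for repeatable random numbers
rndseed 32452;

// Stand-in data: a random 10x3 matrix
X = rndn(10, 3);

// Declare control structure, fill with defaults, suppress printing
struct dstatmtControl dctl;
dctl = dstatmtControlCreate();
dctl.output = 0;

// Declare output structure and compute the statistics
struct dstatmtOut dout;
dout = dstatmt("", X, dctl);

// Only use the results if the call succeeded
if dout.errcode ne 0;
    print "dstatmt returned error code:" dout.errcode;
else;
    // Means in the first column, standard deviations in the second
    print dout.mean ~ dout.std;
endif;
```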

Examples#

Computing statistics on a GAUSS dataset#

// Create file name with full path
file_name = getGAUSSHome("examples/fueleconomy.dat");

/*
** Compute statistics for all variables in the dataset
** The 'call' keyword disregards return values from the function
*/
call dstatmt(file_name);

The above example will print the following report to the Command window:

----------------------------------------------------------------------------------------
Variable               Mean     Std Dev    Variance   Minimum   Maximum  Valid   Missing
----------------------------------------------------------------------------------------

annual_fuel_cost      2.537     0.6533      0.4267     1.05      5.70     978        0
engine_displacement   3.233      1.376       1.892     1.00      8.40     978        0

The code below uses the second input, vars, to compute only the descriptive statistics for the second variable.

// Create file name with full path
file_name = getGAUSSHome("examples/fueleconomy.dat");

// Only calculate statistics on the second variable
vars = 2;

// Compute statistics for only the second variable in the dataset
call dstatmt(file_name, vars);

The following report is printed to the Command window.

----------------------------------------------------------------------------------------
Variable                Mean    Std Dev   Variance   Minimum   Maximum   Valid   Missing
----------------------------------------------------------------------------------------
engine_displacement    3.233      1.376     1.892          1       8.4     978         0

Computing statistics on a csv dataset with formula string#

// Create file name with full path
file_name = getGAUSSHome("examples/binary.csv");

// Set up a formula string with variables "gre" and "gpa"
vars = "gre + gpa";

/*
** Compute statistics for the variables "gre" and "gpa" only
** The 'call' keyword disregards return values from the function
*/
call dstatmt(file_name, vars);

The above example will print the following report to the Command window:

--------------------------------------------------------------------------------
Variable     Mean   Std Dev    Variance    Minimum     Maximum   Valid   Missing
--------------------------------------------------------------------------------

gre         587.7     115.5   1.334e+04        220        800     400      0
gpa          3.39    0.3806      0.1448       2.26          4     400      0

Computing statistics by groups#

The code below uses the "by" keyword to compute the descriptive statistics for mpg and headroom by the groups defined by foreign.

/*
** Perform import
*/
auto2 = loadd(getGAUSSHome("examples/auto2.dta"));

// Specify formula to compute descriptive
// statistics on headroom and mpg,
// grouped by domestic/foreign status
formula = "headroom + mpg + by(foreign)";

// Print statistics table
call dstatmt(auto2, formula);

The code above will print the following report:

=========================================================================================
foreign: Domestic
-----------------------------------------------------------------------------------------
Variable         Mean     Std Dev      Variance     Minimum     Maximum     Valid Missing
-----------------------------------------------------------------------------------------

headroom        3.154      0.9158        0.8386         1.5           5        52    0
mpg             19.83       4.743          22.5          12          34        52    0
=========================================================================================
foreign: Foreign
-----------------------------------------------------------------------------------------
Variable         Mean     Std Dev      Variance     Minimum     Maximum     Valid Missing
-----------------------------------------------------------------------------------------

headroom        2.614      0.4863        0.2365         1.5         3.5        22    0
mpg             24.77       6.611         43.71          14          41        22    0

Using control and output structures#

// Create file name with full path
file_name = getGAUSSHome("examples/credit.dat");

// Declare control structure and fill in with defaults
struct dstatmtControl dctl;
dctl = dstatmtControlCreate();

// Do not print output to the screen
dctl.output = 0;

// Declare output structure
struct dstatmtOut dout;

// Calculate statistics on the 1st, 3rd and 6th variables
vars = { 1, 3, 6 };

// Calculate statistics, and place output in 'dout'
dout = dstatmt(file_name, vars, dctl);

// Print calculated means and variable names
print dout.mean;
print dout.vnames;

The code above should print the following output:

45.218885
354.94000
13.450000

   Income
   Rating
Education

Computing statistics on a matrix#

// Set random number seed for repeatable random numbers
rndseed 32452;

// Create a random matrix on which to compute statistics
X = rndn(10, 3);

/*
** The empty string as the first input tells GAUSS to
** compute statistics on a matrix rather than a dataset
*/
call dstatmt("", X);

The code above will print out the following report:

-------------------------------------------------------------------------------
Variable    Mean    Std Dev     Variance     Minimum    Maximum  Valid  Missing
-------------------------------------------------------------------------------

X1        0.2348     0.8164       0.6664      -1.074      1.46     10       0
X2       -0.5062      1.126        1.267      -2.223      1.269    10       0
X3        0.5011     0.7758       0.6018     -0.6119      1.823    10       0

Computing statistics on a matrix, using structures#

// Set random number seed for repeatable random numbers
rndseed 32452;

// Declare control structure and fill with default values
struct dstatmtControl dctl;
dctl = dstatmtControlCreate();

// Variable names for printed output
dctl.altnames = "Alpha"$|"Beta"$|"Gamma";

// Declare structure to hold output values
struct dstatmtOut dout;

// Create a random matrix on which to compute statistics
X = rndn(10, 3);

/*
** The empty string as the first input tells GAUSS to
** compute statistics on a matrix rather than a dataset
*/
dout = dstatmt("", X, dctl);

This time, the following output will be printed to the screen:

------------------------------------------------------------------------------
Variable     Mean    Std Dev    Variance    Minimum    Maximum  Valid  Missing
------------------------------------------------------------------------------

Alpha      0.2348     0.8164      0.6664     -1.074      1.46      10       0
Beta      -0.5062      1.126       1.267     -2.223     1.269      10       0
Gamma      0.5011     0.7758      0.6018    -0.6119     1.823      10       0

Remarks#

  1. If pairwise deletion is used, the minima and maxima will be the true values for the valid data. The means and standard deviations will be computed using the correct number of valid observations for each variable.

  2. For backwards compatibility, the following format is still supported:

    dout = dstatmt(dctl, dataset, vars);
    

    However, all new code should use one of the formats listed at the top of this document.

  3. The supported dataset types are CSV, XLS, XLSX, HDF5, FMT, DAT and DTA.

  4. For HDF5 files, data must include the file schema (h5://), the file name and the dataset name, e.g. dstatmt("h5://testdata.h5/mydata").

Source#

dstatmt.src

See also

Functions dstatmtControlCreate(), formula string