aggregate

Purpose

Aggregates the data in the columns of a matrix based upon a column containing group ids with a choice of method.

Format

x_agg = aggregate(x, method[, id_cols, skip_miss_check])
Parameters:
  • x (NxK matrix or dataframe) – Data, if id_cols is not specified, the first column must contain the ids for the groups on which to aggregate.
  • method (String) –

    Specifies which aggregation method to use.

    Valid options:

    ”mean”
    ”median”
    ”mode”
    ”min”
    ”max”
    ”sd” (sample standard deviation)
    ”sum”
    ”variance” (sample variance)
  • id_cols (string or string array) – Optional, specifies which variable(s) contain the groups on which to aggregate. To aggregate by multiple columns, use the $| operator to concatenate column names (e.g., "day" $| "time").
  • skip_miss_check (scalar) – Optional. Default: 0. Set to 1 to skip checking for missing values (faster but missings may affect results). When 0, missing values are handled per-column.
Returns:

x_agg (NGROUPSxK matrix) – The input aggregated by the group id, using the specified method.

Examples

Example 1

This example aggregates a matrix with one group id column and one column of data by mean and then by minimum.

// Create a matrix where the first
// column is the group id
X = { 1002  7,
      1001  2,
      1004  9,
      1001  8,
      1004  6,
      1003  3,
      1002  5,
      1001  4 };

agg_mean = aggregate(X, "mean");

agg_min = aggregate(X, "min");

The above code will make the following assignments:

           1001   4.66667
agg_mean = 1002         6
           1003         3
           1004       7.5

           1001         2
agg_min  = 1002         5
           1003         3
           1004         6

Example 2

This example aggregates the data from a matrix with one group id column and two data columns first by sample standard deviation and then by variance.

// Create a matrix where the first
// column is the group id
X = { 1002   18  -5.1,
      1001   22   0.0,
      1001   47   3.3,
      1001   94   5.6,
      1001   17  -0.5,
      1001   72   7.5,
      1002   89   4.8,
      1001   67   2.3,
      1002   54   6.6,
      1002   61  -6.8,
      1002    7   1.3,
      1002   40   -2.1 };

// aggregate by standard deviation
agg_sd = aggregate(X, "sd");

agg_var = aggregate(X, "variance");

The above code will make the following assignments:

agg_sd  = 1001    30.10     3.13
          1002    29.90     5.38

agg_var = 1001   906.17     9.77
          1002   894.17    28.93

Example 3

This example specifies the column name to be used for aggregation.

// Load data
auto2 = loadd(getGAUSSHome("examples/auto2.dta"));

// Aggregate data using
// foreign column as group
aggregate(auto2[., "price" "mpg" "foreign"], "mean", "foreign");

The above code will make the following table

foreign     price      mpg
Domestic  6072.423   19.827
Foreign   6384.682   24.773

Example 4

This example aggregates by multiple group columns, finding the maximum value for each combination of day and time.

// Load data
tips = loadd(getGAUSSHome("examples/tips2.dta"), "day + time + total_bill + tip");

// View first few rows
head(tips);

The above code will print:

day             time       total_bill              tip
Sun           Dinner        16.990000        1.0100000
Sun           Dinner        10.340000        1.6600000
Sun           Dinner        21.010000        3.5000000
Sun           Dinner        23.680000        3.3100000
Sun           Dinner        24.590000        3.6100000
// Aggregate by day and time, finding max values
tips_a = aggregate(tips, "max", "day" $| "time");
print tips_a;

The above code will print:

 day             time       total_bill              tip
Thur            Lunch        43.110000        6.7000000
Thur           Dinner        18.780000        3.0000000
 Fri            Lunch        16.270000        3.4800000
 Fri           Dinner        40.170000        4.7300000
 Sat           Dinner        50.810000        10.000000
 Sun           Dinner        48.170000        6.5000000

See also

Functions meanc(), modec(), selif()