splitData

Purpose

Returns test and training splits for a single matrix or dataframe.

Format

{ X_train, X_test } = splitData(X, train_pct[, shuffle])
Parameters:
  • X (Nx1 vector, or NxP matrix.) – The matrix to split.

  • train_pct (Scalar) – The percentage of observations to include in the training set.

  • shuffle (String) – Optional input, “True” (default) or “False”.

Returns:
  • X_train – (train_pct * N) x P matrix of independent variables.

  • X_test – The remaining observations from the original X which were not selected to be in the training set.

Examples

library gml;

// Set seed for repeatable sampling
rndseed 23324;

X = { 1   3,
      9   6,
      6   1,
      8   4,
      9   5,
      1   8 };

// Shuffle data and create training set with 2/3 of
// the observations and 1/3 for the test set
{ X_train, X_test } = splitData(X, 0.67);

After the above code:

X_train = 9    5
          1    3
          8    4
          1    8

X_test =  9    6
          6    1

Sometimes, for example with time series data, you may not want the data to be shuffled before splitting. Setting the shuffle input to “False” will keep the data in the original order.

library gml;

X = { 1   3,
      9   6,
      6   1,
      8   4,
      9   5,
      1   8 };

// Ceate training set in the original order with 2/3 of
// the observations and 1/3 for the test set
{ X_train, X_test } = splitData(X, 0.67, "False");

After the above code:

X_train = 1    3
          9    6
          6    1
          8    4

X_test =  9    5
          1    8

Remarks

If shuffle is enabled, the observations (rows) of X are kept together. For repeatable shuffling, use the rndseed keyword before calling splitData().