Data Sampling
=============================

Sampling with replacement from a matrix or dataframe
--------------------------------------------------------

There are two ways to sample with replacement from a matrix or dataframe:

* The :func:`sampleData` procedure.
* The :func:`rndi` procedure.

The :func:`sampleData` procedure directly returns a sample from a matrix or dataframe. The final argument is an indicator for replacement and should be set to 1 to indicate sampling with replacement.

Example: Sampling with replacement from a matrix
++++++++++++++++++++++++++++++++++++++++++++++++++

::

    // Set seed for repeatable random draws
    rndseed 23423;

    // Create a 7x2 matrix
    x = { 1.2 1.8,
          2.7 2.1,
          3.0 3.3,
          4.8 4.1,
          5.1 5.4,
          6.0 2.8,
          7.2 3.9 };

    replace = 1;

    // Take a sample of 5 rows of 'x' with replacement
    sample = sampleData(x, 5, replace);

After the code above, *sample* is equal to:

::

    5.1 5.4
    3.0 3.3
    6.0 2.8
    4.8 4.1
    3.0 3.3

The row ``3.0 3.3`` appears twice because the sampling takes place with replacement.

The :func:`rndi` function returns random integers from a uniform distribution, with the option to specify a range. These integers can be used as row indices for sampling, which makes it easy to draw corresponding rows from two or more variables.

.. note:: Sampling with random indices maintains the metadata from the original dataframe, so the sample will contain variable names, types, etc.

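
To illustrate the note above, the following sketch draws rows from a dataframe by index and prints the column names of the result. It assumes the ``fueleconomy.dat`` file from the GAUSS examples directory and uses :func:`getColNames` to inspect the variable names; the sample size of 10 is arbitrary.

::

    // Load a dataframe from the GAUSS examples directory
    fname = getGAUSSHome("examples/fueleconomy.dat");
    fueleconomy = loadd(fname);

    // Draw 10 random row indices between 1 and the number of rows
    idx = rndi(10, 1, 1 | rows(fueleconomy));

    // Index the dataframe with the random indices
    fuel_s = fueleconomy[idx, .];

    // The sample should carry the same variable names as the original
    print getColNames(fuel_s);
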
Example: Sampling with replacement from multiple matrices
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

::

    // Set seed for repeatable random draws
    rndseed 73725;

    y = { 9.1,
          2.3,
          6.7,
          4.4,
          5.1 };

    X = { 8.3 8.2,
          8.8 7.9,
          2.4 1.9,
          3.9 4.2,
          8.2 9.1 };

    // Create a 5x1 vector of random
    // integers from 1 to 5
    idx = rndi(5, 1, 1|5);

    // Use 'idx' to draw corresponding rows from 'y' and 'X'
    y_s = y[idx];
    X_s = X[idx,.];

After the code above:

::

    idx = 5    y_s = 5.1    X_s = 8.2 9.1
          4          4.4          3.9 4.2
          2          2.3          8.8 7.9
          3          6.7          2.4 1.9
          5          5.1          8.2 9.1

Example: Generating indices to sample from a matrix
++++++++++++++++++++++++++++++++++++++++++++++++++++++

::

    // Load data from the 'fueleconomy' dataset
    // in the GAUSS examples directory
    file_name = getGAUSSHome("examples/fueleconomy.dat");
    fueleconomy = loadd(file_name);

    // Create a 100x1 vector of random integers between
    // 1 and the number of rows in 'fueleconomy'
    range_start = 1;
    range_end = rows(fueleconomy);
    idx = rndi(100, 1, range_start | range_end);

    // Draw a 100 observation sample from 'fueleconomy'
    fuel_sample = fueleconomy[idx, .];

Sampling without replacement from a matrix
--------------------------------------------

The :func:`sampleData` procedure can also be used to sample from a matrix or dataframe without replacement. In this case, the final argument should be set to 0 to indicate sampling without replacement.

Example: Sampling without replacement
+++++++++++++++++++++++++++++++++++++++++

::

    // Set seed for repeatable random draws
    rndseed 23423;

    // Create a 7x1 vector
    x = { 1, 2, 3, 4, 5, 6, 7 };

    replace = 0;

    // Take a sample of 3 elements without replacement
    s = sampleData(x, 3, replace);

.. note:: Set the seed with :func:`rndseed` before calling :func:`sampleData` if you want to reproduce the same sample on each run.

Drawing a random sample from a dataset
------------------------------------------

The :func:`exctSmpl` procedure draws a sample with replacement from an existing data file and saves the result as a new data file.
Neither the source data file nor the new sample is stored in the GAUSS workspace.

The :func:`exctSmpl` procedure returns the number of rows in the new data file or an error code. Details of specific error codes are available in the Command Reference listing for :func:`exctSmpl`.

Example: Sample from data file
+++++++++++++++++++++++++++++++++++++++++++

::

    // Create file name with full path
    fname = getGAUSSHome("examples/credit.dat");

    // Randomly sample 30% of the rows from 'credit.dat'
    // and write them to a new dataset in the
    // GAUSS working directory, named 'sample.dat'
    n_rows = exctsmpl(fname, "sample.dat", 30);

After the code above,

::

    n_rows = 120
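
Because the sample exists only on disk, it has to be loaded before it can be used in the workspace. A minimal sketch, assuming the previous example has already run and that ``sample.dat`` was written to the current working directory:

::

    // Load the sampled dataset written by exctsmpl
    sample = loadd("sample.dat");

    // The row count should match the value returned by exctsmpl
    print rows(sample);
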