Data Cleaning ================== Interactive Data Cleaning ----------------------------- Interactive data cleaning can be performed in the **Data Import** window before import, or in a GAUSS **Symbol Editor** after it is loaded. This section will show how to clean data using the **Data Management** pane of a **Symbol Editor**. Most actions will be the same in the **Data Import** window. See `Interactive Data Import <./interactive-import.html>`_ The Data Management pane +++++++++++++++++++++++++++ .. figure:: ../_static/images/data-management-pane.jpg :scale: 50% The **Data Management** pane contains: * The **Filter** tab that allows you to select observations based on a variety of criteria. * The **Variables** tab that allows you to: * Select or remove variables. * Rename variables. * Change variable types. * Manage category labels and order. * Change date display formats. Open the Data Management pane +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ .. figure:: ../_static/images/data-cleaning-open-symbol-editor-filter.jpg To open the **Data Management** pane for an in-memory dataframe: 1. Double-click the name of the dataframe in the **Symbols** window on the **Data** page. 2. Click the **Manage** button with the cog icon on the top right of the open **Symbol Editor** window. Missing values ++++++++++++++++++++ Missing values are represented by a ``.`` for data loaded into GAUSS. Remove observations with missing values interactively ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. figure:: ../_static/images/data-cleaning-filter-missings.jpg :scale: 50% 1. Select the variable to filter on from the **Variable** name drop-down list on the **Filter** tab. 2. Select **Is Not Missing** from the **Operation** drop-down list. 3. Click the ``+`` button to add the filter. All observations where the selected variable contains a missing value will be grayed out in the **Data Preview** window, indicating which observations will be imported. You can click **Apply** or continue to create more filters. Data organization ++++++++++++++++++++ Changing variable names ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. figure:: ../_static/images/data-organization-rename-variable.jpg :scale: 50% 1. Double-click the dataframe you want to modify in the **Symbols** pane of the **Data** page. 2. Click the **Manage** button at the top right of the open **Symbol Editor**. 3. Click downward pointing triangle button to the right of the name of the variable name you want to change and select **Rename**. 4. Enter the new name in the **Name** text box. These changes will not be made until you click **Apply**. Deleting columns from a matrix ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Clear the check box next to the name of the variables you want to remove from the data. These changes will not be made until you click **Apply**. Filtering observations of a dataframe ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ The **Data Management** pane provides the following options for filtering dataframes. +--------------------+------------------+--------------------+------------------+ |Data type |Filter options |Data type |Filter options | +====================+==================+====================+==================+ |Numeric and Date | |String and Category | | +--------------------+------------------+--------------------+------------------+ | |= | |Equals | +--------------------+------------------+--------------------+------------------+ | |!= | |Not Equals | +--------------------+------------------+--------------------+------------------+ | |< | |Contains | +--------------------+------------------+--------------------+------------------+ | |<= | |Does not Contain | +--------------------+------------------+--------------------+------------------+ | |> | |Is Empty | +--------------------+------------------+--------------------+------------------+ | |>= | |Is not Empty | +--------------------+------------------+--------------------+------------------+ | |Is Missing | |Is Missing | +--------------------+------------------+--------------------+------------------+ | |Is Not Missing | |Is Not Missing | +--------------------+------------------+--------------------+------------------+ To implement any of these filtering options: 1. Select the variable to filter on from the **Variable** name drop-down list on the **Filter** tab. 2. Select the desired operation from the **Operation** drop-down list. 3. Depending on the operation, either enter or select a value in the **Value** combo box. 4. Click the ``+`` button to add the filter. 5. Either **Apply** your changes or add another filter. Filter based on partial string match ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. figure:: ../_static/images/data-cleaning-filter-contains-buick.jpg Filter based numeric value ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. figure:: ../_static/images/data-cleaning-filter-inequality-mpg.jpg Apply changes ++++++++++++++++++++++++ The **Apply** button at the bottom of the **Data Management** pane allows you to apply the variable modifications and filters created. .. figure:: ../_static/images/interactive-data-cleaning-apply.jpg :scale: 50% To modify the current dataframe, either click **Apply** or click the drop-down and select **Overwrite Existing**. To create a new dataframe containing your changes, click the drop-down next to the **Apply** button and select **Create New**. A text box will appear allowing you to enter the name of the new dataframe. Data types and formats +++++++++++++++++++++++++++++++++++++++++++++ The GAUSS dataframe supports four different data types: * String. * Numeric. * Category. * Date. The **Data Management** pane supports type changing, as well as property management for each type. Changing variable type ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ To change a variable type select the desired type from the **Type** drop-down list on the **Variables** tab. .. figure:: ../_static/images/interactive-data-cleaning-change-type.jpg :scale: 50% If further type-specific properties are required, a properties dialog will automatically open. Changing categorical mappings ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Click the drop-down button to the right of the variable name and select **Properties** to open the **Modify Column Mapping** dialog. .. figure:: ../_static/images/interactive-data-cleaning-variable-properties.jpg :scale: 50% **Change a category label** by double-clicking in the **Renamed Label** textbox next to the category label you want to change, then enter the new label name. .. figure:: ../_static/images/interactive-data-cleaning-modify-column-mapping.jpg :scale: 50% **Specify a category to be the base case** by selecting the **Label** of the category you want to be the new base case then click the double up-pointing arrow button to move the selected category to the base case. The **Category Count** will be listed in the top right of the **Modify Column Mapping** dialog. Specifying date formats ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ If GAUSS does not automatically detect your date format, you will be asked to manually specify a date format using the **Specify Date Format** dialog. .. figure:: ../_static/images/data-import-date-specify-dialog.jpg :scale: 50% Build a format string in the **Date format** box, using the BSD strftime specifiers, that represents your data. If your data looked like this ``03/12/2017``, the correct format string would be ``%m/%d/%Y``. The table below explains this. +-----------------+---------------------------+---------+----------------------+ |Original Contents|Description |Type |Format string contents| +=================+===========================+=========+======================+ |03 |A two digit month. |Date |``%m`` | +-----------------+---------------------------+---------+----------------------+ |/ |A forward slash. |Literal |/ | +-----------------+---------------------------+---------+----------------------+ |12 |A two digit day. |Date |``%d`` | +-----------------+---------------------------+---------+----------------------+ |/ |A forward slash. |Literal |/ | +-----------------+---------------------------+---------+----------------------+ |2017 |A four digit year. |Date |``%Y`` | +-----------------+---------------------------+---------+----------------------+ The **Format Options** section of this dialog contains the BSD strftime specifiers for reference. Use the **Filter** drop-down to filter the reference options shown. Further Reading ++++++++++++++++ 1. `Preparing and Cleaning FRED data in GAUSS `_ 2. `Quick and Easy Interactive Data Cleaning `_ 3. `Easy Management of Categorical Variables `_ Programmatic Data Cleaning ------------------------------ Missing value handling ++++++++++++++++++++++++++++++ Counting missing variables ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ The procedure :func:`dstatmt` counts missing values by variable name as part of the descriptive statistics report. It requires only a single input indicating the source of data. The input may be either the file name of a dataset or the name of a matrix or dataframe currently in the workspace. :: // Create file name with full path dataset = getGAUSSHome("examples/auto2.dta"); // Compute descriptive statistics and print report // of a dataset stored on disk call dstatmt(dataset); :: ------------------------------------------------------------------------------------- Variable Mean Std Dev Variance Minimum Maximum Valid Missing ------------------------------------------------------------------------------------- make ----- ----- ----- ----- ----- 74 0 price 6165 2949 8.7e+06 3291 1.591e+04 74 0 mpg 21.3 5.786 33.47 12 41 74 0 rep78 ----- ----- ----- Poor Excellent 69 5 headroom 2.993 0.846 0.7157 1.5 5 74 0 trunk 13.76 4.277 18.3 5 23 74 0 weight 3019 777.2 6.04e+05 1760 4840 74 0 length 187.9 22.27 495.8 142 233 74 0 turn 39.65 4.399 19.35 31 51 74 0 displacement 197.3 91.84 8434 79 425 74 0 gear_ratio 3.015 0.4563 0.2082 2.19 3.89 74 0 foreign ----- ----- ----- Domestic Foreign 74 0 A second optional input allows you to specify which columns to use. :: // Create file name with full path dataset = getGAUSSHome("examples/auto2.dta"); // Load data from the file auto = loadd(dataset); // Compute descriptive statistics and print report // of specific variables from a dataframe call dstatmt(auto, "price + mpg + rep78"); :: ------------------------------------------------------------------------------------- Variable Mean Std Dev Variance Minimum Maximum Valid Missing ------------------------------------------------------------------------------------- price 6165 2949 8.7e+06 3291 1.591e+04 74 0 mpg 21.3 5.786 33.47 12 41 74 0 rep78 ----- ----- ----- Poor Excellent 69 5 You can count the number of missing values in a vector using :func:`counts`. :: // Create a column vector with 2 missing values x = { 1, ., 3, ., 5, 6 }; // Create a missing value m = miss(); // Count the number of missing values in the vector n = counts(x, m); After running the above code, *n* will be equal to 2. Checking for missing values ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ You can check to see if a matrix or dataframe contains any missing values with :func:`ismiss`. :: // Create one vector with a // missing value and one without a = { 1, 2, 3 }; b = { 4, ., 5 }; // Check whether the vectors contain missing values ret_a = ismiss(a); ret_b = ismiss(b); After the code above, *ret_a* will equal 0, but *ret_b* will equal 1. To find which observations contain missing values, you can use :func:`rowcontains`, :func:`indexcat`, or the dot equality operator ``.==``. First we will load some data and then show these options. :: // Create file name with full path dataset = getGAUSSHome("examples/auto2.dta"); // Load 3 variables auto = loadd(dataset, "mpg + price + rep78"); // Select the first 8 rows auto = head(auto, 8); After the above code, *auto* will equal: :: mpg price rep78 22 4099 Average 17 4749 Average 22 3799 . 20 4816 Average 15 7827 Good 18 5788 Average 26 4453 . 20 5189 Average :func:`indexcat` can tell us the row indices of a column that contains missing values. :: // Create a missing value m = miss(); // Find the indices of the rows with missing values idx = indexcat(auto[.,"rep78"], m); *idx* will now equal: :: 3 7 :func:`rowcontains` will return a binary vector with a 1 for each row where any element contains a missing value. Continuing with our data from above: :: // Return a binary vector with a 1 for // rows that contain a missing value mask = rowContains(auto, miss()); *mask* will equal: :: 0 0 1 0 0 0 1 0 The dot equality operator, ``.==`` will return a binary matrix with a 1 for any element that contains a missing value. Again we will use the data loaded earlier. :: // Return a binary matrix with a 1 // for any element that is a missing value mask = auto .== miss(); *mask* will equal: :: 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 Removing missing values ^^^^^^^^^^^^^^^^^^^^^^^^ There are two options for removing missing values from a matrix: * :func:`packr` removes all rows from a matrix that contain any missing values. * :func:`delif` removes all rows which meet a particular condition. :: a = { 1 ., . 4, 5 6 }; // Remove all rows with a missing value print packr(a); will return: :: 5 6 whereas: :: a = { 1 ., . 4, 5 6 }; m = { . }; // Remove all rows with a missing value // in the second column print delif(a, a[., 2] .== m ); will only delete rows with a missing value in the second column. :: . 4 5 6 Replacing missing values ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ GAUSS has two functions that can be used to replace missing values: * The :func:`missrv` function. * The :func:`impute` function. The :func:`missrv` function replaces all missing values in a matrix with a user-specified value(s). Unique replacement values can be specified for each column. :: a = { 1 ., . 4, 5 6 }; // Replace all missing values with -999 print missrv(a, -999); :: 1 -999 -999 4 5 6 The :func:`impute` procedure replaces missing values in the columns of a matrix using a specified imputation method. The procedure offers eight potential methods for imputation: * ``"bfill"`` - replaces missing values with the next valid observation (backward fill). * ``"ffill"`` - replaces missing values with the most recent previous valid observation (forward fill). * ``"mean"`` - replaces missing values with the mean of the column. * ``"median"`` - replaces missing values with the median of the column. * ``"mode"`` - replace missing values with the mode of the column. * ``"pmm"`` - replaces missing values using predictive mean matching. * ``"lrd"`` - replace missing values using local residual draws. * ``"predict"`` - replace missing values using linear regression prediction. See the Command Reference for :func:`impute` for more details and examples. Organization ++++++++++++++ Sorting data ^^^^^^^^^^^^^^^ Use :func:`sortc` to sort a matrix or dataframe in ascending order based on a certain column. :: a = { 1 3 5, 7 0 9, 4 2 6 }; // Sort 'a' based on the second column print sortc(a, 2); :: 7 0 9 4 2 6 1 3 5 Matrices and dataframes can be sorted on multiple columns using the :func:`sortmc` procedure. :: a = { 1 3 5, 7 0 9, 4 0 6 }; // Sort 'a' based on the second and third column print sortmc(a, 2|3); :: 4 0 6 7 0 9 1 3 5 .. note:: The :func:`sortmc` and :func:`sortc` sort data in ascending order. To sort data in descending order, wrap the call to the sorting procedure using the procedure :func:`rev` . Changing the order of columns ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Use the :func:`order` procedure to reorder columns in a matrix or dataframe. :: // Create example matrix X = { 9 6 2 6, 9 8 2 1, 3 0 2 9, 1 0 3 0 }; // Put the 2nd and 4th columns first X_2 = order(X, 2|4); After the above code, *X_2* will equal: :: 6 6 9 2 8 1 9 2 0 9 3 2 0 0 1 3 :: // Load some variables from a dataset dataset = getGAUSSHome("examples/yellowstone.csv"); yellowstone = loadd(dataset, "LowtTemp + HighTemp + Visits + TotalPrecip + date($Date)"); // Reorder the dataframe so 'date' and 'visits' // are the first two variables yellowstone_2 = order(yellowstone, "Date" $| "Visits"); After the above code, the first four rows of *yellowstone* will be: :: LowtTemp HighTemp Visits TotalPrecip Date -17.0 37.0 30621 1.09 2016/01/01 -17.0 42.0 28091 0.770 2015/01/01 -19.0 41.0 26778 1.28 2014/01/01 -22.0 43.0 24699 0.610 2013/01/01 while the first four rows of *yellowstone_2* look like this: :: Date Visits LowtTemp HighTemp TotalPrecip 2016/01/01 30621 -17.0 37.0 1.09 2015/01/01 28091 -17.0 42.0 0.770 2014/01/01 26778 -19.0 41.0 1.28 2013/01/01 24699 -22.0 43.0 0.610 Deleting columns ^^^^^^^^^^^^^^^^^^^^^ You can delete columns from a matrix using the :func:`delcols` procedure. The columns to remove can be specified as numeric indices for matrices and dataframes: :: a = { 1 3 5 7, 7 0 9 4, 4 2 6 2 }; // Remove the 1st and 3rd column from 'a' print delcols(a, 1|3); :: 3 7 0 4 2 2 You can also use column names to delete columns from a dataframe. :: // Create file name with full path dataset = getGAUSSHome("examples/detroit.sas7bdat"); // Load 4 variables from the dataset detroit = loadd(dataset, "unemployment + weekly_earn + hourly_earn + assault"); // Remove 2 variables from 'detroit' by name detroit = delcols(detroit, "weekly_earn" $| "hourly_earn"); // Print the first 4 rows of 'detroit' print detroit[1:4, .]; :: unemployment assault 11.0 306.18 7.0 315.16 5.2 277.53 4.3 234.07 Deleting rows from a matrix ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Two GAUSS functions are available for deleting rows from a matrix: :func:`delrows` deletes rows based on the specified row number. :: a = { 1 2, 3 4, 5 6, 7 8 }; // Remove the 2nd and 4th row of 'a' print delrows(a, 2|4); :: 1 2 5 6 :func:`trimr` trims rows from either the top and bottom of a matrix. :: a = { 1 2, 3 4, 5 6, 7 8 }; // Trim the top row and the bottom // 2 rows from 'a' print trimr(a, 1, 2); :: 3 4 Conditionally deleting rows of data ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ :func:`delif` conditionally deletes rows from a matrix, dataframe or string array based upon a logical vector. :: a = { 1 2, 3 4, 5 6, 7 8 }; // Remove rows where the element in the // first column of 'a' is equal to 3 print delif(a, a[., 1] .== 3); :: 1 2 5 6 7 8 Conditionally selecting data ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ You can conditionally select data from a matrix, dataframe, or string array using the :func:`selif` procedure. Enter the data as the first input to :func:`selif` and the condition to be used for selecting data as the second input. :: a = { 1 2, 3 4, 5 6, 7 8 }; // Keep rows where the element in the second // column of 'a' is less than or equal to 6 print selif(a, a[., 2] .<= 6); :: 1 2 3 4 5 6 Variable types and names +++++++++++++++++++++++++++++++++ Determining variable or column types ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Use the :func:`getColTypes` procedure to lookup the type of the variables in a dataframe. :func:`getColTypes` returns a dataframe. The table below shows the type labels and their corresponding integer values. +-------+---------+ |Value |Label | +=======+=========+ |0 |String | +-------+---------+ |1 |Numeric | +-------+---------+ |2 |Category | +-------+---------+ |3 |Date | +-------+---------+ :: // Load 4 variables of different types from a dataset dataset = getGAUSSHome("examples/nba_ht_wt.xls"); nba_ht_wt = loadd(dataset, "str(Player) + cat(Pos) + Age + date($BDate, '%m/%d/%Y')"); // Check the types of each variable in 'nba_ht_wt' print getColTypes(nba_ht_wt); The above code will print: :: type String Category Numeric Date :func:`getColTypes` also accepts a second optional input that allows you to check only specified column types. Continuing with the data from our previous example: :: // Check the types of the 2nd and 4th variables in 'nba_ht_wt' print getColTypes(nba_ht_wt, 2|4); will return: :: type Category Date Setting a variable type ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ :func:`asdate` sets the variable type of one or more columns of a matrix or dataframe to be a date. It can also optionally set the date display format. :: // Create a column of numbers which represent // seconds since Jan 1, 1970 (Posix time) d = { 0, 86400, 172800, 259200 }; // Set the variable type of 'd' to be a date d = dfType(d, "Date"); After the above code, *d* will be a date and if we print it we will see: :: X1 1970-01-01 1970-01-02 1970-01-03 1970-01-04 :func:`dftype` is the more general function. It can set columns to any of the four types: numeric, string, category or date. It also accepts an optional input specifying the indices or variable names to be checked. :: // Load 3 variables of different types from a dataset dataset = getGAUSSHome("examples/nba_ht_wt.xls"); nba = loadd(dataset, "str(player) + cat(pos) + age"); After loading the above data, the first four rows of *nba* will be: :: player pos age Vitor Faverani C 25 Avery Bradley G 22 Keith Bogans G 33 Jared Sullinger F 21 We can change the type of the second column from a categorical to a numeric variable like this: :: // Set the second column to be numeric nba = dfType(nba, "Number", "pos"); After this code, the first four rows of *nba* will be: :: player pos age Vitor Faverani 0 25 Avery Bradley 2 22 Keith Bogans 2 33 Jared Sullinger 1 21 The elements of the *pos* now contain only the numeric values that correspond to the string category labels. The string labels, ``"C"``, ``"F"`` and ``"G"`` have been removed. .. note:: You can convert a matrix or string array to a dataframe with :func:`asdf`. Determining current variable names ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ The :func:`getColNames` procedure returns the variable names assigned to columns in a matrix. :: // Load all variables from a CSV file dataset = getGAUSSHome("examples/housing.csv"); housing = loadd(dataset); // Print the variable names from 'housing' print getcolnames(housing); The above code will print out the string array: :: taxes beds baths new price size In addition, it accepts an optional input specifying the indices of the columns of interest. For example, continuing with our previous example: :: // Print the names of the 3rd and 5th variable name print getcolnames(housing, 3|5); will return: :: baths price Setting variable names ^^^^^^^^^^^^^^^^^^^^^^^^^^^ The :func:`setColNames` procedure changes or adds variables names to a matrix or dataframe. :: // Create example matrix X = { 1 2, 3 4, 5 6 }; // Assign variable names to the columns of 'X' X = setcolnames(X, "alpha" $| "beta"); print X; The above code will print: :: alpha beta 1 2 3 4 5 6 It also accepts an optional input specifying the indices or names to be changed. For example, continuing with the example above: :: // Set the second variable name from 'X' to 'gamma' X = setcolnames(X, "gamma", 2); print X; The above code will print: :: alpha gamma 1 2 3 4 5 6 If the data does not currently have variable names, names will be created for all columns, with default names being assigned to any columns for which user-specified names were not provided. Managing category labels ++++++++++++++++++++++++ Extracting current category labels ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ :func:`getColLabels` returns the string category labels and corresponding integer values for a categorical or string column of a dataframe. :: // Create a file name with full path dataset = getGAUSSHome("examples/auto2.dta"); // Load all variables from the dataset auto = loadd(dataset); // Return the string category labels and // corresponding numeric values { labels, values } = getColLabels(auto, "rep78"); After running the code above: :: labels = Poor Values = 1 Fair 2 Average 3 Good 4 Excellent 5 Alternatively, it :func:`getCategories` procedure will return the just the category labels as a GAUSS datframe: :: // Get category labels as GAUSS dataframe labels_df = getCategories(auto, "rep78"); // Print labels labels_df; :: categories Poor Fair Average Good Excellent Setting category labels ^^^^^^^^^^^^^^^^^^^^^^^^ The :func:`setColLabels` procedure allows you to add or modify the labels of categorical variables. It changes the current type of the column to a categorical variable. :: // Create example matrix X = { 1.4 0, 1.9 2, 2.3 1, 0.9 2 }; labels = "low" $| "medium" $| "high"; values = { 0, 1, 2 }; // Make the second column of 'X' a // categorical variable with the // provided labels and values X = setColLabels(X, labels, values, 2); print X; The above code will return: :: X1 X2 1.4 low 1.9 high 2.3 medium 0.9 high .. note:: If a label is not provided for all key values, the unlabeled key values will be given blank labels. Changing category labels ^^^^^^^^^^^^^^^^^^^^^^^^^^^ The :func:`recodecatlabels` procedure changes category labels. :: // Load NBA data dataset = getGAUSSHome("examples/nba_ht_wt.xls"); nba = loadd(dataset); // Get column labels { labels, values } = getColLabels(nba, "Pos"); Here are the initial category labels and order. :: labels = C values = 0 F 1 G 2 We can change the category labels like this: :: // Specify current labels old_labels = "C" $| "F" $| "G"; // Specify new labels to set new_labels = "Center" $| "Forward" $| "Guard"; // Recode the old labels to the new labels nba = recodeCatLabels(nba, old_labels, new_labels, "Pos"); // Get column labels { labels, values } = getColLabels(nba, "Pos"); :: labels = Center values = 0 Forward 1 Guard 2 As we can see above the label names have changed, but the underlying values and order are the same. The :func:`recodecatlabels` procedure can be used to change individual labels, rather than all labels. For example, to change just one label we could have used: :: // Specify current labels old_labels = "C"; // Specify new labels to set new_labels = "Center"; // Recode the old labels to the new labels nba = recodeCatLabels(nba, old_labels, new_labels, "Pos"); // Get column labels { labels, values } = getColLabels(nba, "Pos"); Dropping category labels ^^^^^^^^^^^^^^^^^^^^^^^^ The :func:`dropCategories` procedure drops all observations of a specified category label from a dataframe and updates the category mapping. Note that this functionality is different from deleting observations using other tools like :func:`delif`. For example, consider dropping observations using delif: :: // Load NBA data dataset = getGAUSSHome("examples/nba_ht_wt.xls"); nba = loadd(dataset); // Delete observations with 'rep78' // equal to "Poor" nba_no_center = delif(nba, nba[., "Pos"] .== "C"); // Get label getCategories(nba_no_center, "Pos"); :: categories C F G In this case, the ``C`` category label is still stored as one of the *Pos* labels. To remove it when deleting labels :func:`dropCategories`` should be used. :: // Drop observations and remove label from metadata nba_no_center_2 = dropCategories(nba, "C", "Pos"); // Check 'rep78' categories getCategories(nba_no_center_2, "Pos"); :: categories F G Change the order of categories in a dataframe ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ :: // Load dataset dataset = getGAUSSHome("examples/yarn.xlsx"); yarn = loadd(dataset); // Get labels and values for amplitude variable // in yarn dataframe { labels_1, values_1 } = getColLabels(yarn, "amplitude"); After the above code: :: labels_1 = high values_1 = 0 low 1 med 2 Since Excel files do not provide labels or order for string columns, GAUSS assigns the category value based on alphabetical order. We can reorder the categories like this: :: // Change the order of the category labels for the // variable 'amplitude' in 'yarn' yarn = reordercatlabels(yarn, "low" $| "med" $| "high", "amplitude"); // Get column labels and key values for `amplitude` { labels_2, values_2 } = getColLabels(yarn, "amplitude"); After the above code: :: labels_2 = low values_2 = 0 med 1 high 2 Changing category base case ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ The :func:`setbasecat` function provides a convenient way to set the base case for a categorical variable. :: // Load the NBA dataset dataset = getGAUSSHome("examples/nba_ht_wt.xls"); nba = loadd(dataset); // Get column names { labels, values } = getColLabels(nba, "Pos"); After the above code: :: labels = C values = 0 F 1 G 2 You can change ``G`` to the base case like this: :: // Change the `G` category to the basecase nba = setBaseCat(nba, "G", "Pos"); // Get new labels { labels, values } = getColLabels(nba, "Pos"); As we can see below, the new base case, ``G``, has been moved to the top and all the other variables have been shifted down. :: labels = G values = 0 C 1 F 2 Cleaning strings and category labels ++++++++++++++++++++++++++++++++++++ GAUSS has a comprehensive suite of tools for managing and cleaning strings. Trimming whitespaces ^^^^^^^^^^^^^^^^^^^^^ Excess whitespaces in strings and categorical variables can lead to unexpected results. To prevent this, trimming excess whitespaces should be done using one of three GAUSS procedures: * The :func:`strtrimr` procedure strips whitespace characters from the right side. * The :func:`strtriml` procedure strips whitespace characters from the left side. * The :func:`strtrim` procedure strips whitespace characters from both the left and right side. **Example: Trimming all whitespaces** :: // Create string array string names_string = { " John", "Mary ", " Jane ", "Carl" }; // Convert to string array names_df = asDF(names_string, "First Name"); // Check names print names_df[3]; print names_df[4]; Printing the third and fourth elements of *names_df* highlights the whitespaces in the *First Name* variable. :: First Name Jane First Name Carl Compare this to printing the four element, which contains no whitespaces. :: // Trim whitespaces names_df = strtrim(names_df); // Check names print names_df[3]; print names_df[4]; :: First Name Jane First Name Carl .. note:: The :func:`print` function will automatically align the string array, so ``print header_sa`` will make it appear as if the leading and trailing spaces are gone. To see the spaces, we print individual elements. Standardizing case ^^^^^^^^^^^^^^^^^^ Symbol names in GAUSS are not case-sensitive. For example, consider the following example of variable naming. :: // Assign values to 'x' x = 5; // Print little x value print "little x:" x; :: little x: 5.0000000 :: // Assign values to 'X' X = 10; print "little x:" x; :: little x: 10.000000 However, string and category variables, as well as variable names, are case sensitive. Because of this, inconsistent use of cases in strings and category labels can result in undesired results. For example, consider survey data with self reported location abbreviations. :: // Generate states string array st_abbreviation = "CO" $| "Co" $| "CA" $| "CA" $| "Ca" $| "Mo" $| "MO"; // Convert to dataframe st_df = asDF(st_abbreviation, "State"); // Print observations st_df; // Print categories getCategories(st_df); Because of differences in cases, GAUSS thinks there are 6 different categories. :: categories CA CO Ca Co MO Mo Consider if we use these categories and compute a frequency count. :: // Compute frequency count for 'State' frequency(st_df, "State"); :: Label Count Total % Cum. % CA 2 28.57 28.57 CO 1 14.29 42.86 Ca 1 14.29 57.14 Co 1 14.29 71.43 MO 1 14.29 85.71 Mo 1 14.29 100 Total 7 100 To remedy this, :func:`upper` or :func:`lower` should be used to convert the state abbreviations to the same case. :: // Convert to upper case st_df = upper(st_df); // Print 'states' st_df; // Compute frequency count frequency(st_df, "State"); :: State CO CO CA CA CA MO MO Label Count Total % Cum. % CA 3 42.86 42.86 CO 2 28.57 71.43 MO 2 28.57 100 Total 7 100 Note that :func:`upper` converts all observations of *st_df* to upper case and updates the label mappings. Searching and replacing strings ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Searching and replacing is a key part of cleaning strings and categorical data. This can be done using a number of GAUSS functions: +-------------------+-------------------------------------+ |Procedure |Description | +===================+=====================================+ |:func:`strrindx` |Finds the index of one string within | | |another string. Searches from the end| | |to the beginning. | +-------------------+-------------------------------------+ |:func:`strindx` |Finds the index of one string within | | |another string. | +-------------------+-------------------------------------+ |:func:`strreplace` |Replaces all matches of a substring | | |with a replacement string. | +-------------------+-------------------------------------+ |:func:`startsWith` |Returns a 1 if a string starts with | | |a specified pattern. | +-------------------+-------------------------------------+ **Searching across multiple variables** The :func:`strindx` and :func:`strrindx` procedures perform element-by-element searches for substrings in string arrays. They return the starting indices of the substring or a 0 if the substring is not found. :: // Create file name with full path fname = getGAUSSHome("examples/auto2.dta"); // Load 'rep78' and 'make` variable auto = loadd(fname, "rep78 + make"); // Preview data head(auto); The first five occurrences of the *auto* dataframe look like: :: rep78 make Average AMC Concord Average AMC Pacer . AMC Spirit Average Buick Century Good Buick Electra Now we will search for different substrings in each separate variables. :: // Find the index of "age" in 'rep78' // and "AMC" in 'make' idx = strindx(auto, "age"$~"AMC"); // Print the first 5 observations of 'idx' head(idx); The preview shows that: * In the *rep78* variable, the substring ``"age"`` starts at the fifth letter every time ``Average`` is observed. * In the *make* variable, the substring ``"AMC"`` starts at the first letter of the first three observations. :: 5.0000000 1.0000000 5.0000000 1.0000000 0.0000000 1.0000000 5.0000000 0.0000000 0.0000000 0.0000000 There are many uses of this. For example, suppose we want to select only the observations that contain the substring ``"AMC"``: :: // Select observation if 'AMC` occurs anywhere // in the string amc_data = selif(auto, idx[., 2]); // Preview data head(amc_data); The preview of the *amc_data* only prints 3 observations because only three observations remain. :: rep78 make Average AMC Concord Average AMC Pacer . AMC Spirit **Searching for starting substrings** In the previous example, we search for the substring ``"AMC"``. Altenatively, :func:`startsWith` could be used to search for starting patterns. :: // Load 3 variables from the dataset fname = getGAUSSHome("examples/auto2.dta"); auto = loadd(fname, "make + price + mpg"); // Specify pattern to search for pat = "Buick"; // Find all makes that include 'Buick' mask = startsWith(auto[., "make"], pat); // Select observations if the corresponding // row of mask equals 1. auto_buicks = selif(auto, mask); print auto_buicks; The code above selects all observations that start with ``"Buick"``. :: make price mpg Buick Century 4816.0000 20.000000 Buick Electra 7827.0000 15.000000 Buick LeSabre 5788.0000 18.000000 Buick Opel 4453.0000 26.000000 Buick Regal 5189.0000 20.000000 Buick Riviera 10372.000 16.000000 Buick Skylark 4082.0000 19.000000 **Regularize a string array** In this example, :func:`strreplace` is used to clean a string array that contains addresses. :: // String array to be searched str = "100 Main Ave" $| "112 Charles Avenue" $| "49 W State St" $| "24 Third Avenue"; // Search for string 'Avenue' search = "Avenue"; // String to replace with replace = "Ave"; // Build new string new_str = strreplace(str, search, replace); After the code above, *new_str* will be set to: :: "100 Main Ave" "112 Charles Ave" "49 W State St" "24 Third Ave" **Cleaning categorical labels** The :func:`strreplace` procedure can be used to clean categorical labels and will simultaneously updated the mapping of labels and keyvalues. :: // Create 5x1 string array states = "CA" $| "FL" $| "California" $| "California" $| "FL"; // Convert the string array to a dataframe // with the variable name 'States' df_states = asdf(states, "States"); // Print the dataframe print df_states; // Check category getCategories(df_states); The *df_states* dataframe is: :: States CA FL California California FL And the associated categories are: :: categories CA California FL Suppose that for the sake of our analysis, the category ``CA`` and ``California`` are treated the same. This can be corrected using the :func:`strreplace` procedure. :: // Search for the "California" label search = "California"; // Replace the "California" label with "CA" replace = "CA"; // Call 'strreplace' df_states = strreplace(df_states, search, replace); // Print dataframe print df_states; After this, all occurrences of ``California`` have been replaced with ``CA``. :: States CA FL CA CA FL Checking the categories will confirm that the keyvalues and labels have been updated. :: // Get updated categories getCategories(df_states); As we see below, the observations that previously had the label ``California``, have now been merged with the ``CA`` category. :: categories CA FL Extracting substrings ^^^^^^^^^^^^^^^^^^^^^^ The :func:`strsect` procedure extracts a substring of a string based on a specified starting point and an optional length. :: // Create location string array location = "CA, Sacramento" $| "FL Tampa" $| "SC - Charleston" $| "NC - Raleigh"; // Convert to dataframe location_df = asDF(location, "Location"); // Print location location_df; The *location_df* variable is not consistently formatted, with the exception that the two letter state abbreviations always occurs first. :: Location CA, Sacramento FL Tampa SC - Charleston NC - Raleigh The :func:`strsect` procedure can be used to extract the state abbreviation :: // Extract the state abbreviation state = strsect(location_df, 1, 2); // Print state state; :: Location CA FL SC NC .. note: The :func:`strsect` function can be used with :func:`strindx` to search for an extract substrings. Splitting strings ^^^^^^^^^^^^^^^^^^ The :func:`strsplit` procedure can be used to split strings based on a specified separator. :: // Create location string array location = "FL, Tampa" $| "CO, Denver" $| "SC, Charleston" $| "NC, Raleigh"; // Convert to dataframe location_df = asDF(location, "Location"); // Print preview head(location_df); :: Location FL, Tampa CO, Denver SC, Charleston NC, Raleigh Now our *Location* variable contains state abbreviations and cities, separated by ``","``. We can use :func:`strsplit` to split the location variable into two separate columns. :: // Split the location dataframe into to columns // using ',' as a separator location_split = strsplit(location_df, ","); // Print location dataframe location_split; :: X1 X2 FL Tampa CO Denver SC Charleston NC Raleigh The *location_split* dataframe contains two columns, one that contains the state abbreviation and one that contains the city name. There are additional steps that will help further clean this data: * Rename the columns * Remove any whitespaces. :: // Set column names location_split = setColNames(location_split, "State"$|"City"); // Remove excess whitespace location_split = strtrim(location_split); // Print dataframe location_split; :: State City FL Tampa CO Denver SC Charleston NC Raleigh Further Reading ++++++++++++++++ 1. `Preparing and Cleaning FRED data in GAUSS `_ 2. `Managing String Data with GAUSS Dataframes `_ 3. `Getting Started With Survey Data In GAUSS `_ 4. `Getting to Know Your Data With GAUSS 22 `_