Introduction to GAUSS for Stata Users ======================================= This page provides a basic overview of how common Stata operations can be implemented in GAUSS. It is not meant to serve as a comprehensive GAUSS guide. However, we do provide references for those who wish to explore topics in greater depth. Data Storage ----------------------------------------------------------- GAUSS stores data in matrices, string arrays, and dataframes. One of the key differences between data storage in GAUSS and Stata is that GAUSS allows you to store data from multiple sources simultaneously. In Stata, people are most familiar with working with a single dataset in memory. Stata does allow you to store multiple datasets in memory using specified dataframes but special commands must be used to switch between frames. +--------------------+-----------------------+--------------------+ | Reference | GAUSS | Stata | +====================+=======================+====================+ |Data structure | Dataframe or matrix | Data set | +--------------------+-----------------------+--------------------+ |Series of data | Column | Variable | +--------------------+-----------------------+--------------------+ |Single occurrence | Row | Observation | +--------------------+-----------------------+--------------------+ |Missing Values | `.` | `.` | +--------------------+-----------------------+--------------------+ What is a GAUSS dataframe? ++++++++++++++++++++++++++++++ .. figure:: ../_static/images/data-import-window-1.jpg :width: 80% A GAUSS dataframe is used to store two-dimensional data and allows you to store: * Data in rows and columns. * Information about the data type and type-related properties. * Different variables together, including categorical data, strings, and dates. Many internal functions are designed to work intelligently with dataframes to use variable names and types for estimation and reporting. For example, :func:`olsmt` will use the information stored in a dataframe during estimation to: * Properly include dummy variables when categorical independent variables are present. * Include variable names in output reports. Variables ^^^^^^^^^^^^^^^^ Each column of a GAUSS dataframe contains a series of data for a single variable. Variables are stored as strings, numbers, categories, or dates. In Stata, variables are referenced directly by name. .. code-block:: Stata list mpg In GAUSS, variables can be referenced by indexing with variable name or by column number. However, we must tell GAUSS which dataframe to look for the variable in. For example, if the variable ``mpg`` is stored in the fourth column of the dataframe ``auto2`` we could use either :: auto2[., "mpg"]; or :: auto2[., 4]; to reference the variable. .. note:: The ``.`` indicates to GAUSS that all rows are being indexed. This will be discussed in more detail in the indexing section. +--------------------+---------------------------------------------+------------------------------------+ | Variable | Description | Examples | | Type | | | +====================+=============================================+====================================+ |String |The string data type can contain letters, | Customer names, product names, | | |numbers, and other characters. | or book titles. | +--------------------+---------------------------------------------+------------------------------------+ |Number |Analogous to the data stored in | Daily temperatures, real GDP, | | |GAUSS matrices. | stock prices. | +--------------------+---------------------------------------------+------------------------------------+ |Categories |Houses discrete variables that capture | Marriage status, performance | | |qualitative data. | ratings, transportation modes. | +--------------------+---------------------------------------------+------------------------------------+ |Dates |Houses and displays dates and times. | Purchase date, shipping date, | | | | observation date. | +--------------------+---------------------------------------------+------------------------------------+ Observations ^^^^^^^^^^^^^^^^ Each row of a GAUSS dataframe contains simultaneous observations of variables. In `time series data `_ or `panel data `_ , this may correspond to dates of observations. In cross-sectional data, this may correspond to some other identifier such as identification number, observation number, or name. Rows of data are indexed by row number. For example, if we want to access the data stored in the fourth row we use :: auto2[4, .]; Data Input/Output -------------------- Constructing a dataframe from values +++++++++++++++++++++++++++++++++++++ In Stata, the ``input`` statement is used to build datasets from specified values and column names: .. code-block:: Stata input x y 1 2 3 4 5 6 end In GAUSS, a dataframe can be created from a manually entered matrix and variable names using the :func:`asDF` procedure: :: // Create a 3 x 2 matrix mat = { 1 2, 3 4, 5 6 }; // Convert matrix to a dataframe // and name the first column "X" // and the second column "Y" df = asDF(mat, "X", "Y"); Reading external datasets +++++++++++++++++++++++++++++++++++++ GAUSS can directly read and load data from most data formats, including: * CSV * Excel (XLS, XLSX) * HDF 5 * GAUSS matrices (FMT) * GAUSS datasets (DAT) * Stata datasets (DTA) * SAS datasets (SAS7BDAT, SAS7BCAT) * SPSS datasets (SAV) In Stata, the ``import`` command is used to import non-Stata datasets. Additional information must be provided to specify what type of file is being imported. .. code-block:: Stata import excel "nba_ht_wt.xls", clear Alternatively, the ``tips2.csv`` dataset is loaded into Stata using the import delimited command .. code-block:: Stata import delimited "tips2.csv", clear .. note:: The use of the ``clear`` option is necessary in Stata if the data is already loaded into the workspace. In GAUSS, this is not necessary because multiple data sets can be loaded into the work space simultaneously. In GAUSS, all data files are usually loaded using the :func:`loadd` procedure. For example, consider loading the ``auto2.dta`` dataset: :: // Load all variables from the file 'auto2.dta' // using their default types auto2 = loadd(getGAUSSHome $+ "examples/auto2.dta"); This loads all the variables in the dataset and auto-detects their type. .. figure:: ../_static/images/data-import-window-1.jpg :width: 80% Sometimes, you may need to specify the type and/or variables that you wish to load. This is done using a `formula string `_: For example, let’s consider loading the ``nba_ht_wt.xls`` file in GAUSS :: // Create filename fname = getGAUSSHome $+ "examples/nba_ht_wt.xls"; // Load the file 'nba_ht_wt.xls' // using a formula string to select variables // and specify variable types nba_ht_wt = loadd(fname, "str(Player) + cat(Pos) + Height + Weight + str(School)"); Similarly, the ``tips2.csv`` data file: :: // Create filename fname = getGAUSSHome $+ "examples/tips2.csv"; // Load the file 'tips2.csv' // using a formula string to select variables // and specify variable types tips2 = loadd(fname, "id + total_bill + tip + cat(sex) + cat(time)"); .. note:: The :func:`getGAUSSHome` function is a convenience function that returns the full path to the GAUSS home directory. Formula strings accept a number of operators and keywords which allow you to: * Specify variable types. * Perform data transformations. +--------------------+---------------------------------------------+ |Operator | Purpose | +====================+=============================================+ | ``.`` |Represents all variables. | +--------------------+---------------------------------------------+ | ``+`` |Adds a variable. | +--------------------+---------------------------------------------+ | ``-`` |Removes a variable. | +--------------------+---------------------------------------------+ | ``1`` |Represents an intercept term. | +--------------------+---------------------------------------------+ | ``*`` |Adds an interaction term and includes both | | |original variables. | +--------------------+---------------------------------------------+ | ``:`` |Adds an interaction term between two | | |variables but does not include either | | |of the original variables. | +--------------------+---------------------------------------------+ +--------------------+---------------------------------------------+ |Keyword | Purpose | +====================+=============================================+ | ``cat`` |Load a variable as a categorical column. | +--------------------+---------------------------------------------+ | ``date`` |Load a variable as a date column. | +--------------------+---------------------------------------------+ | ``str`` |Load a variable as a string column. | +--------------------+---------------------------------------------+ | ``$`` |Indicate that a variable is stored in the | | |file as a string as should be passed to the | | |keyword or procedure as a string column. | +--------------------+---------------------------------------------+ The GAUSS Data Management guide provides a complete guide to `Programmatic Data Import `_. Interactively loading data +++++++++++++++++++++++++++++++++++++ The GAUSS **Data Import** window is a completely interactive environment for loading data and performing preliminary data cleaning. It can be used to: * Select variables and change types. * Select observation by range or logic filtering. * Manage date formats and category labels. * Preview data. The **Data Import** window offers a data import experience similar to Stata’s menu driven data import. Like Stata, the GAUSS **Data Import** window auto-generates code that can be reused. .. figure:: ../_static/images/data-import-code-generation.png :width: 80% You can open the **Data Import** window in three ways: * Select **File > Import Data** from the main GAUSS menu bar. * From the **Project Folders** window: * Double-click on the name of the data file. * Right-click the file and select **Import Data**. A complete `guide to interactively loading data `_ is available in the GAUSS Data Management guide. Viewing Data +++++++++++++++++ Data can be viewed in GAUSS a number of ways: * Using the **GAUSS Data Editor**. .. figure:: ../_static/images/data-cleaning-open-symbol-editor-filter.jpg :width: 80% * Opening a floating **Symbols Editor** window using `Ctrl+E`. * Printing data to the **Command Window**. For a quick preview, portions of a dataframe can be printed directly to screen using indexing. For example, the first five rows the `auto2` dataframe can be printed to screen by entering :: auto2[1:5, .]; This is equivalent to using the ``list`` command in Stata .. code-block:: Stata list 1/5 If we only wanted to view the first five rows of the variable ``mpg`` from the ``auto2`` dataframe, we would use :: auto2[1:5, "mpg"]; which is equivalent to .. code-block:: Stata list mpg 1/5 In GAUSS, you can also preview the beginning or end of your data using the :func:`head` or :func:`tail` functions, respectively. For example, to view the first five rows of ``make``, ``price``, and ``mpg`` in the dataframe ``auto2``: :: head(auto2[., "make" "price" "mpg"]); This prints :: make price mpg AMC Concord 4099.0000 22.000000 AMC Pacer 4749.0000 17.000000 AMC Spirit 3799.0000 22.000000 Buick Century 4816.0000 20.000000 Buick Electra 7827.0000 15.000000 We can include an optional input to indicate how many rows to include. A positive number specifies how many rows to print. For example, to print the first ten rows: :: head(auto2[., "make" "price" "mpg"], 10); This prints the first ten rows: :: make price mpg AMC Pacer 4749.0000 17.000000 Buick Century 4816.0000 20.000000 Buick Electra 7827.0000 15.000000 Buick LeSabre 5788.0000 18.000000 Buick Opel 4453.0000 26.000000 Buick Regal 5189.0000 20.000000 Buick Riviera 10372.000 16.000000 Buick Skylark 4082.0000 19.000000 A negative number indicates how many rows to skip before beginning printing. For example, to print everything after the first 10 rows of data: :: head(auto2[., "make" "price" "mpg"], -10); Data Operations -------------------- Indexing matrices and dataframes ++++++++++++++++++++++++++++++++++++ GAUSS uses square brackets ``[]`` for indexing matrices. The indices are listed row first, then column, with a comma separating the two. For example, to index the element in the 3rd row and 7th column of the matrix ``x``, we use: :: x[3, 7]; To select a range of columns or rows with numeric indices, GAUSS uses the `:` operator: :: x[3:6, 7]; GAUSS also allows you to use variable names in a dataframe for indexing. As an example, if we want to access the 3rd observation of the variable ``mpg`` in the ``auto2`` dataframe, we use: :: auto2[3, "mpg"]; You can also select multiple variables using a space separated list: :: auto2[3, "mpg" "make"]; Finally, GAUSS allows you index an entire column or row using the ``.`` operator. For example, to see all observations of the variable ``mpg`` in the ``auto2`` dataframe, we use: :: auto2[., "mpg"]; Operations on variables +++++++++++++++++++++++++ In Stata, ``generate`` and ``replace`` are required to either transform existing variables or generate new variables using existing variables: .. code-block:: Stata replace total_bill = total_bill - 2 generate new_bill = total_bill / 2 In GAUSS, these operations are performed using operators. For example, GAUSS uses: * The ``-`` operator to subtract values. * The ``/`` operator to divide values. * The ``=`` to assign the new values to a storage location. * The ``~`` to add new columns to a matrix or dataframe. :: // Subtract 2 from all observations of the // variable 'total_bill' in the 'tips2' dataframe tips2[., "total_bill"] = tips2[., "total_bill"] - 2; // Divide all observations of the variable // 'total_bill' in the 'tips2' dataframe by 2 tips2[., "total_bill"] = tips2[., "total_bill"] / 2; // Divide all observations of the variable // 'total_bill' in the 'tips2' dataframe by 2 // and generate 'new_bill' tips2 = tips2 ~ dfname(tips2[.,"total_bill"] / 2, "new_bill"); Matrix operations +++++++++++++++++++ GAUSS is a matrix based language and matrix operations play a fundamental role in GAUSS computations. **Common Matrix Operators** +--------------------+-----------------------+-------------------------+ |Description | GAUSS | Stata | +====================+=======================+=========================+ |Matrix multiply | ``z = x * y;`` | ``matrix z = x*y`` | +--------------------+-----------------------+-------------------------+ |Solve system of | ``b = y / x;`` | ``matrix b = y*inv(x)`` | |linear equations | | | +--------------------+-----------------------+-------------------------+ |Kronecker product | ``z = x .*. y;`` | ``matrix z = x#y`` | +--------------------+-----------------------+-------------------------+ |Matrix transpose | ``z = x';`` | ``matrix z = x’`` | +--------------------+-----------------------+-------------------------+ When dealing with matrices, it is important to distinguish matrix operations from element-by-element operations. In Stata, element-by-element operations are specified with a colon ``:``. In GAUSS, element-by-element operations are specified by a dot ``.``. **Element-by-element (ExE) Operators** +---------------------------------+-----------------------+-------------------------+ |Description | GAUSS | Stata | +=================================+=======================+=========================+ |Element-by-element multiply | ``z = x .* y;`` | ``matrix z = x:*y`` | +---------------------------------+-----------------------+-------------------------+ |Element-by-element divide | ``z = y ./ x;`` | ``matrix z = y:/x`` | +---------------------------------+-----------------------+-------------------------+ |Element-by-element exponentiation| ``z = x .^ y;`` | ``matrix z = x:^y`` | +---------------------------------+-----------------------+-------------------------+ |Element-by-element addition | ``z = x + y;`` | ``matrix z = x + y`` | +---------------------------------+-----------------------+-------------------------+ |Element-by-element subtraction | ``z = x - y;`` | ``matrix z = x - y`` | +---------------------------------+-----------------------+-------------------------+ For a more in depth look at how matrix operation works in GAUSS you may want to review our blogs: * `GAUSS Basics 3: Intro to Matrices `_ * `GAUSS Basics 4: Matrix Operations `_ * `GAUSS Basics 5: Element-by-Element Conformability `_ Filtering +++++++++++++++++++ In Stata, data is filtered using an ``if`` clause when using other commands. For example, to keep all observations where ``total_bill`` is greater than 10 we use: .. code-block:: Stata keep if total_bill > 10 In GAUSS this can be done interactively with the **Data Management Tool**: .. figure:: ../_static/images/filtering-tips.jpg :width: 80% Programmatically this is done using the :func:`selif` procedure: :: // Select observations from the tips2 dataframe // where the total_bill variable is greater than 10 tips2 = selif(tips2, tips2[., "total_bill"] .> 10); More information about filtering data can be found in: * The `Interactive Data Cleaning section `_ of the Data Management Guide. * `Preparing and Cleaning FRED data in GAUSS `_ * `Getting to Know Your Data with GAUSS 22 `_ Selection of data +++++++++++++++++++ Stata allows you to select, drop, or rename columns using command line keywords: .. code-block:: Stata keep sex total_bill tip drop sex rename total_bill total_bill_2 In GAUSS, the same can be done using the **Data Management** pane. .. figure:: ../_static/images/data-cleaning-open-symbol-editor-filter.jpg To open the **Data Management** pane: 1. Double-click the name of the dataframe in the **Symbols** window on the **Data** page. 2. Click the **Manage** button with the cog icon on the top right of the open **Symbol Editor** window. Select columns from a dataframe ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Columns can be selected or removed from the dataframe using the **Variables** list. * If a variable has a check box next to the name of the variables it is included in the dataframe. * To clear the variable from the dataframe clear the check box next to the variable name. These changes will not be made until you click **Apply**. Changing variable names ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Variable names can also be changes from the **Variables** list. .. figure:: ../_static/images/data-organization-rename-variable.jpg :scale: 50% 1. Double-click the dataframe you want to modify in the **Symbols** pane of the **Data** page. 2. Click the **Manage** button at the top right of the open **Symbol Editor**. 3. Click downward pointing triangle button to the right of the name of the variable name you want to change and select **Rename**. 4. Enter the new name in the **Name** text box. These changes will not be made until you click **Apply**. GAUSS also offers programmatic options for selecting data and changing variable names: :: // Keep only 'total_bill' 'tip' and 'sex' tips2 = tips2[., "total_bill" "tip" "sex"]; // Drop sex variable tips2 = delcols(tips2, "sex"); // Rename variable 'total_bill' to 'total_bill_2' tips2 = dfname(tips2, "total_bill_2", "total_bill"); Sorting ++++++++++++++++ In Stata the ``sort`` command is used for sorting data: .. code-block:: Stata sort sex total_bill In GAUSS, this is done using :func:`sortc`. We can accomplish the same sorting as the Stata line above using: :: // Sort the 'tips2' dataframe based // on 'sex' and 'total_bill' variables tips2 = sortc(tips2, "sex" $| "total_bill"); Date Functionality -------------------- GAUSS dataframes include a date data type which makes it convenient to read, format, and use dates in analysis. Date variables can be loaded interactively using the **Data Import** window or programmatically using :func:`loadd` and the ``date`` keyword. Creating usable dates from raw data ++++++++++++++++++++++++++++++++++++++ In Stata, dates are most often imported as strings from raw data. They must then be converted to usable date types using the ``date()`` function and a readable format is set using ``format``. For example, when the ``yellowstone.csv`` dataset is imported into Stata, the variable ``date`` is a string variable The ``date`` variable must be converted to a date type: .. code-block:: Stata generate date_var = date(date, "YMD"); and the viewing format should be set .. code-block:: Stata format date_var %d. In GAUSS, dates can be directly read in as date variables using the :func:`loadd` procedure and the ``date`` keyword. The :func:`loadd` procedure automatically detects common date formats and doesn’t require a format specification unless a custom format is being used in the raw data: :: // Create filename fname = getGAUSSHome $+ "examples/yellowstone.csv"; // Load the variable Visits, LowtTep, HighTemp and Date // from the file 'yellowstone.csv' yellowstone = loadd(fname, "Visits + LowtTemp + HighTemp + date($Date)"); .. figure:: ../_static/images/yellowstone-dates.jpg :width: 80% Creating dates from existing strings ++++++++++++++++++++++++++++++++++++++ The GAUSS :func:`asDate` procedure works similarly to the Stata ``date()`` function and can be used to convert strings to dataframe dates. For example, suppose we want to convert the string ``"2002-10-01"`` to a date in Stata: .. code-block:: Stata generate date_var = date("2002-10-01", "YMD") When we do this in Stata the data is displayed in the date numeric format and we have to use the ``format`` command to change the display format: .. code-block:: Stata format date_var %d In GAUSS, this is done using the :func:`asDate` procedure: :: // Convert string date '2002-10-01' // to a date variable date_var = asDate("2002-10-01"); The :func:`asDate` procedure automatically recognizes dates in the format ``"YYYY-MM-DD HH:MM:SS"``. However, if the date is in a different format, a format string can be used: :: // Convert string date '10/01/2002' // to a date variable date_var = asDate("10/01/2002", "%d/%m/%Y"); Changing the display format ++++++++++++++++++++++++++++++++++++++ Once a date variable has been imported or created, the display format can be specified interactively using the GAUSS **Data Management Tool**. The **Specify Date Format** dialog is accessed by selecting **Properties** from the variable's dropdown: .. figure:: ../_static/images/interactive-data-cleaning-variable-properties.jpg :width: 60% If the variable is a date variable, the **Specify Date Format** window will open: .. figure:: ../_static/images/select-date-format.jpg :width: 60% Dates can be managed programmatically using :func:`asDate`: :: // Convert 'Date' variable from string variable // to date variable yellowstone = asdate(yellowstone, "%b-%d-%Y", "Date"); String Processing ------------------- Finding the length of a string +++++++++++++++++++++++++++++++ The ``strlen()`` and ``ustrlen()`` functions are used in Stata to find the length of strings: .. code-block:: Stata generate strlen_time = strlen(time) generate ustrlen_time = ustrlen(time) GAUSS also uses a :func:`strlen()` procedure to find string lengths: :: // Find length of all observations // of the variable 'time' in // the 'tips2' dataframe strlen_time = strlen(tips2[., "time"]); Finding the position of a substring +++++++++++++++++++++++++++++++++++++++ Finding the position of strings can be useful for data searching and cleaning. In Stata, the ``strpos()`` function allows you to find the location of a specified substring within another string: .. code-block:: Stata generate str_position = strpos(sex, "ale") In GAUSS, this is done using the :func:`strindx()` or :func:`strrindx()` procedures. The :func:`strindx()` procedure searches from the beginning of the string and the :func:`strrindx()` procedure searches from the end of the string. The functions require two inputs: * *where* (string or scalar) – the data to be searched. * *what* (string or scalar) – the substring to be searched for in *where*. For example consider the ``sex`` variable in the ``tips2`` dataframe. The first ten observations are: :: tips2[1:10, "sex"]; sex Female Male Male Male Female Male Male Male Male Male :: // Find the location of the substring 'ale' // in the variable 'sex' in the 'tips2' dataframe str_pos = strindx(tips2[., "sex"], "ale"); // Display the first 10 observations of // all variables in 'str_pos' str_pos[1:10, .]; The printed result is: :: 4.0000000 2.0000000 2.0000000 2.0000000 4.0000000 2.0000000 2.0000000 2.0000000 2.0000000 2.0000000 Extracting a substring by position ++++++++++++++++++++++++++++++++++++ In Stata, the ``substr()`` function is used to extract substrings from a string. The ``substr()`` function uses position and string length to specify which substring to extract: .. code-block:: Stata generate short_sex = substr(sex, 1, 1) The same thing can be done in GAUSS using the :func:`strsect()`: :: // Extract first letter from // the variable 'sex' in the // 'tips2' dataframe short_sex = strsect(tips2[., "sex"], 1, 1); short_sex[1:5, .]; The printed result is: :: sex F M M M F Extracting words ++++++++++++++++++ Stata allows you to extract the nth word from a string using the :func:`word()` function. For example, to consider if we wish to separate the first and last names from a name into two variables. .. code-block:: Stata clear input str20 name "John Smith" "Jane Cook" end generate first_name = word(name, 1) generate last_name = word(name, -1) While GAUSS doesn’t have an exactly analogous function, this can be done fairly easily using the :func:`strsplit` procedure. The :func:`strsplit` procedure splits the string using an optional specified separator. If no separator is provided, :func:`strsplit` separates strings based on spaces. For example: :: // Generate string array of names string name = { "John Smith", "Jane Cook" }; // Split into two strings // and name variables 'first_name' and 'last_name' name_split = asDF(strsplit(name), "first_name", "last_name"); This creates the ``name_split`` dataframe: :: first_name last_name John Smith Jane Cook If the original name data has first, middle, and last names, all separated by spaces, then :func:`strsplit` will split the strings into three columns: :: // Generate string array of names string full_name = { "John Robert Smith", "Jane Elizabeth Cook" }; // Split into three strings // and name variables 'first_name', 'middle_name', and 'last_name' name_split = asDF(strsplit(full_name), "first_name", "middle_name", "last_name"); Now the ``name_split`` variable contains three variables: :: first_name middle_name last_name John Robert Smith Jane Elizabeth Cook Finally, suppose our names are separated by a comma and a space, instead of a space: :: // Generate string array of names string name = { "Smith,John", "Cook,Jane" }; // Split into two strings using ', ' as a separator // and name variables 'last_name' and 'first_name' name_split = asDF(strsplit(name, ", "), "last_name", "first_name"); Now our ``name_split`` variable is: :: last_name first_name Smith John Cook Jane Changing case ++++++++++++++++++++ GAUSS uses the :func:`upper` and :func:`lower` procedures to change all letters in strings to uppercase and lowercase, respectively. For example: :: // Change time variable in 'tips2' to all uppercase tips2[., "time"] = upper(tips2[., "time"]); // Change sex variable in 'tips2' to all lowercase tips2[., "sex"] = lower(tips2[., "sex"]); This compares to the ``strupper()`` and ``strlower()`` functions in Stata, which change all letters in a string to uppercase and lowercase, respectively. .. code-block:: Stata generate upper_time = strupper(time) generate lower_sex = strlower(sex) Missing values ------------------- Missing values are represented by the same dot notation, ``.``, in both Stata and GAUSS. This notation can be used for filtering data Stata: .. code-block:: Stata * Keep missing values list if value_x == . * Keep non-missing values list if value_x != . In GAUSS missing values can be created with a statement or using the :func:`error` function: :: // Keep missing values mss = { . }; data = selif(data, data[., "x"] .== mss)); // Keep non-missing values data = selif(data, data[., "x"] .!= error(0)); Counting missing values ++++++++++++++++++++++++++ In Stata, missing values in individual variables can be counted using the ``count`` command. This command works with a logical statement specifying what condition is to be counted: .. code-block:: Stata count if rep78 == . In GAUSS, missing values can be counted using the :func:`counts` function and ``error(0)``: :: counts(auto2[., "rep78"], error(0)); This finds how many missing values there are in the ``rep78`` variable, found in the ``auto2`` dataframe: :: 5.0000000 Alternatively, missing values are counted as part of the descriptive statistics using :func:`dstatmt`: :: // Get descriptive statistics call dstatmt(auto2); This returns :: --------------------------------------------------------------------------------------------- Variable Mean Std Dev Variance Minimum Maximum Valid Missing --------------------------------------------------------------------------------------------- make ----- ----- ----- ----- ----- 74 0 price 6165 2949 8.7e+06 3291 1.591e+04 74 0 mpg 21.3 5.786 33.47 12 41 74 0 rep78 ----- ----- ----- Poor Excellent 69 5 headroom 2.993 0.846 0.7157 1.5 5 74 0 trunk 13.76 4.277 18.3 5 23 74 0 weight 3019 777.2 6.04e+05 1760 4840 74 0 length 187.9 22.27 495.8 142 233 74 0 turn 39.65 4.399 19.35 31 51 74 0 displacement 197.3 91.84 8434 79 425 74 0 gear_ratio 3.015 0.4563 0.2082 2.19 3.89 74 0 foreign ----- ----- ----- Domestic Foreign 74 0 Removing missing values ++++++++++++++++++++++++ GAUSS provides two options for removing missing values from a matrix: * The :func:`packr()` procedure removes all rows from a matrix that contain any missing values. * The :func:`delif()` procedure removes all rows which meet a particular condition. :: // Create matrix a = { 1 ., . 4, 5 6 }; // Remove all rows with a missing value print packr(a); will return :: 5 6 Conversely :: // Create matrix a = { 1 ., . 4, 5 6 }; // Remove all rows with a missing value // in the second column print delif(a, a[., 2] .== error(0)); will only delete rows with a missing value in the second column :: . 4 5 6 Replacing missing values ++++++++++++++++++++++++++ GAUSS also provides two functions for replacing missing values: * The :func:`missrv` function. * The :func:`impute` function. The :func:`missrv` function replaces all missing values in a matrix with a user-specified value :: // Create matrix a = { 1 ., . 4, 5 6 }; // Replace all missing values with -999 print missrv(a, -999); returns :: 1 -999 -999 4 5 6 This is similar to using the replace variable in Stata .. code-block:: Stata replace a = -999 if a >= . The :func:`impute()` procedure replaces missing values in the columns of a matrix using a specified imputation method. The procedure offers six potential methods for imputation: * ``"mean"`` - replaces missing values with the mean of the column. * ``"median"`` - replaces missing values with the median of the column. * ``"mode"`` - replace missing values with the mode of the column. * ``"pmm"`` - replaces missing values using predictive mean matching. * ``"lrd"`` - replaces missing values using local residual draws. * ``"predict"`` - replaces missing values using linear regression prediction. More details about dealing with missing values are available in: * `The Introduction to Handling Missing Values blog. `_ * `The Data Cleaning section `_ of the GAUSS Data Management Guide. Merging ---------------- In Stata merging: * Is performed using the ``merge`` command. * Is done using a dataset in memory and a data file on disk. * Keeps all data from the data in memory and the `using` data. * Creates a ``_merge`` variable indicating if the data point from the original data, the ``using`` data, or the intersection of the two. * Allows for one-to-one, one-to-many, many-to-one, and many-to-many joining operations. In GAUSS merging: * Is done using the :func:`outerJoin` or :func:`innerJoin` procedures. * Is done completely with data in memory. * The :func:`innerJoin` function only keeps matching observations. * The :func:`outerJoin` function keeps observations either from both data sources or the left-hand data source. * Allows for one-to-one, one-to-many, many-to-one, and many-to-many joining operations. As a first example, let’s consider two dataframes. The first contains ``ID`` and ``Age``: :: ID Age John 22 Mary 18 Susan 34 Connie 45 The second contains ``ID`` and ``Occupation``: :: ID Occupation John Teacher Mary Surgeon Susan Developer Tyler Nurse In Stata, we merge these using ``merge()``: .. code-block:: Stata * Create and save the age dataset clear input str10 ID John Doe Mary Jane Susan Smith Connie Lee end input age 22 18 34 45 end save df1.dta * Now create occupation data * and keep in memory clear input str10 ID John Mary Susan Tyler end input str10 occupation Teacher Surgeon Developer Nurse end merge 1:1 ID using df1 We can do the same in GAUSS using :func:`outerJoin`: :: // Create ID strings string ID1 = { "John", "Mary", "Susan", "Connie" }; string ID2 = { "John", "Mary", "Susan", "Tyler" }; // Create age vector age = { 22, 18, 34, 45 }; // Create occupation string string Occupation = { "Teacher", "Surgeon", "Developer", "Nurse" }; // Create first df df1 = asDF(ID1, "ID") ~ asDF(age, "Age"); // Create second df df2 = asDF(ID2, "ID") ~ asDF(Occupation, "Occupation"); // Merge dataframes df3 = outerJoin(df2, "ID", df1, "ID", "full"); The ``df3`` dataframe contains: :: ID Occupation Age John Teacher 22.000000 Mary Surgeon 18.000000 Susan Developer 34.000000 Tyler Nurse . Connie . 45.000000 The ``df3`` dataframe contains all observations from both the ``df1`` and ``df2`` dataframes, even if they aren't matched, because we included the ``"full"`` option. If we just wanted to keep the matches to the keys from the ``df2`` dataframe we would exclude the ``"full"`` option: :: // Merge dataframes df3 = outerJoin(df2, "ID", df1, "ID"); Now ``df3`` includes: :: ID Occupation Age John Teacher 22.000000 Mary Surgeon 18.000000 Susan Developer 34.000000 Tyler Nurse .