dfLonger#

Purpose#

Converts a GAUSS dataframe in wide panel format to long panel format.

Format#

df_long = dfLonger(df_wide, columns, names_to, values_to[, pctl])#
Parameters:
  • df_wide (Dataframe) – A GAUSS dataframe in wide panel format.

  • columns (String array) – The columns that should be used in the conversion.

  • names_to (String array) – Specifies the variable name(s) for the new column(s) created to store the wide variable names.

  • values_to (String) – The name of the new column containing the values.

  • pctl (Struct) –

    An optional pivotControl structure with the following members:

    pctl.names_prefix

    String, the characters, if any, that should be stripped from the front of the wide variable names before they are assigned to a long column. Default = “”, no prefix.

    pctl.names_sep_split

    String, the character(s), if any, that mark where the names_to names should be broken up. Default = “”, do not break up names_to.

    pctl.names_pattern_split

    String, a regular expression specifying groups that mark where the names_to names should be broken up. Default = “”, do not break up names_to.

    pctl.names_types

    String, containing either: i. a column vector of types for each of the names_to variables, or ii. an n x 2 matrix where the first column is the name of the column and the second column contains the types for the variable in the first column. (i.e. pctl.names_types = { "Index" "number", "Year" "date" };). Valid type options include: “date”, “number”, “category”, and “string”.

    pctl.values_drop_missing

    Scalar, 0 or 1. If set to 1, all rows with missing values will be removed. Default = 0.

Returns:

df_long (Dataframe) – The input data converted to long form.

Examples#

Example 1: Basic default case#

Step one, load the data.

// Load data
file_name = getGAUSSHome("examples/tiny_car_panel.csv");
df_wide = loadd(file_name);

print df_wide;
     Years     Cars_compact       Cars_truck         Cars_SUV
1973-01-01        5.0000000                .        3.0000000
1974-01-01        2.0000000        1.0000000        9.0000000

Step two, get the names of the columns we want to pivot to long format.

// Get all column names and remove the first column name, 'Years'
columns = getcolnames(df_wide);
columns = trimr(columns, 1, 0);
print columns;
Cars_compact
  Cars_truck
    Cars_SUV

Step three, perform the pivot.

names_to = "Class";
values_to = "Count";

df_long = dfLonger(df_wide, columns, names_to, values_to);

After the above code, df_long will equal:

     Years            Class            Count
1973-01-01     Cars_compact        5.0000000
1973-01-01       Cars_truck                .
1973-01-01         Cars_SUV        3.0000000
1974-01-01     Cars_compact        2.0000000
1974-01-01       Cars_truck        1.0000000
1974-01-01         Cars_SUV        9.0000000

Example 2: Basic case with names_prefix#

You may notice that the elements in the Class variable from our previous example contain a redundant prefix, Cars_. We can remove that by using the pivotControl structure and setting the names_prefix member to Class_.

// Load data
file_name = getGAUSSHome("examples/tiny_car_panel.csv");
df_wide = loadd(file_name);

print df_wide;
     Years     Cars_compact       Cars_truck         Cars_SUV
1973-01-01        5.0000000                .        3.0000000
1974-01-01        2.0000000        1.0000000        9.0000000
// Get all column names and remove the first column name, 'Years'
columns = getcolnames(df_wide);
columns = trimr(columns, 1, 0);

names_to = "Class";
values_to = "Count";

// Declare pivotControl structure and fill with default values
struct pivotControl pctl;
pctl = pivotControlCreate();

pctl.names_prefix = "Cars_";

// Call dfLonger with optional control structure
df_long = dfLonger(df_wide, columns, names_to, values_to, pctl);

This time, our Class variable will not contain the redundant prefix as we see below:

     Years       Class            Count
1973-01-01     compact        5.0000000
1973-01-01       truck                .
1973-01-01         SUV        3.0000000
1974-01-01     compact        2.0000000
1974-01-01       truck        1.0000000
1974-01-01         SUV        9.0000000

Example 3: Advanced options: splitting variable names and setting variable types#

In this example, we will use the names_sep_split member of the pivotControl structure to break up the variable names. We will also use the names_types member to set the types for the newly created long form variables.

// Load the data
df_wide = loadd(getGAUSSHome("examples/olympic_vault_wide.csv"));
print df_wide;
      Country     vault_2012_f     vault_2012_m     vault_2016_f     vault_2016_m
United States        48.100000        46.600000        46.900000        45.900000
       Russia        46.400000        46.900000        45.700000        46.000000
        China        44.300000        48.300000        44.300000        45.000000
// Get the list of variables to pivot
// and remove the first column name, 'Country'
columns =  getcolnames(df_wide);
columns = trimr(columns, 1, 0);

print columns;
vault_2012_f
vault_2012_m
vault_2016_f
vault_2016_m

Next we will declare our pivotControl structure and specify that we want to split the pivot variable names at each underscore.

// Declare 'pctl' to be a pivotControl structure
// and fill with default settings
struct pivotControl pctl;
pctl = pivotControlCreate();

// Split the variable names from 'columns', i.e. vault_2012_f, etc
// every time an underscore is encountered
pctl.names_sep_split = "_";

Looking at the variable names we just printed earlier, we can see that if we split them at each underscore, we will end up with three separate tokens from each name. Below we specify a names_to for each of these tokens.

// Set variable names for the new columns
names_to = "event" $| "year" $| "gender";

// Set name of value column
values_to = "score";

Our final setting is to specify any data type changes we would like for any of the names_to variables. Since they start as variable names, their default type is category. We will specify that dflonger() should change the variable year to be of type date with the code below.

// Convert 'year' to be a date variable.
pctl.names_types = { "year" "date" };

Now we call dfLonger() with the inputs we created and print out the results.

df_long = dfLonger(df_wide, columns, names_to, values_to, pctl);
print df_long;
      Country            event             year           gender            score
United States            vault             2012                f        48.100000
United States            vault             2012                m        46.600000
United States            vault             2016                f        46.900000
United States            vault             2016                m        45.900000
       Russia            vault             2012                f        46.400000
       Russia            vault             2012                m        46.900000
       Russia            vault             2016                f        45.700000
       Russia            vault             2016                m        46.000000
        China            vault             2012                f        44.300000
        China            vault             2012                m        48.300000
        China            vault             2016                f        44.300000
        China            vault             2016                m        45.000000

We can verify the column types by using the getcoltypes() function.

print getcoltypes(df_long, "event"$|"year"$|"gender")
    type
category
    date
category

Example 4: Advanced options: splitting variable names with regular expressions#

In this example, we will take two pairs of variable names, Brand1 plus Brand2 and Price1 plus Price2 and:

  1. Split the variable names by word (Brand or Price) and integer (1 or 2).

  2. Create columns from the numberless column names.

  3. Create a new variable named Index containing the integers from the end of the variable names.

// Create file name with full path
file = getGAUSSHome("examples/chocolate_choice_wide_short.csv");

// Load specified variables (specifying the Brand variables to be categorical)
df_choc = loadd(file, "Subject + Trial + Selection + cat(Brand1) + cat(Brand2) + Price1 + Price2");

// Print the first 5 rows
head(df_choc);
Subject      Trial  Selection     Brand1     Brand2     Price1     Price2
   2401          1          1       Dove     Godiva        0.6        0.7
   2401          2          2     Godiva     Godiva        2.7        3.9
   2401          3          2  Hershey's     Godiva        1.7        3.7
   2401          4          1      Lindt      Lindt          1        3.6
   2401          5          2  Hershey's     Godiva        0.8        1.5

Next we will specify the columns input as in our previous examples:

// Column names that will be split
columns = "Brand1" $| "Brand2" $| "Price1" $| "Price2";

Our names_to input will be a little different this time, however.

names_to = ".value" $| "Index";
values_to = "";

Since we will be splitting the variable names into 2 pieces (i.e. Brand1 -> Brand 1), we need to set one element of names_to for each of the pieces from the split variable name.

.value

The first element is “.value”. This tells dflonger() to take the first piece of the variable name (Brand or Price) and create a column with the all the values from all matching columns.

Brand1

Brand2

Price1

Price2

Dove

Godiva

0.6

0.7

Godiva

Godiva

2.7

3.9

Brand

Price

Dove

0.6

Godiva

0.7

Godiva

2.7

Godiva

3.9

In other words, combine all the values from the variables Brand1 and Brand2 into a single variable named Brand and do the same for the Price columns.

Index

The second element of names_to tells dflonger() to create a column named Index and fill it with the contents of the second piece of the variable names (i.e 1 or 2).

Since names_to is specifying where to send the “values”, values_to will be empty.

Regex to split the variable names

Now we can set our other options using the pivotControl structure.

// Declare 'pctl' to be a pivotControl structure
// and fill with default settings
struct pivotControl pctl;
pctl = pivotControlCreate();

The names_pattern_split member of the pivotControl structure is where we can assign a string with a regular expression that will split the columns we specified earlier. A full description of regular expressions is beyond our scope here, however, the most important thing to know is that each statement inside a pair of parentheses is a group. The name will be split by group.

The first group is (Brand|Price). That will match either “Brand” or “Price”. If we had several variable names and did not want to explicitly list them all, we could make our first group ([a-zA-Z]). That would match any upper or lower case characters.

Our second group is ([0-9]). That will match any integer.

// Set the regex to split the variable names
pctl.names_pattern_split = "(Brand|Price)([0-9])";

Change output variable types

By default the variables created from the pieces of the variable names will be categorical variables. Since the second peice of our variable, that we set to be called Index earlier when we set names_to, will be integers, we may not want it to be a categorical variable. So for this example, we will tell GAUSS to make it a numerical variable.

pctl.names_types = { "Index" "number" };

Pivot the data

Now we can call dflonger() with the inputs we have created.

// Convert the dataframe to long format according to our specifications
df_long = dfLonger(df_choc, columns, names_to, values_to, pctl);

// Print the first 5 rows of the long form dataframe
head(df_long);
Subject      Trial  Selection      Index      Brand      Price
   2401          1          1          1       Dove        0.6
   2401          1          1          2     Godiva        0.7
   2401          2          2          1     Godiva        2.7
   2401          2          2          2     Godiva        3.9
   2401          3          2          1  Hershey's        1.7

See also

Functions dfwider()