dfLonger
==============================================

Purpose
----------------
Converts a GAUSS dataframe in wide panel format to long panel format.


Format
----------------
.. function:: df_long = dfLonger(df_wide, columns, names_to, values_to [, pctl])

    :param df_wide: A GAUSS dataframe in wide panel format.
    :type df_wide: Dataframe

    :param columns:  The columns that should be used in the conversion.
    :type columns: String array

    :param names_to: Specifies the variable name(s) for the new column(s) created to store the wide variable names.
    :type names_to: String array

    :param values_to: The name of the new column containing the values.
    :type values_to: String

    :param pctl: An optional :class:`pivotControl` structure with the following members:

        .. list-table::
            :widths: auto

            * - pctl.names_prefix
              - String, the characters, if any, that should be stripped from the front of the wide variable names before they are assigned to a long column.  Default = "", no prefix.
            * - pctl.names_sep_split
              - String, the character(s), if any, that mark where the *names_to* names should be broken up. Default = "", do not break up *names_to*.
            * - pctl.names_pattern_split
              - String, a regular expression specifying groups that mark where the *names_to* names should be broken up. Default = "", do not break up *names_to*.
            * - pctl.names_types
              - String, containing either: i. a column vector of types for each of the *names_to* variables, or ii. an *n x 2* matrix where the first column is the name of the column and the second column contains the types for the variable in the first column. (i.e. ``pctl.names_types = { "Index" "number", "Year" "date" };``). Valid type options include: "date", "number", "category", and "string".
            * - pctl.values_drop_missing
              - Scalar, 0 or 1. If set to 1, all rows with missing values will be removed. Default = 0.

    :type pctl: Struct

    :return df_long: The input data converted to long form.

    :rtype df_long: Dataframe

Examples
----------------

Example 1: Basic default case
+++++++++++++++++++++++++++++++++

Step one, load the data.

::

    // Load data
    file_name = getGAUSSHome("examples/tiny_car_panel.csv");
    df_wide = loadd(file_name);

    print df_wide;

::

           Years     Cars_compact       Cars_truck         Cars_SUV
      1973-01-01        5.0000000                .        3.0000000
      1974-01-01        2.0000000        1.0000000        9.0000000


Step two, get the names of the columns we want to pivot to long format.

::

    // Get all column names and remove the first column name, 'Years'
    columns = getcolnames(df_wide);
    columns = trimr(columns, 1, 0);
    print columns;

::

    Cars_compact
      Cars_truck
        Cars_SUV


Step three, perform the pivot.

::

    names_to = "Class";
    values_to = "Count";

    df_long = dfLonger(df_wide, columns, names_to, values_to);

After the above code,  *df_long* will equal:

::

         Years            Class            Count
    1973-01-01     Cars_compact        5.0000000
    1973-01-01       Cars_truck                .
    1973-01-01         Cars_SUV        3.0000000
    1974-01-01     Cars_compact        2.0000000
    1974-01-01       Cars_truck        1.0000000
    1974-01-01         Cars_SUV        9.0000000


Example 2: Basic case with `names_prefix`
+++++++++++++++++++++++++++++++++++++++++++++

You may notice that the elements in the *Class* variable from our previous example
contain a redundant prefix, *Cars_*. We can remove that by using the :class:`pivotControl` structure
and setting the *names_prefix* member to *Class_*.

::

     // Load data
     file_name = getGAUSSHome("examples/tiny_car_panel.csv");
     df_wide = loadd(file_name);

     print df_wide;

::

           Years     Cars_compact       Cars_truck         Cars_SUV
      1973-01-01        5.0000000                .        3.0000000
      1974-01-01        2.0000000        1.0000000        9.0000000

::

    // Get all column names and remove the first column name, 'Years'
    columns = getcolnames(df_wide);
    columns = trimr(columns, 1, 0);

    names_to = "Class";
    values_to = "Count";

    // Declare pivotControl structure and fill with default values
    struct pivotControl pctl;
    pctl = pivotControlCreate();

    pctl.names_prefix = "Cars_";

    // Call dfLonger with optional control structure
    df_long = dfLonger(df_wide, columns, names_to, values_to, pctl);


This time, our *Class* variable will not contain the redundant prefix as we see below:

::

         Years       Class            Count
    1973-01-01     compact        5.0000000
    1973-01-01       truck                .
    1973-01-01         SUV        3.0000000
    1974-01-01     compact        2.0000000
    1974-01-01       truck        1.0000000
    1974-01-01         SUV        9.0000000


Example 3: Advanced options: splitting variable names and setting variable types
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

In this example, we will use the *names_sep_split* member of the :class:`pivotControl` structure to break up the
variable names. We will also use the *names_types* member to set the types for the newly created long form variables.

::

  // Load the data
  df_wide = loadd(getGAUSSHome("examples/olympic_vault_wide.csv"));
  print df_wide;   

::

         Country     vault_2012_f     vault_2012_m     vault_2016_f     vault_2016_m 
   United States        48.100000        46.600000        46.900000        45.900000 
          Russia        46.400000        46.900000        45.700000        46.000000 
           China        44.300000        48.300000        44.300000        45.000000

::

  // Get the list of variables to pivot
  // and remove the first column name, 'Country'
  columns =  getcolnames(df_wide);
  columns = trimr(columns, 1, 0);

  print columns;

::

    vault_2012_f 
    vault_2012_m 
    vault_2016_f 
    vault_2016_m


Next we will declare our :class:`pivotControl` structure and specify that we want to split the pivot variable names at each underscore.

::
  
  // Declare 'pctl' to be a pivotControl structure
  // and fill with default settings
  struct pivotControl pctl;
  pctl = pivotControlCreate();
  
  // Split the variable names from 'columns', i.e. vault_2012_f, etc
  // every time an underscore is encountered
  pctl.names_sep_split = "_";


Looking at the variable names we just printed earlier, we can see that if we split them at each underscore, we will end up with three separate tokens from each name. Below
we specify a *names_to* for each of these tokens.

::

  // Set variable names for the new columns
  names_to = "event" $| "year" $| "gender";
  
  // Set name of value column
  values_to = "score";


Our final setting is to specify any data type changes we would like for any of the  *names_to* variables. Since they start as variable names, their default type is *category*. We will specify that :func:`dflonger` should change the variable *year* to be of type *date* with the code below. 

::
  
  // Convert 'year' to be a date variable.
  pctl.names_types = { "year" "date" };


Now we call :func:`dfLonger` with the inputs we created and print out the results.

::
  
  df_long = dfLonger(df_wide, columns, names_to, values_to, pctl); 
  print df_long;

::

         Country            event             year           gender            score 
   United States            vault             2012                f        48.100000 
   United States            vault             2012                m        46.600000 
   United States            vault             2016                f        46.900000 
   United States            vault             2016                m        45.900000 
          Russia            vault             2012                f        46.400000 
          Russia            vault             2012                m        46.900000 
          Russia            vault             2016                f        45.700000 
          Russia            vault             2016                m        46.000000 
           China            vault             2012                f        44.300000 
           China            vault             2012                m        48.300000 
           China            vault             2016                f        44.300000 
           China            vault             2016                m        45.000000

We can verify the column types by using the :func:`getcoltypes` function.

::

  print getcoltypes(df_long, "event"$|"year"$|"gender")

::

            type 
        category 
            date 
        category 
 

Example 4: Advanced options: splitting variable names with regular expressions
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

In this example, we will take two pairs of variable names, *Brand1* plus *Brand2* and *Price1* plus *Price2* and:

1. Split the variable names by word (Brand or Price) and integer (1 or 2).
2. Create columns from the numberless column names.
3. Create a new variable named *Index* containing the integers from the end of the variable names.

::

    // Create file name with full path
    file = getGAUSSHome("examples/chocolate_choice_wide_short.csv");

    // Load specified variables (specifying the Brand variables to be categorical)
    df_choc = loadd(file, "Subject + Trial + Selection + cat(Brand1) + cat(Brand2) + Price1 + Price2");

    // Print the first 5 rows
    head(df_choc);

::

   Subject      Trial  Selection     Brand1     Brand2     Price1     Price2 
      2401          1          1       Dove     Godiva        0.6        0.7 
      2401          2          2     Godiva     Godiva        2.7        3.9 
      2401          3          2  Hershey's     Godiva        1.7        3.7 
      2401          4          1      Lindt      Lindt          1        3.6 
      2401          5          2  Hershey's     Godiva        0.8        1.5


Next we will specify the *columns* input as in our previous examples:    
    
::
    
    // Column names that will be split
    columns = "Brand1" $| "Brand2" $| "Price1" $| "Price2";


Our *names_to* input will be a little different this time, however.

::
    
    names_to = ".value" $| "Index";
    values_to = "";

Since we will be splitting the variable names into 2 pieces (i.e. *Brand1* -> *Brand* *1*), we need to set one element of *names_to* for each of the pieces from the split variable name. 


**.value**

The first element is *".value"*. This tells :func:`dflonger` to take the first piece of the variable name (*Brand* or *Price*) and create a column with the all the values from all matching columns. 

.. list-table::
    :widths: 25 25 25 25
    :header-rows: 1

    * - Brand1
      - Brand2
      - Price1
      - Price2
    * - Dove
      - Godiva
      - 0.6
      - 0.7
    * - Godiva
      - Godiva
      - 2.7
      - 3.9


.. list-table::
    :widths: 50 50
    :header-rows: 1

    * - Brand
      - Price
    * - Dove
      - 0.6
    * - Godiva
      - 0.7
    * - Godiva
      - 2.7
    * - Godiva
      - 3.9

In other words, combine all the values from the variables *Brand1* and *Brand2* into a single variable named *Brand* and do the same for the *Price* columns.  

**Index**

The second element of *names_to* tells :func:`dflonger` to create a column named *Index* and fill it with the contents of the second piece of the variable names (i.e *1* or *2*). 

Since *names_to* is specifying where to send the "values", *values_to* will be empty.

**Regex to split the variable names**

Now we can set our other options using the `pivotControl` structure.

::
    
    // Declare 'pctl' to be a pivotControl structure
    // and fill with default settings
    struct pivotControl pctl;
    pctl = pivotControlCreate();

The *names_pattern_split* member of the `pivotControl` structure is where we can assign a string with a regular expression that will split the *columns* we specified earlier. A full description of regular expressions is beyond our scope here, however, the most important thing to know is that each statement inside a pair of parentheses is a group. The name will be split by group.

The first group is ``(Brand|Price)``. That will match either "Brand" or "Price". If we had several variable names and did not want to explicitly list them all, we could make our first group ``([a-zA-Z])``. That would match any upper or lower case characters.

Our second group is  ``([0-9])``. That will match any integer. 

::
    
    // Set the regex to split the variable names
    pctl.names_pattern_split = "(Brand|Price)([0-9])";


**Change output variable types**

By default the variables created from the pieces of the variable names will be categorical variables. Since the second peice of our variable, that we set to be called *Index* earlier when we set *names_to*, will be integers, we may not want it to be a categorical variable. So for this example, we will tell GAUSS to make it a numerical variable. 
::
    
    pctl.names_types = { "Index" "number" };  

**Pivot the data**

Now we can call :func:`dflonger` with the inputs we have created.

::
    
    // Convert the dataframe to long format according to our specifications
    df_long = dfLonger(df_choc, columns, names_to, values_to, pctl);

    // Print the first 5 rows of the long form dataframe
    head(df_long);

::

   Subject      Trial  Selection      Index      Brand      Price 
      2401          1          1          1       Dove        0.6 
      2401          1          1          2     Godiva        0.7 
      2401          2          2          1     Godiva        2.7 
      2401          2          2          2     Godiva        3.9 
      2401          3          2          1  Hershey's        1.7    


.. seealso:: Functions :func:`dfwider`