Chapter 4 Data Preparation

4.1 Data Set for Modeling

PoissonERM is designed for implementing standard binary-endpoint E-R analysis on multiple endpoints. The data set must include:

  • A column named “C”, where any non-NA value drops the row and NA keeps the row in modeling;
  • One column of unique Subject ID where the same ID indicates the same subject;
  • One column of Outcome where 0 indicates “not observed with event” and 1 indicates “observed with event”;
  • One column of Event Type (even if there is only one endpoint), with values that can be numeric or character;
  • One column of Time in days;
  • Columns of Exposure Metrics in original scale (as many as needed);
  • Columns of Covariates (as many as needed) if a covariate search or demographic summary is requested;
  • One column of Demographic-group-by variable, which could be one of the covariates or a new variable.

The following columns are optional:

  • One column of Event Sub-Type, providing more detailed categories for “Yes” and “No” in each endpoint;
  • One column of Grade, if the endpoints are related to grades such as “\(\le \text{Grade} 2\).”
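As a rough illustration of these requirements, the sketch below checks a table for a few of the required columns and applies the “C” rule (NA keeps the row). It is not part of PoissonERM. It is written in Python with only the standard library, the column names follow the obsdata.csv example, and the names `REQUIRED`, `missing_columns`, and `rows_for_modeling` are hypothetical. It also assumes that NA is stored as the literal text `NA` or as an empty field.

```python
import csv
import io

# Column names as used in the obsdata.csv example (an assumption here);
# a real data set may use different names.
REQUIRED = ["C", "ID", "DV", "FLAG", "TIME"]

def missing_columns(header, required=REQUIRED):
    """Return the required column names that are absent from the header."""
    return [name for name in required if name not in header]

def rows_for_modeling(rows):
    """Keep rows whose C value is NA (assumed stored as 'NA' or empty text)."""
    return [row for row in rows if row["C"].strip() in ("NA", "")]

# Small made-up sample, not taken from obsdata.csv.
sample = io.StringIO(
    "ID,TIME,FLAG,DV,C\n"
    "1,57,1,0,NA\n"
    "2,193,1,0,1\n"
    "3,191,2,1,NA\n"
)
reader = csv.DictReader(sample)
rows = list(reader)

print(missing_columns(reader.fieldnames))              # no required column is missing
print([row["ID"] for row in rows_for_modeling(rows)])  # subject 2 is dropped (C = 1)
```

Running a check of this kind before modeling makes it easier to spot a misnamed column or an unintended non-NA flag in “C”.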

Below is a subset of obsdata.csv in the example:

PROT  ID  SEX  RACE  LOCATION  AGE  BWT  TIME  CAVE1  CAVE2  FLAG  DV  SUB  C
2     1   1    3     2         55   64   57    0      0      1     0   3    NA
3     2   2    1     1         26   88   193   0      0      1     0   3    NA
2     3   2    2     1         50   55   191   0      0      1     0   3    NA
2     4   1    2     1         40   53   317   0      0      1     0   3    NA
1     5   2    4     1         39   76   199   0      0      1     0   3    NA
3     6   1    1     2         55   68   111   0      0      1     0   3    NA
2     1   1    3     2         55   64   57    0      0      2     1   2    NA
3     2   2    1     1         26   88   193   0      0      2     0   6    NA
2     3   2    2     1         50   55   191   0      0      2     0   5    NA
2     4   1    2     1         40   53   317   0      0      2     0   6    NA
1     5   2    4     1         39   76   199   0      0      2     0   5    NA
3     6   1    1     2         55   68   111   0      0      2     1   1    NA

In this example,

  • “PROT” is the “Demographic-group-by” column.
  • “ID” is the “Subject ID” column, and “DV” is the “Outcome” column.
  • “TIME” is the “Time” column in days.
  • “FLAG” is the “Event Type” column.
  • “SUB” is the “Event Sub-Type” column. Sub-types are nested within the Event Type, so the same “SUB” value under different “FLAG” values denotes different event sub-types.
  • “SEX,” “RACE,” “LOCATION,” “AGE,” and “BWT” are the “Covariates” columns. The levels of categorical variables are coded as numbers, and the actual labels can be provided by the user in the control script.
  • “CAVE1” and “CAVE2” are the “Exposure Metric” columns. Values of 0 are replaced by 0.0001 when the log scale is calculated.
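The zero replacement for exposure metrics can be illustrated with a minimal sketch. This is not PoissonERM's own code. The function name `log_exposure` is hypothetical, and the use of the natural logarithm is an assumption, since the text only says “log-scale.”

```python
import math

ZERO_REPLACEMENT = 0.0001  # value stated in the text for exposures equal to 0

def log_exposure(value, floor=ZERO_REPLACEMENT):
    """Natural log of an exposure, with an exact 0 replaced by `floor` first.

    The logarithm base is an assumption; the source only says "log-scale".
    """
    return math.log(floor if value == 0 else value)

print(log_exposure(0))         # log of the replacement value, since log(0) is undefined
print(log_exposure(48.26489))  # positive exposures are transformed unchanged
```

Without the replacement, an exposure of 0 (for example, a placebo subject) would have no defined value on the log scale.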

4.2 New Exposure Data Set for Prediction

The new exposure data set needs to contain:

  • A column of Group Label for summarizing the result;
  • Columns of simulated New Exposure Metrics.

The data set may also contain any other columns.

Below is a subset of the simulated new exposures simdata.csv in the example:

GROUP      CAVE1      CAVE2      C
100 mg QD  48.26489   60.89629   1
100 mg QD  172.61230  87.19719   1
100 mg QD  58.98776   85.84614   1
100 mg QD  56.19063   109.53101  1
100 mg QD  104.41777  83.96203   1
100 mg QD  283.81537  75.03528   1
50 mg QD   48.47262   124.34171  0
50 mg QD   32.14167   43.52486   0
50 mg QD   22.93323   33.95415   0
50 mg QD   87.91701   47.09609   0
50 mg QD   57.66175   41.51849   0
50 mg QD   42.05427   46.84531   0
30 mg QD   28.30073   16.79537   0
30 mg QD   28.31021   30.03572   0
30 mg QD   37.09448   35.19235   0
30 mg QD   34.85409   44.31023   0
30 mg QD   17.25611   20.07292   0
30 mg QD   20.40230   23.39462   0

In this example, each group has 2500 simulated exposures. In practice, each simulated exposure should be a value corresponding to 52 weeks (1 year) of treatment. For the columns in this example:

  • “GROUP” is the “Group Label” column. It can be coded as numbers or characters. Levels and labels can be specified by the user in the control script.
  • “CAVE1” and “CAVE2” are the simulated “New Exposure Metrics.” The exposure column names are not required to match those in the modeling data set. The user can provide the correspondence in the control script.
  • “C” is an additional column for filtering the data set. All 100 mg QD rows have C = 1, so the user can hide the 100 mg QD group, or show only that group, in the prediction result without modifying the data set externally.
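The effect of filtering on “C” can be sketched as below. This only illustrates the idea in Python; in PoissonERM the filtering is requested through the control script, whose syntax is not shown here. The helper `filter_rows` is hypothetical, and the three rows are copied from the simdata.csv excerpt above.

```python
# One row per group, copied from the simdata.csv excerpt.
rows = [
    {"GROUP": "100 mg QD", "CAVE1": 48.26489, "CAVE2": 60.89629, "C": 1},
    {"GROUP": "50 mg QD", "CAVE1": 48.47262, "CAVE2": 124.34171, "C": 0},
    {"GROUP": "30 mg QD", "CAVE1": 28.30073, "CAVE2": 16.79537, "C": 0},
]

def filter_rows(rows, column, value, keep=True):
    """Keep (keep=True) or drop (keep=False) rows where row[column] == value."""
    return [row for row in rows if (row[column] == value) == keep]

only_100 = filter_rows(rows, "C", 1, keep=True)      # show only the 100 mg QD group
without_100 = filter_rows(rows, "C", 1, keep=False)  # hide the 100 mg QD group

print([row["GROUP"] for row in only_100])     # ['100 mg QD']
print([row["GROUP"] for row in without_100])  # ['50 mg QD', '30 mg QD']
```

Keeping the flag as a column in the data set means the same file can serve both views of the prediction result.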