Chapter 4 Data Preparation
4.1 Data Set for Modeling
PoissonERM is designed to run standard binary-endpoint exposure-response (E-R) analyses on multiple endpoints. The data set must include:
- A column named “C,” where any non-NA value drops the row from modeling and an NA value keeps the row;
- One column of unique Subject ID where the same ID indicates the same subject;
- One column of Outcome where 0 indicates “not observed with event” and 1 indicates “observed with event”;
- One column of Event Type (even if there is only one endpoint); values can be numeric or character;
- One column of Time in days;
- Columns of Exposure Metrics in original scale (as many as needed);
- Columns of Covariates (as many as needed) if a covariate search or demographic summary is requested;
- One column of Demographic-group-by variable, which could be one of the covariates or a new variable.
The following columns are optional:
- One column of Event Sub-Type, providing more detailed categories for “Yes” and “No” in each endpoint;
- One column of Grade if the endpoints are related to Grades such as “\(\le \text{Grade} 2\).”
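The row-inclusion rule for the “C” column can be sketched in a few lines of Python. This is only an illustration of the rule described above, not part of PoissonERM: the helper names `is_na` and `rows_for_modeling`, the sample rows, and the assumption that missing values appear in the CSV text as `NA` or as an empty cell are all hypothetical.

```python
import csv
import io

# Assumption: missing values are written as "NA" or left empty in the CSV text.
NA_TOKENS = {"", "NA"}


def is_na(value):
    """Return True if a raw CSV cell should be treated as NA."""
    return value is None or value.strip() in NA_TOKENS


def rows_for_modeling(rows, flag_column="C"):
    """Keep rows whose flag column is NA; any non-NA value drops the row."""
    return [row for row in rows if is_na(row.get(flag_column))]


# Hypothetical sample: subjects 2 and 4 carry a non-NA value in "C".
sample = io.StringIO("ID,C\n1,NA\n2,1\n3,\n4,exclude\n")
kept = rows_for_modeling(csv.DictReader(sample))
print([row["ID"] for row in kept])
```

Running a check like this before modeling makes it easy to confirm how many rows the “C” column excludes.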
Below is a subset of obsdata.csv in the example:
PROT | ID | SEX | RACE | LOCATION | AGE | BWT | TIME | CAVE1 | CAVE2 | FLAG | DV | SUB | C |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2 | 1 | 1 | 3 | 2 | 55 | 64 | 57 | 0 | 0 | 1 | 0 | 3 | NA |
3 | 2 | 2 | 1 | 1 | 26 | 88 | 193 | 0 | 0 | 1 | 0 | 3 | NA |
2 | 3 | 2 | 2 | 1 | 50 | 55 | 191 | 0 | 0 | 1 | 0 | 3 | NA |
2 | 4 | 1 | 2 | 1 | 40 | 53 | 317 | 0 | 0 | 1 | 0 | 3 | NA |
1 | 5 | 2 | 4 | 1 | 39 | 76 | 199 | 0 | 0 | 1 | 0 | 3 | NA |
3 | 6 | 1 | 1 | 2 | 55 | 68 | 111 | 0 | 0 | 1 | 0 | 3 | NA |
2 | 1 | 1 | 3 | 2 | 55 | 64 | 57 | 0 | 0 | 2 | 1 | 2 | NA |
3 | 2 | 2 | 1 | 1 | 26 | 88 | 193 | 0 | 0 | 2 | 0 | 6 | NA |
2 | 3 | 2 | 2 | 1 | 50 | 55 | 191 | 0 | 0 | 2 | 0 | 5 | NA |
2 | 4 | 1 | 2 | 1 | 40 | 53 | 317 | 0 | 0 | 2 | 0 | 6 | NA |
1 | 5 | 2 | 4 | 1 | 39 | 76 | 199 | 0 | 0 | 2 | 0 | 5 | NA |
3 | 6 | 1 | 1 | 2 | 55 | 68 | 111 | 0 | 0 | 2 | 1 | 1 | NA |
In this example,
- “PROT” is the “Demographic-group-by” column.
- “ID” is the “Subject ID” column, and “DV” is the “Outcome” column.
- “TIME” is the “Time” column in days.
- “FLAG” is the “Event Type” column.
- “SUB” is the “Event Sub-Type” column. Sub-types are nested within the Event Type, so the same SUB value under different Event Type values denotes different event sub-types.
- “SEX,” “RACE,” “LOCATION,” “AGE,” and “BWT” are the “Covariates” columns. The levels of categorical variables are coded as numbers, and the actual labels can be provided by the user in the control script.
- “CAVE1” and “CAVE2” are the “Exposure Metric” columns. Values of 0 are replaced by 0.0001 when the log scale is calculated.
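Two of the conventions above, the zero replacement before taking logs and the nesting of sub-types within event types, can be illustrated with a short Python sketch. It is not PoissonERM code. The function name `log_exposure` is hypothetical, the natural logarithm is an assumption (the text does not state the base), and the last (FLAG, SUB) pair is an invented row added only to show the nesting.

```python
import math

ZERO_REPLACEMENT = 0.0001  # replacement value stated in the text


def log_exposure(value):
    """Log of an exposure metric, with 0 replaced first.

    Assumption: natural logarithm; the text does not state the base.
    """
    return math.log(ZERO_REPLACEMENT if value == 0 else value)


# Distinct (FLAG, SUB) pairs taken from the example rows above.
pairs = [(1, 3), (2, 2), (2, 6), (2, 5), (2, 1)]
# Hypothetical extra row: SUB = 3 recorded under event type 2.
pairs.append((2, 3))

# Sub-types are nested, so a category is identified by the pair, not by SUB alone:
# (1, 3) and (2, 3) are two different sub-types.
sub_types = sorted(set(pairs))

print(log_exposure(0))
print(sub_types)
```

Keying sub-types by the (Event Type, Sub-Type) pair avoids accidentally merging categories that only share a numeric code.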
4.2 New Exposure Data Set for Prediction
The new exposure data set must contain:
- A column of Group Label for summarizing the result;
- Columns of simulated New Exposure Metrics.
The data set may also contain any other columns.
Below is a subset of the simulated new exposures simdata.csv in the example:
GROUP | CAVE1 | CAVE2 | C |
---|---|---|---|
100 mg QD | 48.26489 | 60.89629 | 1 |
100 mg QD | 172.61230 | 87.19719 | 1 |
100 mg QD | 58.98776 | 85.84614 | 1 |
100 mg QD | 56.19063 | 109.53101 | 1 |
100 mg QD | 104.41777 | 83.96203 | 1 |
100 mg QD | 283.81537 | 75.03528 | 1 |
50 mg QD | 48.47262 | 124.34171 | 0 |
50 mg QD | 32.14167 | 43.52486 | 0 |
50 mg QD | 22.93323 | 33.95415 | 0 |
50 mg QD | 87.91701 | 47.09609 | 0 |
50 mg QD | 57.66175 | 41.51849 | 0 |
50 mg QD | 42.05427 | 46.84531 | 0 |
30 mg QD | 28.30073 | 16.79537 | 0 |
30 mg QD | 28.31021 | 30.03572 | 0 |
30 mg QD | 37.09448 | 35.19235 | 0 |
30 mg QD | 34.85409 | 44.31023 | 0 |
30 mg QD | 17.25611 | 20.07292 | 0 |
30 mg QD | 20.40230 | 23.39462 | 0 |
In this example, each group has 2500 simulated exposures. In practice, each simulated exposure should be a value corresponding to 52 weeks (1 year) of treatment. For the columns in this example:
- “GROUP” is the “Group Label” column. It could be coded in numbers or characters. Levels and Labels can be specified by user in the control script.
- “CAVE1” and “CAVE2” are the simulated “New Exposure Metrics.” The exposure column names are not required to match those in the modeling data set; the user can provide the correspondence in the control script.
- “C” is an additional column for filtering the data set. All 100 mg QD rows have C = 1, so the user can hide the 100 mg QD group, or show only that group, in the prediction results without modifying the data set externally.
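The effect of filtering on “C” can be sketched as follows. In PoissonERM the filter is specified in the control script; the Python below only illustrates the logic and is not the package's interface. The helper `groups_after_filter` is hypothetical, and the three rows are copied from the table above (one per group).

```python
import csv
import io
from collections import defaultdict

# One row per group, copied from the example table above.
sim_csv = """GROUP,CAVE1,CAVE2,C
100 mg QD,48.26489,60.89629,1
50 mg QD,48.47262,124.34171,0
30 mg QD,28.30073,16.79537,0
"""


def groups_after_filter(rows, keep):
    """Collect CAVE1 values per GROUP for the rows accepted by `keep`."""
    grouped = defaultdict(list)
    for row in rows:
        if keep(row):
            grouped[row["GROUP"]].append(float(row["CAVE1"]))
    return dict(grouped)


rows = list(csv.DictReader(io.StringIO(sim_csv)))

# Hide the 100 mg QD group (drop rows with C = 1).
hidden = groups_after_filter(rows, lambda r: r["C"] != "1")
# Show only the 100 mg QD group (keep rows with C = 1).
only = groups_after_filter(rows, lambda r: r["C"] == "1")

print(sorted(hidden))
print(only)
```

Because the selection depends only on the value of “C,” the same data file can serve both views of the prediction results.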