Chapter 4 Data Preparation

4.1 Data Set for Modeling

PoissonERM is designed for implementing standard binary-endpoint E-R analysis on multiple endpoints. The data set must include:

A “C” column, where any non-NA value means the row is dropped and an NA value means the row is included in modeling;
One column of unique Subject ID where the same ID indicates the same subject;
One column of Outcome where 0 indicates “not observed with event” and 1 indicates “observed with event”;
One column of Event Type (even if there is only one endpoint), whose values can be numeric or character;
One column of Time in days;
Columns of Exposure Metrics in original scale (as many as needed);
Columns of Covariates (as many as needed) if a covariate search or demographic summary is requested;
One column of Demographic-group-by variable, which could be one of the covariates or a new variable.

The following are optional:

One column of Event Sub-Type, providing more detailed categories for “Yes” and “No” in each endpoint;
One column of Grade, if the endpoints are related to Grades such as “\(\le \text{Grade} 2\).”
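The row-inclusion rule for “C” and the 0/1 coding of the Outcome column can be checked before the file is handed to PoissonERM. The following is a minimal sketch in Python, not part of the package. It assumes a tab-separated file, the column names “C” and “DV” from the example data set, and that NA is stored as the literal text `NA` or as an empty cell. The rows in `RAW` are made up for illustration.

```python
import csv
import io

# Made-up rows for illustration; column names follow the example data set.
RAW = """ID\tFLAG\tDV\tTIME\tC
1\t1\t0\t57\tNA
2\t1\t1\t193\tNA
3\t1\t0\t191\texclude
"""

def load_included_rows(text, flag_col="C", outcome_col="DV"):
    """Keep rows whose flag column is NA and check the outcome coding."""
    rows = list(csv.DictReader(io.StringIO(text), delimiter="\t"))
    # Assumption: NA appears as the text "NA" or as an empty cell.
    kept = [r for r in rows if r[flag_col] in ("NA", "")]
    # The outcome must be 0 (no event observed) or 1 (event observed).
    bad = [r for r in kept if r[outcome_col] not in ("0", "1")]
    if bad:
        raise ValueError(f"{len(bad)} row(s) have an outcome other than 0 or 1")
    return kept

included = load_included_rows(RAW)
print(len(included))
```

Running a check like this first makes it clear which rows the “C” column excludes before any modeling is attempted.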

Below is a subset of obsdata.csv in the example:

PROT	ID	SEX	RACE	LOCATION	AGE	BWT	TIME	FLAG	DV	SUB	C
2	1	1	3	2	55	64	57	1	0	3	NA
3	2	2	1	1	26	88	193	1	0	3	NA
2	3	2	2	1	50	55	191	1	0	3	NA
2	4	1	2	1	40	53	317	1	0	3	NA
1	5	2	4	1	39	76	199	1	0	3	NA
3	6	1	1	2	55	68	111	1	0	3	NA
2	1	1	3	2	55	64	57	2	1	2	NA
3	2	2	1	1	26	88	193	2	0	6	NA
2	3	2	2	1	50	55	191	2	0	5	NA
2	4	1	2	1	40	53	317	2	0	6	NA
1	5	2	4	1	39	76	199	2	0	5	NA
3	6	1	1	2	55	68	111	2	1	1	NA

In this example,

“PROT” is the “Demographic-summary-by” column.
“ID” is the “Subject ID” column.
“DV” is the “Outcome” column.
“TIME” is the “Time” column in days.
“FLAG” is the “Event Type” column.
“SUB” is the “Event Sub-Type,” where the same value with different Event Type values means different event sub-type (nested within the Event Type).
“SEX,” “RACE,” “LOCATION,” “AGE,” and “BWT” are the “Covariates” columns. The levels of categorical variables are coded as numbers, and the actual labels can be provided by the user in the control script.
“CAVE1” and “CAVE2” are the “Exposure Metric” columns (not shown in the subset above). Values of 0 are replaced by 0.0001 when the log-scale values are calculated.
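The zero-replacement rule for log-scale exposures can be illustrated with a short sketch. This is not the package's code. It assumes the natural logarithm (the text does not state the base), and the helper name `log_exposure` and the input values are hypothetical.

```python
import math

ZERO_REPLACEMENT = 0.0001  # replacement value stated in the text

def log_exposure(value, floor=ZERO_REPLACEMENT):
    """Return the log of an exposure, replacing an exact 0 with `floor` first."""
    if value < 0:
        raise ValueError("exposure metrics are expected to be non-negative")
    # Assumption: natural log; the source does not specify the base.
    return math.log(floor if value == 0 else value)

cave1 = [0.0, 48.26489, 172.6123]  # illustrative values only
log_cave1 = [log_exposure(x) for x in cave1]
print(log_cave1)
```

Without the replacement, a zero exposure would have no finite log-scale value, so the row could not be used in a log-scale model.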

4.2 New Exposure Data Set for Prediction

The new exposure data set needs to contain:

A column of Group Label for summarizing the result;
Columns of simulated New Exposure Metrics.

The data set may also contain any other columns.

Below is a subset of the simulated new exposures simdata.csv in the example:

GROUP	CAVE1	CAVE2	C
100 mg QD	48.26489	60.89629	1
100 mg QD	172.61230	87.19719	1
100 mg QD	58.98776	85.84614	1
100 mg QD	56.19063	109.53101	1
100 mg QD	104.41777	83.96203	1
100 mg QD	283.81537	75.03528	1
50 mg QD	48.47262	124.34171	0
50 mg QD	32.14167	43.52486	0
50 mg QD	22.93323	33.95415	0
50 mg QD	87.91701	47.09609	0
50 mg QD	57.66175	41.51849	0
50 mg QD	42.05427	46.84531	0
30 mg QD	28.30073	16.79537	0
30 mg QD	28.31021	30.03572	0
30 mg QD	37.09448	35.19235	0
30 mg QD	34.85409	44.31023	0
30 mg QD	17.25611	20.07292	0
30 mg QD	20.40230	23.39462	0

In the full example file, each group has 2500 simulated exposures. In practice, each simulated exposure should be a value corresponding to 52 weeks (1 year) of treatment. For the columns in this example:

“GROUP” is the “Group Label” column. It can be coded as numbers or characters. Levels and labels can be specified by the user in the control script.
“CAVE1” and “CAVE2” are the simulated “New Exposure Metrics.” The exposure column names are not required to match those in the modeling data set. The user can provide the correspondence in the control script.
“C” is an additional column for filtering the data set. All 100 mg QD rows have C = 1, so the user can hide the 100 mg QD group, or show only that group, in the prediction result without modifying the data set externally.
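The effect of filtering on the “C” column can be previewed outside the package. The sketch below is only an illustration in Python, not how PoissonERM applies the filter, which is configured in the control script. It uses a few rows copied from the simdata.csv subset above, and `filter_rows` is a hypothetical helper.

```python
import csv
import io
from collections import Counter

# A few rows copied from the simdata.csv subset shown above.
SIM = """GROUP\tCAVE1\tCAVE2\tC
100 mg QD\t48.26489\t60.89629\t1
100 mg QD\t172.61230\t87.19719\t1
50 mg QD\t48.47262\t124.34171\t0
50 mg QD\t32.14167\t43.52486\t0
30 mg QD\t28.30073\t16.79537\t0
"""

rows = list(csv.DictReader(io.StringIO(SIM), delimiter="\t"))

def filter_rows(rows, column, keep_value):
    """Return the rows whose `column` equals `keep_value` (compared as text)."""
    return [r for r in rows if r[column] == keep_value]

only_100 = filter_rows(rows, "C", "1")     # show only the 100 mg QD group
without_100 = filter_rows(rows, "C", "0")  # hide the 100 mg QD group
counts = Counter(r["GROUP"] for r in rows)  # rows available per group label
print(counts)
```

Counting rows per group label is also a quick way to confirm that every group carries the expected number of simulated exposures.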
