Chapter 5 Control Script for Modeling: `user-input.r`

The control script exposes many modeling options, so the user can see how the modeling results change under different settings. Each option is presented in the context of the example dataset, which is the clearest way to show them all.

The example analysis has the following components, all of which are specified in the user-input.r control file.

2 exposure metrics, 3 categorical covariates and 2 continuous covariates were considered for all 3 endpoints.
All exposures, covariates and events were summarized by protocol number (“PROT”).
The low-incidence threshold was 10%, so only 2 endpoints were analyzed (Adverse Event 1 had an incidence rate below 10%).
Event sub-type was provided for all endpoints, breaking the “observed with event” and “not observed with event” outcomes down into a more detailed classification.
Covariates were included in the final model if any qualified.
Log and square-root transformations were considered for exposure metrics.
Log transformation was considered for continuous covariates.
No reference value was used for continuous covariates.
Exposure selection followed the \(p\)-value significance criterion, and backwards deletion did not remove the exposure metric regardless of its significance. The final model may contain no exposure metric if none meets the exposure selection criterion.
Tables were all saved as .tex (LaTeX format).

This section explains how to write a user-input.r control file for modeling. The examples shown below can be used as a template by combining all chunks in order (the user needs to update the column names and options accordingly).

The control script requires the tidyverse package.

library(tidyverse)

5.1 Data set

input.data.name is the data file for modeling. This data file must be under the directory of pathRunType.

input.data.name <- "obsdata.csv"

5.2 Unique Patient Identifier

pat.num is the column name of the subject ID variable.

pat.num <- "ID" #ID is the column name in data set

5.3 Time

EVDUR is the name of the time column in the data set, and EVDUR.unit is the time unit. The time values must be in days.

#Column for time in days
EVDUR <- "TIME"
EVDUR.unit <- "days"

5.4 Endpoints to be analysed

This section introduces the complex set up regarding the endpoints.

Outcome column, with its levels and labels:

dv <- "DV" #DV is the outcome column name in data set
# Levels as they appear in the dataset
dv.levels <- c(0, 1)
# Labels you'd like to see in the report
dv.labels <- c("No", "Yes")

Event Type (Endpoint flag) column

# Column describing endpoints
endpcolName <- "FLAG" #FLAG is the flag column name in data set
# Subset of endpoints in analysis dataset to be analyzed
# Can be numbers or character strings, however they show up in the dataset
endpoints <- c(1,2,3) #the actual values in FLAG to be used in this analysis
#endpoints <- c(1,3) #FLAG=2 will be ignored in this analysis

Event Names set up. Provide the name of each endpoint.

# ENDPOINTS names.
# Names for endpoints as they should appear in tables and figures.
# Should be the same length as the "endpoints" vector above.
# No endpoint name may be a substring of another endpoint name
# Bad example
#endpName <- c("Adverse",
#              "Adverse Event",
#              "Adverse Event Type")

# EXAMPLE:
endpName <- c("Adverse Event Type 1",
              "Adverse Event Type 2",
              "Adverse Event Type 3")

endpName <- sapply(X = endpName, simplify = F, USE.NAMES = T, FUN = function(n){
  if(n == "Adverse Event Type 1"){
    numb <- endpoints[1] #FLAG = 1 (endpoints[1]) is "Adverse Event Type 1"
    label <- n
  }else if(n == "Adverse Event Type 2"){
    numb <- endpoints[2] #FLAG = 2 (endpoints[2]) is "Adverse Event Type 2"
    label <- n
  }else if(n == "Adverse Event Type 3"){
    numb <- endpoints[3] #FLAG = 3 (endpoints[3]) is "Adverse Event Type 3"
    label <- n
  }else{
    numb <- 0
    label <- 0
  }
  endp.list <- c(numb, label)
  return(endp.list)
 } # function(n)
) # sapply
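The substring rule stated in the comments above is easy to violate by accident, so it can help to check the names before running the analysis. The following is a minimal sketch, not part of the package: `has_substring_clash()` and `endp.labels` are hypothetical names introduced only for this illustration.

```r
# Hypothetical helper: TRUE if any name is contained in another name
has_substring_clash <- function(nms) {
  for (i in seq_along(nms)) {
    # fixed = TRUE treats the name as literal text, not a regular expression
    if (any(grepl(nms[i], nms[-i], fixed = TRUE))) return(TRUE)
  }
  FALSE
}

bad.labels <- c("Adverse", "Adverse Event", "Adverse Event Type")
endp.labels <- c("Adverse Event Type 1",
                 "Adverse Event Type 2",
                 "Adverse Event Type 3")

has_substring_clash(bad.labels)   # TRUE: "Adverse" is inside "Adverse Event"
has_substring_clash(endp.labels)  # FALSE: no name contains another
```

Note that identical names are also reported as a clash, since a name is a substring of its own duplicate.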

5.5 Event Sub-Type Column and Set up (optional).

This section introduces the set up for event sub-type information.

sub.endpcolName is the column name of the sub-type information, and sub.endpName connects the sub-type values to the endpoints names with proper sub-type names.

In the example obsdata.csv, the values in column “SUB” are all numeric, and the same number refers to a different sub-type name depending on the endpoint type.

sub.endpcolName and sub.endpName are optional and may be omitted.

sub.endpcolName <- "SUB" #SUB is the column name in the data set
sub.endpName <- list(`Adverse Event Type 1` = c(`Severe` = 1, 
                                                `Mild` = 2,
                                                `None` = 3),
                     #Endpoint name = c(Actual sub-category name = Value in the column)
                     `Adverse Event Type 2` = c(`SubType 1` = 1,
                                                `SubType 2` = 2,
                                                `SubType 3` = 3,
                                                `SubType 4` = 4,
                                                `SubType 5` = 5,
                                                `SubType 6` = 6),
                     `Adverse Event Type 3` = c(`Severe` = 1,
                                                `Moderate` = 2,
                                                `Mild` = 3,
                                                `None` = 4))
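To make the mapping concrete, the sketch below translates a value of the SUB column into its sub-type name for a given endpoint. It is only an illustration of how the list is structured: `sub_label()` is a hypothetical helper, not a function of the package, and the list is shortened to two endpoints.

```r
sub.endpName <- list(`Adverse Event Type 1` = c(`Severe` = 1, `Mild` = 2, `None` = 3),
                     `Adverse Event Type 3` = c(`Severe` = 1, `Moderate` = 2,
                                                `Mild` = 3, `None` = 4))

# Hypothetical helper: look up the sub-type name(s) for one endpoint
sub_label <- function(endp, value, map = sub.endpName) {
  codes <- map[[endp]]                # named vector: sub-type name = value in SUB
  names(codes)[match(value, codes)]   # NA if the value is not listed
}

sub_label("Adverse Event Type 1", 2)        # "Mild"
sub_label("Adverse Event Type 3", c(2, 3))  # "Moderate" "Mild"
```

The same value (here 2) resolves to a different name for each endpoint, which is why the list is keyed by endpoint name.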

5.6 Grade column and Set up (optional).

If the endpoints are grade related, provide the grade column name as dvg and a summary of grades within each endpoint will be generated.

#The same column cannot be set as dvg and sub.endpcolName at the same time
dvg <- "SUB" #SUB is the column name in the data set

dvg could be missing.

Usually, dvg and sub.endpcolName are not both used in the same analysis.

5.7 Exposure Metrics

This section covers the exposure metric information. Besides the column names in orig.exposureCov, the user needs to provide a name for each exposure metric, to be shown in tables, figures and the report text, as well as the endpoint(s) each exposure is used with.

Provide Exposure Metric Columns

# EXPOSURE covariates - list
# Names should correspond to names in analysis dataset.
# EXAMPLE:
orig.exposureCov <- c("CAVE1","CAVE2") #exposure metric column names in data

Set up Exposure Metrics

desc.exposureCov.1 <- sapply(X = orig.exposureCov, simplify = F, USE.NAMES = T, FUN = function(f){
  if(f == "CAVE1"){ 
    #name to show in tables or figures
    title = "C[ave1]~(ng/mL)" #use ~ for white space; use [x] for subscript text
    label = "Time-weighted average concentration A" #Name to show in report context
    #end.p = c(endpoints[1]) #will be used in the first endpoint
    end.p = c(endpoints[1:3]) #will be used in all 3 endpoints
  }else if(f == "CAVE2"){
    title = "C[ave2]~(ng/mL)"
    label = "Time-weighted average concentration B"
    end.p = c(endpoints[1:3])
  }else{
    stop("Unknown exposure metric column: ", f) #fail early on a column with no set up
  }
  list(title = title, label = label, end.p = end.p)
 } # function(f)
) # sapply

5.8 Categorical Covariates

The set up for categorical covariates is very similar to that for exposure metrics. Categorical covariates have the same name in tables, figures and the report text.

A categorical covariate can be provided as a summary-only variable, which appears only in the demographic summary and is excluded from the covariate search in modeling.

Categorical Covariates Column Names

# CATEGORICAL covariates - list
# Names should correspond to names in analysis dataset.
# EXAMPLE:
full.cat <- c("RACE","SEX", "LOCATION")

Set Up Categorical Covariates

# Assign more elaborate names and factor levels to categorical covs
# NOTE: Assign the desired reference value as the first element of "levels"
# EXAMPLE:
 full.cat.1 <- sapply(X = full.cat, simplify = F, USE.NAMES = T, FUN = function(j){
   if(j == "RACE"){
     #categorical values are coded as numbers 1,2,3,4 in data set
     #the levels vector is corresponding to the order of numerical levels in the data set
     levels = c("White", "Black", "Asian", "Other") #1="White", 2="Black", ...
     label = "Race" #Name of the variable to show in report
     #end.p = c(endpoints[1]) #use for the first endpoint
     #end.p = "summary only" #use for demographic summary only, won't be included in modeling
     end.p = c(endpoints[1:3]) #use for all endpoints
   }else if(j == "SEX"){
     levels = c("Male", "Female")
     label = "Sex"
     end.p = c(endpoints[1:3])
   }else if(j == "LOCATION"){
     levels = c("US", "Non-US")
     label = "Geographical~Location" #use ~ for white space
     end.p = c(endpoints[1:3])
   }
   list(levels = levels, label = label, end.p = end.p)
  } # function(j)
 ) # sapply

5.9 Continuous Covariates

The set up for continuous covariates is very similar to that for exposure metrics. Continuous covariates have the same name in tables, figures and the report text.

A continuous covariate can be provided as a summary-only variable, which appears only in the demographic summary and is excluded from the covariate search in modeling.

Continuous Covariates Column Names

# CONTINUOUS covariates - list
# Names should correspond to names in analysis dataset.
# EXAMPLE:
orig.con <- c("AGE","BWT")

Set Up Continuous Covariates

# Assign more elaborate names to continuous covs
# Write what you want to end up in the plot(s)
# EXAMPLE:
orig.con.1 <- sapply(X = orig.con, simplify = F, USE.NAMES = T, FUN = function(k){
  if(k == "AGE"){
    title = "age" #Name of the variable to show in report
    #end.p = c(endpoints[1]) #use for first endpoint
    #end.p = "summary only" #use for demographic summary only, won't be included in modeling
    end.p = c(endpoints[1:3])
  }else if(k == "BWT"){
    title = "Baseline~Bodyweight~(kg)" #use ~ for white space
    end.p = c(endpoints[1:3])
  }else{
    stop("Unknown continuous covariate column: ", k) #fail early on a column with no set up
  }
  list(title = title, end.p = end.p)
 } # function(k)
) # sapply

5.10 Reference Value for Continuous Covariates

A reference value can be used for continuous covariates in the modeling. The default value of con.model.ref is “No”.

Use Reference Value or Not

#Use (continuous variable - reference) in model
# con.model.ref <- "Yes"
#Do not use (continuous variable - reference) in model
con.model.ref <- "No" #Default is "No"

Provide Reference Value (Optional)

#The reference number of each continuous variable
#If not provided, will use median by default
#con.ref<- list(ref = c(AGE = 35, #variable name = reference value
#                       BWT = 70),
#               #already.adjusted.in.data is always F, unless the values are already centered in the data set 
#               already.adjusted.in.data = F)
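As a worked illustration of what "continuous variable - reference" means, the sketch below centers two covariates on the reference values and shows the median fallback. It is written under the assumption that centering is a plain subtraction, as the comments above suggest; the data frame `dat` and the object `centered` are made up for this example and are not produced by the package.

```r
# Toy data, for illustration only
dat <- data.frame(AGE = c(25, 35, 50, 62),
                  BWT = c(55, 70, 82, 91))

con.ref <- list(ref = c(AGE = 35, BWT = 70),
                already.adjusted.in.data = FALSE)

centered <- dat
if (!con.ref$already.adjusted.in.data) {
  for (v in names(con.ref$ref)) {
    # subtract the user-supplied reference from every value
    centered[[v]] <- dat[[v]] - con.ref$ref[[v]]
  }
}
centered$AGE                 # -10   0  15  27

# With no reference supplied, the median would be used instead
dat$AGE - median(dat$AGE)    # -17.5  -7.5   7.5  19.5
```

A subject with AGE equal to the reference has a centered value of 0, so the model intercept then describes a subject at the reference values.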

5.11 Other Options

Most of the options in this section take a numeric value or a binary indicator, given either as TRUE/FALSE or as “Yes”/“No”.

5.11.1 Demographic Summary-by Variable

demog_grp_var can be any categorical variable in the data set. The default value is “PROT”. An error occurs if demog_grp_var is missing and “PROT” is not a column in the data set.

# Grouping variable for demographic summaries
demog_grp_var <- "PROT" #Default is "PROT"
# demog_grp_var <- "SEX"

5.11.2 Additional Columns to Include In Modeling Dataset

The saved modeling data set contains only the necessary columns (ID, exposure metrics, event-related columns, endpoint-related columns, covariates, and any additional columns specified by the user).

###Additional columns to carry with
###Could be missing
add_col <- c("DOSE")

5.11.3 Threshold(%) for Incidence Rate

An endpoint with an incidence rate lower than the threshold will be ignored in the analysis.

# threshold percentage for considering endpoints
p.yes.low <- 5 #Default value 10, must be between 0 and 100
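The effect of the threshold can be sketched on toy data. This example assumes that the incidence rate is the percentage of records with DV equal to 1 within each endpoint; the package may compute it differently (for example per subject). The data frame `toy` and the objects `incidence` and `kept` are made up for this illustration, and the default threshold of 10 is used.

```r
# Toy data: 20 records per endpoint, with 1, 6 and 4 events respectively
toy <- data.frame(
  FLAG = rep(c(1, 2, 3), each = 20),
  DV   = c(rep(c(1, 0), times = c(1, 19)),
           rep(c(1, 0), times = c(6, 14)),
           rep(c(1, 0), times = c(4, 16)))
)

p.yes.low <- 10  # threshold in percent

# Incidence rate (%) per endpoint
incidence <- tapply(toy$DV, toy$FLAG, function(x) 100 * mean(x == 1))
round(incidence, 1)

# Endpoints at or above the threshold are kept; the others are ignored
kept <- names(incidence)[incidence >= p.yes.low]
kept   # "2" "3"
```

Here endpoint 1 falls below the threshold and is dropped, which mirrors the example analysis where only 2 of the 3 endpoints were analyzed.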

5.11.4 Exposure Metric Selection Criteria

The exposure metric is selected for each endpoint using the same criterion, which is chosen by the user. useDeltaD and p_val are the key values.

Select the exposure metric that meets the significance level:

# use p-value instead of Delta D
useDeltaD <- FALSE #Default value FALSE
# significance level for an exposure metric to stay in the base model
p_val <- 0.01 #Default significance level 0.01

If multiple exposure metrics meet the significance level, the one with the smallest \(p\)-value is selected. If no exposure metric meets the significance level, the base model contains no exposure metric.

Or, select the exposure metric with the largest change in deviance \(\Delta D\), regardless of the significance level:

# use Delta D instead of p-value 
useDeltaD <- TRUE #Default value FALSE

If useDeltaD and p_val are missing from the control script, the exposure metric is selected by default as the one that meets the 0.01 significance level.
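The two selection rules can be sketched as follows. This is not the package's implementation: it assumes univariate logistic regression fitted with `glm()`, takes \(\Delta D\) as the null deviance minus the residual deviance, and derives the \(p\)-value from a likelihood-ratio test on 1 degree of freedom. The function `select_exposure()`, the simulated data frame `toy`, and the strict comparison `p < p_val` are all assumptions made for this illustration.

```r
set.seed(1)
n <- 200
toy <- data.frame(CAVE1 = rlnorm(n, meanlog = 3, sdlog = 0.5),
                  CAVE2 = rlnorm(n, meanlog = 3, sdlog = 0.5))
# Simulated outcome that depends on CAVE1 only
toy$DV <- rbinom(n, size = 1, prob = plogis(-3 + 0.08 * toy$CAVE1))

exposures <- c("CAVE1", "CAVE2")
fits <- lapply(exposures, function(x) {
  glm(reformulate(x, response = "DV"), family = binomial, data = toy)
})
deltaD <- sapply(fits, function(f) f$null.deviance - f$deviance)
p.lrt  <- pchisq(deltaD, df = 1, lower.tail = FALSE)
names(deltaD) <- names(p.lrt) <- exposures

# Hypothetical selection rule mirroring useDeltaD and p_val
select_exposure <- function(deltaD, p, useDeltaD = FALSE, p_val = 0.01) {
  if (useDeltaD) return(names(which.max(deltaD)))  # largest Delta D, no test
  ok <- p[p < p_val]                               # metrics meeting the level
  if (length(ok) == 0) return(NA_character_)       # base model has no exposure
  names(which.min(ok))                             # smallest p-value wins
}

select_exposure(deltaD, p.lrt)                    # p-value rule
select_exposure(deltaD, p.lrt, useDeltaD = TRUE)  # Delta D rule
```

Under these assumptions the practical difference between the rules is that the \(\Delta D\) rule always returns a metric, while the \(p\)-value rule can return none.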

5.11.5 Covariates Search

# Looking for covariates in modeling
analyze_covs <- "Yes" #Default value "Yes" if at least one covariate provided; otherwise "No"

# threshold(%) for missing value proportion in continuous covariates
# the variable will be ignored if the missing value is higher than the threshold
p.icon <- 10 #Default value 10, must be between 0 and 100

# threshold(%) for missing value proportion in categorical covariates
# the variable will be ignored if the missing value is higher than the threshold
p.icat <- 10 #Default value 10, must be between 0 and 100
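The missing-value thresholds can be illustrated with a short sketch. It assumes the proportion is simply the percentage of NA values in each column; the data frame `toy` and the objects `pct.missing` and `kept` are made up for this example.

```r
# Toy data: AGE has 1 of 20 values missing, BWT has 4 of 20 missing
toy <- data.frame(AGE = c(NA, seq(30, 48)),
                  BWT = c(rep(NA, 4), seq(60, 75)))

p.icon <- 10  # threshold in percent for continuous covariates

# Percentage of missing values per covariate
pct.missing <- sapply(toy, function(x) 100 * sum(is.na(x)) / length(x))
pct.missing   # AGE = 5, BWT = 20

# A covariate is ignored when its missing percentage is higher than the threshold
kept <- names(pct.missing)[pct.missing <= p.icon]
kept          # "AGE"
```

The same calculation applies to categorical covariates with p.icat in place of p.icon.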

5.11.6 Backwards Deletion Criteria

The final model is obtained via backwards deletion at the significance level specified by the user.

#Significance level for a variable to stay in the final model
p_val_b <- 0.01 #Default significance level 0.01
#Exposure metric will stay in the final model regardless of the significance level.
exclude.exp.met.bd <- "Yes" #Default value "Yes"

If p_val_b and exclude.exp.met.bd are missing, backwards deletion is conducted at the 0.01 significance level and the selected exposure metric (if any) is not removed. p_val_b must be no lower than 0.0001.
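A generic backwards deletion loop with a protected exposure metric can be sketched as below. This is an illustration under assumptions, not the package's code: it uses logistic regression, likelihood-ratio tests from `drop1()`, and removes the term with the largest \(p\)-value at each step. The function `backward_delete()` and the simulated data are hypothetical.

```r
# Hypothetical helper: drop the least significant term until all remaining
# terms (other than those in `keep`) have p <= p_val_b
backward_delete <- function(fit, keep = character(0), p_val_b = 0.01) {
  repeat {
    tab <- drop1(fit, test = "Chisq")
    tab <- tab[-1, , drop = FALSE]                        # remove the "<none>" row
    tab <- tab[!rownames(tab) %in% keep, , drop = FALSE]  # protect e.g. the exposure
    if (nrow(tab) == 0) break
    worst <- which.max(tab[["Pr(>Chi)"]])
    if (tab[["Pr(>Chi)"]][worst] <= p_val_b) break        # everything left is significant
    new.formula <- update(formula(fit),
                          as.formula(paste(". ~ . -", rownames(tab)[worst])))
    fit <- glm(new.formula, family = family(fit), data = fit$data)
  }
  fit
}

set.seed(2)
n <- 300
toy <- data.frame(CAVE1 = rlnorm(n, meanlog = 3, sdlog = 0.5),
                  AGE   = rnorm(n, mean = 50, sd = 10),
                  SEX   = factor(sample(c("Male", "Female"), n, replace = TRUE)))
toy$DV <- rbinom(n, size = 1,
                 prob = plogis(-3 + 0.08 * toy$CAVE1 + (toy$SEX == "Female")))

full  <- glm(DV ~ CAVE1 + AGE + SEX, family = binomial, data = toy)
final <- backward_delete(full, keep = "CAVE1", p_val_b = 0.01)
formula(final)
```

Passing `keep = "CAVE1"` plays the role of exclude.exp.met.bd set to "Yes": the exposure metric is never a candidate for removal, whatever its \(p\)-value.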

5.11.7 Exposure Metrics and Continuous Covariates Scales

If the user chooses to consider scale transformations for exposure metrics or for continuous covariates, the transformed scales of each variable (if any) are compared with the original scale, and only one scale of each variable is kept in the data set.

For the exposure metrics, each scale of each metric is assessed in a univariate model, and the one that meets the exposure metric selection criteria is retained in the model.

For the continuous covariates, the scale with better normality is selected from the original scale and the log-transformed scale. If the number of subjects is between 3 and 5000, the Shapiro-Wilk normality test is used; otherwise, the Anderson-Darling test is used.

# Taking log of exposure metrics
log_exp <- "Yes" #Default value "Yes"
# Taking sqrt of exposure metrics
sqrt_exp <- "Yes" #Default value "Yes"
# Taking log of continuous covariates
log_covs <- "Yes" #Default value "Yes"

In the log-transformation, values of zero are replaced with 0.0001 before the logarithm is taken.
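The scale choice for a continuous covariate can be sketched with base R. The text does not say how "better normality" is measured, so this example assumes the scale with the larger Shapiro-Wilk W statistic wins; `choose_scale()` is a hypothetical helper, and only the Shapiro-Wilk branch (3 to 5000 subjects) is covered. The Anderson-Darling test is not in base R and would need an add-on package such as nortest.

```r
# Hypothetical helper: pick "original" or "log" by the Shapiro-Wilk W statistic
choose_scale <- function(x, zero.replace = 0.0001) {
  x <- x[!is.na(x)]
  stopifnot(length(x) >= 3, length(x) <= 5000)    # range supported by shapiro.test
  x.log <- log(ifelse(x == 0, zero.replace, x))   # zeros replaced before the log
  w.orig <- shapiro.test(x)$statistic
  w.log  <- shapiro.test(x.log)$statistic
  if (w.log > w.orig) "log" else "original"
}

set.seed(42)
skewed <- rlnorm(300, meanlog = 4, sdlog = 0.6)  # right-skewed toy covariate
choose_scale(skewed)                             # "log"
```

For a right-skewed covariate such as this simulated one, the log scale is much closer to normal and is the one selected.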

5.11.8 Odds-Ratio Results

Odds-ratio results for categorical covariates are generated and included in the report by default.

To include continuous covariates, the user needs to specify a vector of percentiles, OR_con_perc; the odds ratio of the change from the median/reference value to each percentile is then shown in tables and figures. For example, the odds ratio for Age is shown as the odds ratio of a decrease in Age from the median to the 25th percentile and of an increase in Age from the median to the 75th percentile, instead of the odds ratio of a 1-unit increase in Age.

#Odds Ratios plot for Continuous Variable
#Compare from low-th percentile to up-th percentile
#If any of them not specified, continuous variable will be removed from OR plots
OR_con_perc <- c(0.25,0.75) #values must be between 0 and 1

#Include Tables in the report
OR_tab <- "No" #Default value "Yes"

#Include Figures in the report
OR_fig <- "Yes" #Default value "Yes"
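The percentile-based odds ratio can be worked through numerically. In a logistic model, the odds ratio for moving a covariate from the reference to a percentile value is the exponential of the coefficient times that change. The sketch below uses a made-up Age vector and a made-up coefficient `beta_age`; neither comes from the package or the example data.

```r
age <- seq(20, 80, by = 5)   # toy Age values, median 50
beta_age <- 0.03             # hypothetical log-odds change per year of Age

OR_con_perc <- c(0.25, 0.75)
ref <- median(age)
q <- quantile(age, probs = OR_con_perc, names = FALSE)  # 35 and 65

# Odds ratio of moving from the median to each percentile
or <- exp(beta_age * (q - ref))
round(or, 2)   # about 0.64 (median to 25th) and 1.57 (median to 75th)

# For comparison, the odds ratio of a 1-unit increase
round(exp(beta_age), 2)   # about 1.03
```

Reporting the change between percentiles gives an effect size on the scale of the observed data, whereas the 1-unit odds ratio is often too close to 1 to be informative.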

5.11.9 Saved table format

Tables can be saved in LaTeX format (.tex) or in tab-separated format (.tsv). The LaTeX format is easy to import or insert into a LaTeX file, while the tab-separated format is easy to copy and paste into Excel and then use in a Word document. Either format can be used in the auto-generated report via ReportPoisson().

LaTex.table <- TRUE #Default value FALSE
