Deploy models and make new predictions via LassoDeployment, RandomForestDeployment, or LinearMixedModelDeployment

What is this?

These classes let you deploy custom models on varied datasets via the following workflow:

  1. Using the model development functions, you found a model that performs well and saved it automatically.
  2. Now, run the deploy methods as often as needed; they load the saved model and generate predictions for new people/encounters.
  3. Retrain the model via model development whenever significant changes occur in the data (perhaps quarterly).

You can do both classification (ie, predict Y or N) and regression (ie, predict a numeric field).

Is any dataset ready for model creation and deployment?

Nope. It'll help if you can follow these guidelines:

  • Don't use 0 or 1 for the dependent variable (the column you're predicting) when doing classification. Use 'Y'/'N' instead. The IIF function in T-SQL may help here; if you'd rather recode in R after pulling the data, see the sketch below the table definitions.
  • Create a column called InTestWindow that has 'Y' for those people or encounters that you want a prediction generated for and 'N' for rows to ignore.
  • Unlike the development step (which you should have already completed), you now only need to select test rows in your query.
  • Predictions on test rows can be output to a data frame or directly to an MSSQL table. If using a table, you first have to create the table that will receive the predicted values. You can work in SSMS (or SAMD, for those using Health Catalyst products):
    • Create these tables when doing classification or regression, respectively:
CREATE TABLE [SAM].[dbo].[HCRDeployClassificationBASE] (
  [BindingID] [int],
  [BindingNM] [varchar](255),
  [LastLoadDTS] [datetime2](7),
  [PatientEncounterID] [decimal](38, 0),
  [PredictedProbNBR] [decimal](38, 2),
  [Factor1TXT] [varchar](255),
  [Factor2TXT] [varchar](255),
  [Factor3TXT] [varchar](255)
)

CREATE TABLE [SAM].[dbo].[HCRDeployRegressionBASE] (
  [BindingID] [int],
  [BindingNM] [varchar](255),
  [LastLoadDTS] [datetime2](7),
  [PatientEncounterID] [decimal](38, 0),
  [PredictedValueNBR] [decimal](38, 2),
  [Factor1TXT] [varchar](255),
  [Factor2TXT] [varchar](255),
  [Factor3TXT] [varchar](255)
)
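
If your predicted column arrives as 0/1 and you can't (or don't want to) recode it in the query with IIF, a minimal R sketch like this does the recoding after the pull (the column names here are purely illustrative):

# Recode a hypothetical 0/1 outcome column to 'N'/'Y' after pulling the data
df$PredictedColFLG <- ifelse(df$PredictedColFLG == 0, 'N', 'Y')

# One common convention: rows whose outcome is still unknown (NA) are the
# rows that should receive predictions
df$InTestWindow <- ifelse(is.na(df$PredictedColFLG), 'Y', 'N')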

How can I improve my model performance?

Note that these preprocessing steps should first be tested and found useful in the development step.

  • If you have lots of NULL values, you may want to turn on imputation via the impute argument (see below).
  • If you have lots of NULL cells and your data is longitudinal, you may want to try GroupedLOCF.
  • If you think the phenomenon you're trying to predict has a seasonal or diurnal component, you may need some feature engineering (see the sketch after this list).
  • If your data is longitudinal, you may want to try the LinearMixedModelDeployment (detailed below).
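
For example, a small sketch like this (assuming a hypothetical AdmitDTS timestamp column) adds month and hour-of-day features before the data frame is handed to the parameter object:

# Hypothetical feature engineering: derive seasonal/diurnal features from a timestamp
df$AdmitMonthNBR <- as.integer(format(as.POSIXct(df$AdmitDTS), "%m"))
df$AdmitHourNBR  <- as.integer(format(as.POSIXct(df$AdmitDTS), "%H"))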

Step 1: Pull in the data via selectData

  • Return: a data frame that represents your data.

  • Arguments:

    • connection.string: a string. Specifies the driver, server, database, and whether you're using a trusted connection (which is preferred).
    • query: a string. The SQL query that selects your data.
library(healthcareai)

connection.string <- "
driver={SQL Server};
server=localhost;
database=SAM;
trusted_connection=true
"

query <- "
SELECT
[OrganizationLevel]
,[InTestWindowFLG]
,[MaritalStatus]
,[Gender]
,IIF([SalariedFlag]=0,'N','Y') AS SalariedFlag
,[VacationHours]
,[SickLeaveHours]
FROM [AdventureWorks2012].[HumanResources].[Employee]
"

df <- selectData(connection.string, query)
head(df)
str(df)

Note: if you want a CSV example (ie, an example that you can run as-is), see the built-in docs:

library(healthcareai)
?healthcareai

Step 2: Set your parameters via SupervisedModelDeploymentParams

  • Return: an object representing your specific configuration.
  • Arguments:
    • df: a data frame. The data your model is based on.
    • type: a string. This will either be 'classification' or 'regression'.
    • impute: a boolean, defaults to FALSE. Whether to impute by replacing NULLs with column mean (for numeric columns) or column mode (for categorical columns).
    • grainCol: a string, defaults to NULL. Name of possible GrainID column in your dataset (eg, PatientEncounterID). It isn't used as a predictor, since an ID won't help the algorithm.
    • testWindowCol: a string. Name of utility column used to indicate whether rows are in the train or test set. Recall that the test set receives predictions.
    • predictedCol: a string. Name of the variable (or column) that you want to predict.
    • writeToDB: a boolean, defaults to TRUE. If TRUE, predictions will be written to the destination table.
    • debug: a boolean, defaults to FALSE. If TRUE, console output during deployment is verbose for easier debugging.
    • cores: an int, defaults to 4. Number of cores on the machine to use for model training.
    • sqlConn: a string. Specifies the driver, server, database, and whether you're using a trusted connection (which is preferred).
    • destSchemaTable: a string. Denotes the output schema and table (separated by a period) where the predictions should be pushed.
p <- SupervisedModelDeploymentParams$new()
p$df <- df
p$type <- 'classification'
p$impute <- TRUE
p$grainCol <- 'GrainID'
p$testWindowCol <- 'InTestWindow'
p$predictedCol <- 'SalariedFlag'
p$debug <- FALSE
p$cores <- 1
p$sqlConn <- connection.string
p$destSchemaTable <- 'dbo.HCRDeployClassificationBASE'

Step 3: Deploy the model via LassoDeployment, RandomForestDeployment, or LinearMixedModelDeployment.

# Run Lasso (if that's what performed best in the develop step)
dL <- LassoDeployment$new(p)
dL$deploy()

# Or run RandomForest (if that's what performed best in the develop step)
dL <- RandomForestDeployment$new(p)
dL$deploy()

# Or run Linear Mixed Model (if that's what performed best in the develop step)

p$personCol <- 'PatientID' # Change to your PatientID col
lMM <- LinearMixedModelDeployment$new(p)
lMM$deploy()
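
Whichever algorithm you deploy, the predictions can also be pulled back into R as a data frame via getOutDf (the same method used in the CSV example further down), which is handy for a quick sanity check:

# Grab the predictions (and top factors) as a data frame for inspection
outDf <- dL$getOutDf() # or lMM$getOutDf() if you deployed the mixed model
head(outDf)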

Full example code for SQL Server

#### Classification example using SQL Server data ####
# This example requires you to first create a table in SQL Server
# If you prefer to not use SAMD, execute this in SSMS to create output table:
# CREATE TABLE dbo.HCRDeployClassificationBASE(
#   BindingID float, BindingNM varchar(255), LastLoadDTS datetime2,
#   PatientEncounterID int, <--change to match inputID
#   PredictedProbNBR decimal(38, 2),
#   Factor1TXT varchar(255), Factor2TXT varchar(255), Factor3TXT varchar(255)
# )

## 1. Loading data and packages.
ptm <- proc.time()
library(healthcareai)

connection.string <- "
driver={SQL Server};
server=localhost;
database=SAM;
trusted_connection=true
"

query <- "
SELECT
[PatientEncounterID] --Only need one ID column for random forest
,[SystolicBPNBR]
,[LDLNBR]
,[A1CNBR]
,[GenderFLG]
,[ThirtyDayReadmitFLG]
,[InTestWindowFLG]
FROM [SAM].[dbo].[HCRDiabetesClinical]
"

df <- selectData(connection.string, query)

head(df)
str(df)

## 2. Train and save the model using DEVELOP
set.seed(42)
inTest <- df$InTestWindowFLG # save this for deploy
df$InTestWindowFLG <- NULL

p <- SupervisedModelDevelopmentParams$new()
p$df <- df
p$type <- "classification"
p$impute <- TRUE
p$grainCol <- "PatientEncounterID"
p$predictedCol <- "ThirtyDayReadmitFLG"
p$debug <- FALSE
p$cores <- 1

# Run RandomForest
RandomForest <- RandomForestDevelopment$new(p)
RandomForest$run()

## 3. Load saved model and use DEPLOY to generate predictions. 
df$InTestWindowFLG <- inTest # put InTestWindowFLG back in.

p2 <- SupervisedModelDeploymentParams$new()
p2$type <- "classification"
p2$df <- df
p2$grainCol <- "PatientEncounterID"
p2$testWindowCol <- "InTestWindowFLG"
p2$predictedCol <- "ThirtyDayReadmitFLG"
p2$impute <- TRUE
p2$debug <- FALSE
p2$cores <- 1
p2$sqlConn <- connection.string
p2$destSchemaTable <- "dbo.HCRDeployClassificationBASE"

# Assuming Random Forest was more accurate in the development step.
dL <- RandomForestDeployment$new(p2)
dL$deploy()

print(proc.time() - ptm)
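
To confirm that the predictions actually landed in the destination table, one simple check is to query it back with selectData (a sketch; adjust the table name if you changed destSchemaTable):

# Sketch: pull back the most recently loaded predictions for a quick sanity check
checkQuery <- "
SELECT TOP 10 *
FROM [SAM].[dbo].[HCRDeployClassificationBASE]
ORDER BY LastLoadDTS DESC
"
predictions <- selectData(connection.string, checkQuery)
head(predictions)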

Full example code for reading (and pushing predictions to) a CSV

This example mirrors the SQL Server workflow above, but reads from (and writes back to) a CSV. Note two differences in the deploy arguments: writeToDB is set to FALSE, and the predictions are pulled into a data frame via getOutDf so you can write them wherever you like.

#### Classification Example using csv data ####
## 1. Loading data and packages.
ptm <- proc.time()
library(healthcareai)

# setwd('C:/Yourscriptlocation/Useforwardslashes') # Uncomment if using csv

# Can delete this line in your work
csvfile <- system.file("extdata", 
                    "HCRDiabetesClinical.csv", 
                    package = "healthcareai")

# Replace csvfile with 'path/file'
df <- read.csv(file = csvfile, 
            header = TRUE, 
            na.strings = c("NULL", "NA", ""))

head(df)
str(df)

## 2. Train and save the model using DEVELOP
df$PatientID <- NULL
inTest <- df$InTestWindowFLG # save this for later.
df$InTestWindowFLG <- NULL

set.seed(42)
p <- SupervisedModelDevelopmentParams$new()
p$df <- df
p$type <- "classification"
p$impute <- TRUE
p$grainCol <- "PatientEncounterID"
p$predictedCol <- "ThirtyDayReadmitFLG"
p$debug <- FALSE
p$cores <- 1

# Run RandomForest
RandomForest <- RandomForestDevelopment$new(p)
RandomForest$run()

## 3. Load saved model and use DEPLOY to generate predictions. 
df$InTestWindowFLG <- inTest
p2 <- SupervisedModelDeploymentParams$new()
p2$type <- "classification"
p2$df <- df
p2$testWindowCol <- "InTestWindowFLG"
p2$grainCol <- "PatientEncounterID"
p2$predictedCol <- "ThirtyDayReadmitFLG"
p2$impute <- TRUE
p2$debug <- FALSE
p2$cores <- 1
p2$writeToDB <- FALSE

# Assuming Random Forest was more accurate in the development step.
dL <- RandomForestDeployment$new(p2)
dL$deploy()

df <- dL$getOutDf()
# Write to CSV (or JSON, MySQL, etc) using plain R syntax
# write.csv(df,'path/predictionsfile.csv')

print(proc.time() - ptm)

Linear Mixed Model (small datasets with a longitudinal flavor)

p2 <- SupervisedModelDeploymentParams$new()
p2$type <- "classification"
p2$df <- df
p2$grainCol <- "PatientEncounterID" # Unique ID for each appointment
p2$personCol <- "PatientID" # Could be multiple visits per patient
p2$testWindowCol <- "InTestWindowFLG"
p2$predictedCol <- "ThirtyDayReadmitFLG"
p2$impute <- TRUE
p2$debug <- FALSE
p2$cores <- 1
p2$sqlConn <- connection.string
p2$destSchemaTable <- "dbo.HCRDeployClassificationBASE"

dL <- LinearMixedModelDeployment$new(p2)
dL$deploy()

Note: if you need to see the built-in docs (which are always up-to-date):

library(healthcareai)
?healthcareai