Save and deploy models via DeployLasso, DeployRandomForest, or DeployLinearMixedModel
What is this?
These classes let one save and deploy custom models on varied datasets via the following workflow:
- Using the model development functions, you found a model that performs well.
- Now, you train and save model on your entire dataset (with the
useSavedModelargument set to FALSE). - Next, flip the
useSavedModelargument to TRUE and rerun the script however often you need to generate new predictions.- Now that you're using a saved model, you're just running new people/encounters against the saved model to generate predictions.
- Retrain the model whenever significant changes occur with the data (perhaps quarterly) by flipping the
useSavedModelto FALSE and go to step 3.
One can do both classification (ie, predict Y or N) as well as regression (ie, predict a numeric field).
Is any dataset ready for model creation and deployment?
Nope. It'll help if you can follow these guidelines:
- Don't use 0 or 1 for the independent variable when doing classification. Use Y/N instead. The IIF function in T-SQL may help here.
- Create a column thath as
Yfor those rows in the training set andNfor those rows in the test set. Think of the test set as those people or enounters that need a prediction. This column can be called InTestWindow. - Unlike the development step (which you should have already completed), you should now pull in both training and test rows in your query.
- One has to create a table to receive the predicted values. You can work in SSMS (or SAMD, for those using Health Catalyst products):
- Create these tables when doing classification or regression, respectively:
CREATE TABLE [SAM].[dbo].[HCRDeployClassificationBASE] (
[BindingID] [int] ,
[BindingNM] [varchar] (255),
[LastLoadDTS] [datetime2] (7),
[PatientEncounterID] [decimal] (38, 0),
[PredictedProbNBR] [decimal] (38, 2),
[Factor1TXT] [varchar] (255),
[Factor2TXT] [varchar] (255),
[Factor3TXT] [varchar] (255)
)
CREATE TABLE [SAM].[dbo].[HCRDeployRegressionBASE] (
[BindingID] [int],
[BindingNM] [varchar] (255),
[LastLoadDTS] [datetime2] (7),
[PatientEncounterID] [decimal] (38, 0),
[PredictedValueNBR] [decimal] (38, 2),
[Factor1TXT] [varchar] (255),
[Factor2TXT] [varchar] (255),
[Factor3TXT] [varchar] (255)
)
How can I improve my model performance?
Note these preprocessing steps should first be tested and found useful in the development step.
- If you have lots of NULL values, you may want to turn on imputation via the
imputeargument (see below). - If you have lots of NULL cells and your data is longitudinal, you may want to try GroupedLOCF.
- If you think the phenomenon you're trying to predict has a seasonal or diurnal component, you may need some feature engineering.
- If your data is longitudinal, you may want to try the `LinearMixedModelDeployment`` (detailed below).
Step 1: Pull in the data via selectData
-
Return: a data frame that represents your data.
-
Arguments:
- server: a server name. You'll pull data from this server.
- database: a database name. You'll pull data from this database.
library(healthcareai)
connection.string = "
driver={SQL Server};
server=localhost;
database=SAM;
trusted_connection=true
"
query = "
SELECT
[OrganizationLevel]
,[InTestWindowFLG]
,[MaritalStatus]
,[Gender]
,IIF([SalariedFlag]=0,'N','Y') AS SalariedFlag
,[VacationHours]
,[SickLeaveHours]
FROM [AdventureWorks2012].[HumanResources].[Employee]
"
df <- selectData(connection.string, query)
head(df)
str(df)
Note: if you want a CSV example (ie, an example that you can run as-is), see the built-in docs:
library(healthcareai)
?healthcareai
Step 2: Set your parameters via SupervisedModelParameters
- Return: an object representing your specific configuration.
- Arguments:
- df: a data frame. The data your model is based on.
- type: a string. This will either be 'classification' or 'regression'.
- impute: a boolean, defaults to FALSE. Whether to impute by replacing NULLs with column mean (for numeric columns) or column mode (for categorical columns).
- grainCol: a string, defaults to None. Name of possible GrainID column in your dataset. If specified, this column will be removed, as it won't help the algorithm.
- testWindowCol: a string. Name of utility column used to indicate whether rows are in train or test set. Recall that test set receives predictions.
- predictedCol: a string. Name of variable (or column) that you want to predict.
- debug: a boolean, defaults to FALSE. If TRUE, console output when comparing models is verbose for easier debugging.
- useSavedModel: a boolean, defaults to FALSE. If TRUE, use the model that has been saved to disk in the current working directory (WD). If FALSE, save a new model to disk in the current WD. Use
getwd()in the console to check WD. - cores: an int, defaults to 4. Number of cores on machine to use for model training.
- sqlConn: a string. Specifies the driver, server, database, and whether you're using a trusted connection (which is preferred).
- destSchemaTable : a string. Denotes the output schema and table (separated by a period) where the predictions should be pushed.
p <- DeploySupervisedModelParameters$new()
p$df = df
p$type = 'classification'
p$impute = TRUE
p$grainCol = 'GrainID'
p$testWindowCol = 'InTestWindow'
p$predictedCol = 'SalariedFlag'
p$debug = FALSE
p$useSavedModel = FALSE
p$cores = 1
p$sqlConn = connection.string
p$destSchemaTable = 'dbo.HCRDeployClassificationBASE'
Step 3: Create the models via the DeployLasso or DeployRandomForest algorithms.
# Run Lasso (if that's what performed best in the develop step)
dL <- LassoDeployment$new(p)
dL$deploy()
# Or run RandomForest (if that's what performed best in the develop step)
dL <- RandomForestDeployment$new(p)
dL$deploy()
# Or run Linear Mixed Model (if that's what performed best in the develop step)
p$personCol = 'PatientID' # Change to your PatientID col
lMM <- LinearMixedModelDeployment$new(p)
lMM$deploy()
Full example code
ptm <- proc.time()
library(healthcareai)
connection.string = "
driver={SQL Server};
server=localhost;
database=AdventureWorks2012;
trusted_connection=true
"
query = "
SELECT
[PatientEncounterID]
,[PatientID]
,[SystolicBPNBR]
,[LDLNBR]
,[A1CNBR]
,[GenderFLG]
,[ThirtyDayReadmitFLG]
FROM [SAM].[dbo].[DiabetesClinical]
"
df <- selectData(connection.string, query)
head(df)
# Remove unnecessary columns
df$PatientID <- NULL
p <- SupervisedModelDeploymentParams$new()
p$type = 'classification'
p$df = df
p$grainCol = 'PatientEncounterID'
p$testWindowCol = 'InTestWindowFLG'
p$predictedCol = 'ThirtyDayReadmitFLG'
p$impute = TRUE
p$debug = FALSE
p$useSavedModel = FALSE
p$cores = 1
p$sqlConn = connection.string
p$destSchemaTable = 'dbo.HCRDeployClassificationBASE'
# If Lasso was more accurate in the dev step
dL <- LassoDeployment$new(p)
dL$deploy()
# If Random Forest was more accurate in the dev step
#dL <- RandomForestDeployment$new(p)
#dL$deploy()
print(proc.time() - ptm)
Relevant example code:
p <- SupervisedModelParameters$new()
p$df = df
p$type = 'classification'
p$impute = TRUE
p$grainCol = 'PatientEncounterID' # This grain of the dataset (required)
p$personCol = 'PatientID' # This represents the person (required)
p$predictedCol = 'HighA1C'
p$debug = FALSE
p$cores = 1
lMM <- LinearMixedModelDeployment$new(p)
lMM$deploy()
Note: if you want a CSV example (ie, an example that you can run as-is), see the built-in docs:
library(healthcareai)
?healthcareai