{ "cells": [ { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "58fab4bb-231e-48cf-8ed4-fc15a1b22845", "showTitle": false, "title": "" } }, "source": [ "
This notebook adds information related to the following requirements:
Download this notebook in ipynb format here.
\n", "mlflow.sklearn
mlflow.spark
mlflow.keras
The python_function
or pyfunc
flavor provides a generic way of bundling models.
pyfunc
is a generic python model that can define arbitrary logic, regardless of the libraries used to train it. Log two models to MLflow using mlflow.pyfunc.log_model
:
then use them both later in the same way for prediction, with the mlflow.pyfunc.load_model()
function.
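The import cell of the original notebook is not reproduced in this extract. A minimal sketch of the imports the cells below rely on (assuming standard package names) could be:

```python
# Sketch of an import cell (assumed; not shown in this extract)
import json
import pandas as pd
import seaborn as sns
import sklearn
import xgboost
import mlflow
import mlflow.pyfunc
from sys import version_info
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from mlflow.models.signature import infer_signature
```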
Prepare train and test sets:
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "22e52c7e-16d0-4038-a862-831fcc9af0d7", "showTitle": false, "title": "" } }, "outputs": [], "source": [ "diamonds_df = sns.load_dataset('diamonds').drop(['cut', 'color', 'clarity'], axis=1)\n", "#\n", "X_train, X_test, y_train, y_test = train_test_split(diamonds_df.drop([\"price\"], axis=1), diamonds_df[\"price\"], random_state=42)" ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": {}, "inputWidgets": {}, "nuid": "68a8a6ab-9cec-4179-8a52-5ca828c369e9", "showTitle": false, "title": "" } }, "source": [ "Definition of custom scikit-learn model:
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "81984040-0ed5-42db-b176-93ee6a54c791", "showTitle": false, "title": "" } }, "outputs": [], "source": [ "class sklearn_model(mlflow.pyfunc.PythonModel):\n", "\n", " def __init__(self, params):\n", " \"\"\" Initialize with just the model hyperparameters \"\"\"\n", " #\n", " self.params = params\n", " self.rf_model = None\n", " self.config = None\n", " \n", " def load_context(self, context=None, config_path=None):\n", " \"\"\" When loading a pyfunc, this method runs automatically with the related\n", " context. This method is designed to perform the same functionality when\n", " run in a notebook or a downstream operation (like a REST endpoint).\n", " If the `context` object is provided, it will load the path to a config from \n", " that object (this happens with `mlflow.pyfunc.load_model()` is called).\n", " If the `config_path` argument is provided instead, it uses this argument\n", " in order to load in the config. \"\"\"\n", " #\n", " if context: # This block executes for server run\n", " config_path = context.artifacts[\"config_path\"]\n", " else: # This block executes for notebook run\n", " pass\n", "\n", " self.config = json.load(open(config_path))\n", " \n", " def preprocess_input(self, model_input):\n", " \"\"\" Return pre-processed model_input \"\"\"\n", " #\n", " # any preprocessing can be done there. For the example purpose, let's just apply a Robust Scaler\n", " from sklearn.preprocessing import RobustScaler\n", " #\n", " for c in list(model_input.columns):\n", " model_input[c] = RobustScaler().fit_transform(model_input[[c]])\n", " #\n", " return model_input\n", " \n", " def fit(self, X_train, y_train):\n", " \"\"\" Uses the same preprocessing logic to fit the model \"\"\"\n", " #\n", " from sklearn.ensemble import RandomForestRegressor\n", " #\n", " processed_model_input = self.preprocess_input(X_train)\n", " rf_model = RandomForestRegressor(**self.params)\n", " rf_model.fit(processed_model_input, y_train)\n", " #\n", " self.rf_model = rf_model\n", " \n", " def predict(self, context, model_input):\n", " \"\"\" This is the main entrance to the model in deployment systems \"\"\"\n", " #\n", " processed_model_input = self.preprocess_input(model_input.copy())\n", " return self.rf_model.predict(processed_model_input)" ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": {}, "inputWidgets": {}, "nuid": "bac0270f-8c69-41a7-bef4-d217a1525f0b", "showTitle": false, "title": "" } }, "source": [ "Definition of the parameters for the custom scikit-learn model:
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "cda8a381-475b-44e7-91c9-21c7de85b578", "showTitle": false, "title": "" } }, "outputs": [], "source": [ "params_sklearn = {\n", " \"n_estimators\": 15, \n", " \"max_depth\": 5\n", "}\n", "#\n", "# Designate a path\n", "config_path_sklearn = \"data_sklearn.json\"\n", "#\n", "# Save the results\n", "with open(config_path_sklearn, \"w\") as f:\n", " json.dump(params_sklearn, f)\n", "#\n", "# Generate an artifact object to saved\n", "# All paths to the associated values will be copied over when saving\n", "artifacts_sklearn = {\"config_path\": config_path_sklearn}" ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": {}, "inputWidgets": {}, "nuid": "44f30da5-8646-4c98-812b-3e0e926dde4a", "showTitle": false, "title": "" } }, "source": [ "Instantiate the scikit-learn custom model:
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "23e4dd34-2cf3-4d3e-988f-e858078a41ee", "showTitle": false, "title": "" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Out[67]: {'n_estimators': 15, 'max_depth': 5}" ] } ], "source": [ "model_sk = sklearn_model(params_sklearn)\n", "#\n", "model_sk.load_context(config_path=config_path_sklearn) \n", "#\n", "# Confirm the config has loaded\n", "model_sk.config" ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "35096555-48d2-43c4-b2b1-10d53684245a", "showTitle": false, "title": "" } }, "source": [ "Train the scikit-learn custom model on training set:
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "e136346b-8298-4267-9b8e-39f6cedcf04b", "showTitle": false, "title": "" } }, "outputs": [], "source": [ "model_sk.fit(X_train, y_train)" ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": {}, "inputWidgets": {}, "nuid": "a1ac131d-d420-45a5-af3a-80253cdd55cc", "showTitle": false, "title": "" } }, "source": [ "Verify there can be predictions on test set using scikit-learn custom model:
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "88b04fee-e220-42bb-abee-13754e948e73", "showTitle": false, "title": "" } }, "outputs": [ { "data": { "text/html": [ "\n", " | actual prices | \n", "predictions | \n", "
---|---|---|
0 | 559 | 581.063801 |
1 | 2201 | 1898.067468 |
2 | 1238 | 987.555592 |
3 | 1304 | 1026.659660 |
4 | 6901 | 10610.572961 |
Optionally, prepare model signature:
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "1e5bbfd7-479a-4439-8faa-11330fba5c14", "showTitle": false, "title": "" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Out[75]: inputs: \n", " ['carat': double, 'depth': double, 'table': double, 'x': double, 'y': double, 'z': double]\n", "outputs: \n", " [Tensor('float64', (-1,))]" ] } ], "source": [ "signature_sklearn = infer_signature(X_test, predictions_sklearn)\n", "signature_sklearn" ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": {}, "inputWidgets": {}, "nuid": "996f35a2-9a72-484c-a088-b5e467a4d6be", "showTitle": false, "title": "" } }, "source": [ "Generate the conda environment. This can be arbitrarily complex. This is necessary because when we use mlflow.sklearn, we automatically log the appropriate version of sklearn. With a pyfunc, we must manually construct our deployment environment. See more details about it in this video." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "76590fe1-6670-412d-a3b2-33f3c63bd939", "showTitle": false, "title": "" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Out[77]: {'channels': ['defaults'],\n", " 'dependencies': ['python=3.9.5',\n", " 'pip',\n", " {'pip': ['mlflow', 'scikit-learn==0.24.2']}],\n", " 'name': 'sklearn_env'}" ] } ], "source": [ "conda_env_sklearn = {\n", " \"channels\": [\"defaults\"],\n", " \"dependencies\": [\n", " f\"python={version_info.major}.{version_info.minor}.{version_info.micro}\",\n", " \"pip\",\n", " {\"pip\": [\"mlflow\",\n", " f\"scikit-learn=={sklearn.__version__}\"]\n", " },\n", " ],\n", " \"name\": \"sklearn_env\"\n", "}\n", "conda_env_sklearn" ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": {}, "inputWidgets": {}, "nuid": "bac0a6cd-3e4e-4d96-a038-510a14f01f31", "showTitle": false, "title": "" } }, "source": [ "Save the model using mlflow.pyfunc.log_model
using the parameters defined previously:
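The logging cell itself is not shown in this extract. A minimal sketch, assuming an illustrative artifact path such as "sklearn_rf_pyfunc" and the objects defined above:

```python
# Hedged sketch: the run name and artifact path are illustrative, not from the original notebook
with mlflow.start_run(run_name="sklearn_rf_pyfunc_run") as run:
    mlflow.pyfunc.log_model(
        artifact_path="sklearn_rf_pyfunc",   # folder name inside the MLflow run
        python_model=model_sk,               # the custom pyfunc model trained above
        artifacts=artifacts_sklearn,         # copies data_sklearn.json alongside the model
        conda_env=conda_env_sklearn,         # deployment environment built above
        signature=signature_sklearn,         # optional input/output schema
    )
    run_id_sklearn = run.info.run_id
```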
It is now possible to load the logged model using mlflow.pyfunc.load_model
and use it for predictions:
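The loading cell is not reproduced either; a minimal sketch, assuming the run id and artifact path from the logging sketch above:

```python
# Hedged sketch: model_uri follows the standard "runs:/<run_id>/<artifact_path>" convention
loaded_model_sklearn = mlflow.pyfunc.load_model(f"runs:/{run_id_sklearn}/sklearn_rf_pyfunc")
y_pred = loaded_model_sklearn.predict(X_test)
#
# Compare actual prices with predictions (this is what the table below displays)
pd.DataFrame({"actual prices": y_test.values, "predictions": y_pred}).head()
```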
 | actual prices | predictions |
---|---|---|
0 | 559 | 581.063801 |
1 | 2201 | 1898.067468 |
2 | 1238 | 987.555592 |
3 | 1304 | 1026.659660 |
4 | 6901 | 10610.572961 |
Let's compute the RMSE for this model:
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "c36fb8df-812e-413f-a09c-bd23f4523de4", "showTitle": false, "title": "" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "RMSE for custom scikit-learn model: 1372.2569123988917\n" ] } ], "source": [ "print(\"RMSE for custom scikit-learn model: \", mean_squared_error(y_test, y_pred, squared=False))" ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": {}, "inputWidgets": {}, "nuid": "cecdfee5-0da4-4422-8342-5084e4c51968", "showTitle": false, "title": "" } }, "source": [ "It's also possible to load the custom model as a Spark UDF using mlflow.pyfunc.spark_udf
and run predictions on a Spark DataFrame:
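The corresponding cell is not shown here; a minimal sketch, assuming a Databricks-style SparkSession named spark and the same model URI as in the logging sketch above:

```python
from pyspark.sql.functions import struct
#
# Hedged sketch: wrap the logged pyfunc model as a Spark UDF and apply it to all feature columns
predict_udf_sklearn = mlflow.pyfunc.spark_udf(spark, f"runs:/{run_id_sklearn}/sklearn_rf_pyfunc")
spark_test_df = spark.createDataFrame(X_test)
# display() is Databricks-specific; use .show() elsewhere
display(spark_test_df.withColumn("prediction", predict_udf_sklearn(struct(*spark_test_df.columns))))
```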
carat | depth | table | x | y | z | prediction |
---|---|---|---|---|---|---|
0.24 | 62.1 | 56.0 | 3.97 | 4.0 | 2.47 | 581.0638013699318 |
0.58 | 60.0 | 57.0 | 5.44 | 5.42 | 3.26 | 8459.162219485595 |
0.4 | 62.1 | 55.0 | 4.76 | 4.74 | 2.95 | 1808.3104272102398 |
0.43 | 60.8 | 57.0 | 4.92 | 4.89 | 2.98 | 2573.6626953761706 |
1.55 | 62.3 | 55.0 | 7.44 | 7.37 | 4.61 | 14959.410444117686 |
Definition of custom xgboost model:
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "67fb560f-9d95-463b-88a6-665e1b994cdd", "showTitle": false, "title": "" } }, "outputs": [], "source": [ "class xgboost_regressor(mlflow.pyfunc.PythonModel):\n", "\n", " def __init__(self, params):\n", " \"\"\" Initialize with just the model hyperparameters \"\"\"\n", " #\n", " self.params = params\n", " self.xgb_model = None\n", " self.config = None\n", " \n", " def load_context(self, context=None, config_path=None):\n", " \"\"\" When loading a pyfunc, this method runs automatically with the related\n", " context. This method is designed to perform the same functionality when\n", " run in a notebook or a downstream operation (like a REST endpoint).\n", " If the `context` object is provided, it will load the path to a config from \n", " that object (this happens with `mlflow.pyfunc.load_model()` is called).\n", " If the `config_path` argument is provided instead, it uses this argument\n", " in order to load in the config. \"\"\"\n", " #\n", " if context: # This block executes for server run\n", " config_path = context.artifacts[\"config_path\"]\n", " else: # This block executes for notebook run\n", " pass\n", "\n", " self.config = json.load(open(config_path))\n", " \n", " def preprocess_input(self, model_input):\n", " \"\"\" Return pre-processed model_input \"\"\"\n", " #\n", " # any preprocessing can be done there. For the example purpose, let's here apply a Standard Scaler\n", " from sklearn.preprocessing import StandardScaler\n", " #\n", " for c in list(model_input.columns):\n", " model_input[c] = StandardScaler().fit_transform(model_input[[c]])\n", " #\n", " return model_input\n", " \n", " def fit(self, X_train, y_train):\n", " \"\"\" Uses the same preprocessing logic to fit the model \"\"\"\n", " #\n", " from xgboost import XGBRegressor\n", " #\n", " processed_model_input = self.preprocess_input(X_train)\n", " xgb_model = XGBRegressor(**self.params)\n", " xgb_model.fit(processed_model_input, y_train)\n", " #\n", " self.xgb_model = xgb_model\n", " \n", " def predict(self, context, model_input):\n", " \"\"\" This is the main entrance to the model in deployment systems \"\"\"\n", " #\n", " processed_model_input = self.preprocess_input(model_input.copy())\n", " return self.xgb_model.predict(processed_model_input)" ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": {}, "inputWidgets": {}, "nuid": "140323f2-380a-474d-9471-b5c29a8cceb6", "showTitle": false, "title": "" } }, "source": [ "Definition of the parameters for the custom xgboost model:
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "271fbd54-06ee-403f-a5e5-c1ab7530d971", "showTitle": false, "title": "" } }, "outputs": [], "source": [ "params_xgb = {\n", " \"n_estimators\": 1000, \n", " \"max_depth\": 7,\n", " \"eta\": 0.1,\n", " \"subsample\": 0.7,\n", " \"colsample_bytree\": 0.8\n", "}\n", "\n", "# Designate a path\n", "config_path_xgb = \"data_xgb.json\"\n", "\n", "# Save the results\n", "with open(config_path_xgb, \"w\") as f:\n", " json.dump(params_xgb, f)\n", "\n", "# Generate an artifact object to saved\n", "# All paths to the associated values will be copied over when saving\n", "artifacts_xgb = {\"config_path\": config_path_xgb} " ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": {}, "inputWidgets": {}, "nuid": "4f7cdfdd-16c2-469b-98c6-c774297afbc1", "showTitle": false, "title": "" } }, "source": [ "Instantiate the xgboost custom model:
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "7ab28086-0719-4df8-a845-df589b5cd5d5", "showTitle": false, "title": "" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Out[91]: {'n_estimators': 1000,\n", " 'max_depth': 7,\n", " 'eta': 0.1,\n", " 'subsample': 0.7,\n", " 'colsample_bytree': 0.8}" ] } ], "source": [ "model_xgb = xgboost_regressor(params_xgb)\n", "#\n", "model_xgb.load_context(config_path=config_path_xgb) \n", "#\n", "# Confirm the config has loaded\n", "model_xgb.config" ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": {}, "inputWidgets": {}, "nuid": "7ddbe15e-6298-439b-957c-cd5cb210bf9b", "showTitle": false, "title": "" } }, "source": [ "Train the xgboost custom model on training set:
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "5efb1de4-19e4-4ffb-afd1-df193368efd7", "showTitle": false, "title": "" } }, "outputs": [], "source": [ "model_xgb.fit(X_train, y_train)" ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": {}, "inputWidgets": {}, "nuid": "e5906970-186c-4790-a258-f1c12cfe98b5", "showTitle": false, "title": "" } }, "source": [ "Verify there can be predictions on test set using xgboost custom model:
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "f0fbaded-da90-4b38-a4db-7778f4349014", "showTitle": false, "title": "" } }, "outputs": [ { "data": { "text/html": [ "\n", " | actual prices | \n", "predictions | \n", "
---|---|---|
0 | 559 | 524.269531 |
1 | 2201 | 1795.014160 |
2 | 1238 | 1029.636230 |
3 | 1304 | 1096.781372 |
4 | 6901 | 10364.128906 |
Optionally, prepare model signature:
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "c8cdbf39-9f42-4649-ad43-81e755a56918", "showTitle": false, "title": "" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Out[94]: inputs: \n", " ['carat': double, 'depth': double, 'table': double, 'x': double, 'y': double, 'z': double]\n", "outputs: \n", " [Tensor('float32', (-1,))]" ] } ], "source": [ "signature_xgb = infer_signature(X_test, predictions_xgb)\n", "signature_xgb" ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": {}, "inputWidgets": {}, "nuid": "e35d968c-ff8b-4051-a43a-5ca629868e07", "showTitle": false, "title": "" } }, "source": [ "Generate the conda environment. This can be arbitrarily complex. This is necessary because when we use mlflow.sklearn, we automatically log the appropriate version of sklearn. With a pyfunc, we must manually construct our deployment environment. See more details about it in this video." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "cf9a1aed-1f68-42ea-9ce9-61496ef40a05", "showTitle": false, "title": "" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Out[95]: {'channels': ['defaults'],\n", " 'dependencies': ['python=3.9.5',\n", " 'pip',\n", " {'pip': ['mlflow', 'xgboost==1.6.2']}],\n", " 'name': 'xgboost_env'}" ] } ], "source": [ "conda_env_xgb = {\n", " \"channels\": [\"defaults\"],\n", " \"dependencies\": [\n", " f\"python={version_info.major}.{version_info.minor}.{version_info.micro}\",\n", " \"pip\",\n", " {\"pip\": [\"mlflow\",\n", " f\"xgboost=={xgboost.__version__}\"]\n", " },\n", " ],\n", " \"name\": \"xgboost_env\"\n", "}\n", "conda_env_xgb" ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": {}, "inputWidgets": {}, "nuid": "cdc88282-cb0c-46e5-8ac3-e4c2c8e37ec0", "showTitle": false, "title": "" } }, "source": [ "Save the model using mlflow.pyfunc.log_model
using the parameters defined previously:
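As with the scikit-learn model, the logging cell is not shown; a minimal sketch, with an illustrative artifact path such as "xgb_pyfunc":

```python
# Hedged sketch: the run name and artifact path are illustrative
with mlflow.start_run(run_name="xgb_pyfunc_run") as run:
    mlflow.pyfunc.log_model(
        artifact_path="xgb_pyfunc",
        python_model=model_xgb,
        artifacts=artifacts_xgb,
        conda_env=conda_env_xgb,
        signature=signature_xgb,
    )
    run_id_xgb = run.info.run_id
```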
It is now possible to load the logged model using mlflow.pyfunc.load_model
and use it for predictions:
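A minimal sketch of the loading step, assuming the run id and artifact path from the logging sketch above:

```python
# Hedged sketch: load the logged xgboost pyfunc model and predict on the test set
loaded_model_xgb = mlflow.pyfunc.load_model(f"runs:/{run_id_xgb}/xgb_pyfunc")
y_pred_xgb = loaded_model_xgb.predict(X_test)
#
# Compare actual prices with predictions (as displayed below)
pd.DataFrame({"actual prices": y_test.values, "predictions": y_pred_xgb}).head()
```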
 | actual prices | predictions |
---|---|---|
0 | 559 | 524.269531 |
1 | 2201 | 1795.014160 |
2 | 1238 | 1029.636230 |
3 | 1304 | 1096.781372 |
4 | 6901 | 10364.128906 |
Let's compute the RMSE for this model:
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "4d58d9dd-4ba0-46d9-8330-074a6b787635", "showTitle": false, "title": "" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "RMSE for custom xgboost model: 1457.7130185941312\n" ] } ], "source": [ "print(\"RMSE for custom xgboost model: \", mean_squared_error(y_test, y_pred_xgb, squared=False))" ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": {}, "inputWidgets": {}, "nuid": "1dcd5a9c-4a94-4bcb-839c-4d46906250cd", "showTitle": false, "title": "" } }, "source": [ "It's also possible to load the custom model as a Spark UDF using mlflow.pyfunc.spark_udf
and run predictions on a Spark DataFrame:
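A minimal sketch, following the same pattern as for the scikit-learn model and assuming the xgboost model URI from above:

```python
from pyspark.sql.functions import struct
#
# Hedged sketch: wrap the logged xgboost pyfunc model as a Spark UDF
predict_udf_xgb = mlflow.pyfunc.spark_udf(spark, f"runs:/{run_id_xgb}/xgb_pyfunc")
spark_test_df = spark.createDataFrame(X_test)
# display() is Databricks-specific; use .show() elsewhere
display(spark_test_df.withColumn("prediction", predict_udf_xgb(struct(*spark_test_df.columns))))
```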
carat | depth | table | x | y | z | prediction |
---|---|---|---|---|---|---|
0.24 | 62.1 | 56.0 | 3.97 | 4.0 | 2.47 | 818.0107421875 |
0.58 | 60.0 | 57.0 | 5.44 | 5.42 | 3.26 | 2764.87451171875 |
0.4 | 62.1 | 55.0 | 4.76 | 4.74 | 2.95 | 1567.1573486328125 |
0.43 | 60.8 | 57.0 | 4.92 | 4.89 | 2.98 | 1833.7303466796875 |
1.55 | 62.3 | 55.0 | 7.44 | 7.37 | 4.61 | 11748.9375 |