{ "cells": [ { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "58fab4bb-231e-48cf-8ed4-fc15a1b22845", "showTitle": false, "title": "" } }, "source": [ "

Databricks-ML-professional-S02a-Preprocessing-Logic

" ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": {}, "inputWidgets": {}, "nuid": "af7e15d6-d01f-4184-bbfb-2b17f41909d2", "showTitle": false, "title": "" } }, "source": [ "
\n", "
\n", "

This notebook covers the following requirements:


\n", "Preprocessing Logic:\n", "\n", "
\n", "

Download this notebook in ipynb format here.

\n", "
\n", "
" ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "2d6aaf81-c559-44bd-bc70-25852c40193d", "showTitle": false, "title": "" } }, "source": [ "\n", "
\n", "1. Describe an MLflow flavor and the benefits of using MLflow flavors
" ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "5f56a473-a96c-4e5e-9819-05c6d6d9f5e9", "showTitle": false, "title": "" } }, "source": [ "Flavor refers to the library of framework a ML model is built on.\n", "\n", "
" ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "18e681ce-93ed-4c38-814e-6d851bb56281", "showTitle": false, "title": "" } }, "source": [ "\n", "
\n", "2. Describe the advantages of using the pyfunc MLflow flavor
" ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "b8964a28-4864-413f-8a84-dba563093362", "showTitle": false, "title": "" } }, "source": [ "

The python_function (pyfunc) flavor provides a generic way of bundling models: any model logged as a pyfunc exposes the same predict interface, whatever library it was built with, and custom preprocessing logic and artifacts can be packaged together with it.
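A minimal sketch (the AddOneModel class is illustrative, not from this notebook): any Python logic wrapped in mlflow.pyfunc.PythonModel is logged, loaded, and called for prediction through the same uniform API as a model from any supported library.

```python
import mlflow
import pandas as pd

class AddOneModel(mlflow.pyfunc.PythonModel):
    def predict(self, context, model_input):
        # model_input arrives as a pandas DataFrame in deployment systems
        return model_input.sum(axis=1) + 1

with mlflow.start_run() as run:
    mlflow.pyfunc.log_model("add_one", python_model=AddOneModel())

loaded = mlflow.pyfunc.load_model(f"runs:/{run.info.run_id}/add_one")
loaded.predict(pd.DataFrame({"a": [1, 2], "b": [3, 4]}))  # same API as any other flavor
```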

\n", "" ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "b5f6d0da-1d81-4fa0-9770-a9e4d6863534", "showTitle": false, "title": "" } }, "source": [ "\n", "
\n", "3. Describe the process and benefits of including preprocessing logic and context in\n", "custom model classes and objects" ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "881d8292-e64d-4ef3-9ed4-7be35a45f83b", "showTitle": false, "title": "" } }, "source": [ "
Let's illustrate this requirement with an example.
\n", "

Log two models to MLflow using mlflow.pyfunc.log_model:

\n", "\n", "

Then load them back the same way and use them for prediction via the mlflow.pyfunc.load_model() function.

" ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": {}, "inputWidgets": {}, "nuid": "e1640d48-4fd7-4399-8581-def6b9419fc4", "showTitle": false, "title": "" } }, "source": [ "Load some libraries" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "8a2d2e59-7426-4d5f-8d97-3dcff6e5151d", "showTitle": false, "title": "" } }, "outputs": [], "source": [ "import pandas as pd\n", "import seaborn as sns\n", "#\n", "import sklearn\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.metrics import mean_squared_error\n", "#\n", "import mlflow\n", "from mlflow.models.signature import infer_signature\n", "#\n", "import logging\n", "import json \n", "import os\n", "from sys import version_info\n", "#\n", "import xgboost" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "85c05c1b-015d-405a-b6be-f8484a985d96", "showTitle": false, "title": "" } }, "outputs": [], "source": [ "logging.getLogger(\"mlflow\").setLevel(logging.FATAL)" ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": {}, "inputWidgets": {}, "nuid": "a7530ffa-de90-4fa2-a8f3-e2864f0d55c1", "showTitle": false, "title": "" } }, "source": [ "

Prepare train and test sets:

" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "22e52c7e-16d0-4038-a862-831fcc9af0d7", "showTitle": false, "title": "" } }, "outputs": [], "source": [ "diamonds_df = sns.load_dataset('diamonds').drop(['cut', 'color', 'clarity'], axis=1)\n", "#\n", "X_train, X_test, y_train, y_test = train_test_split(diamonds_df.drop([\"price\"], axis=1), diamonds_df[\"price\"], random_state=42)" ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": {}, "inputWidgets": {}, "nuid": "68a8a6ab-9cec-4179-8a52-5ca828c369e9", "showTitle": false, "title": "" } }, "source": [ "
\n", "scikit-learn
" ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": {}, "inputWidgets": {}, "nuid": "d3a73529-fad0-4339-b2d1-57e633760440", "showTitle": false, "title": "" } }, "source": [ "

Definition of the custom scikit-learn model:

" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "81984040-0ed5-42db-b176-93ee6a54c791", "showTitle": false, "title": "" } }, "outputs": [], "source": [ "class sklearn_model(mlflow.pyfunc.PythonModel):\n", "\n", " def __init__(self, params):\n", " \"\"\" Initialize with just the model hyperparameters \"\"\"\n", " #\n", " self.params = params\n", " self.rf_model = None\n", " self.config = None\n", " \n", " def load_context(self, context=None, config_path=None):\n", " \"\"\" When loading a pyfunc, this method runs automatically with the related\n", " context. This method is designed to perform the same functionality when\n", " run in a notebook or a downstream operation (like a REST endpoint).\n", " If the `context` object is provided, it will load the path to a config from \n", " that object (this happens with `mlflow.pyfunc.load_model()` is called).\n", " If the `config_path` argument is provided instead, it uses this argument\n", " in order to load in the config. \"\"\"\n", " #\n", " if context: # This block executes for server run\n", " config_path = context.artifacts[\"config_path\"]\n", " else: # This block executes for notebook run\n", " pass\n", "\n", " self.config = json.load(open(config_path))\n", " \n", " def preprocess_input(self, model_input):\n", " \"\"\" Return pre-processed model_input \"\"\"\n", " #\n", " # any preprocessing can be done there. For the example purpose, let's just apply a Robust Scaler\n", " from sklearn.preprocessing import RobustScaler\n", " #\n", " for c in list(model_input.columns):\n", " model_input[c] = RobustScaler().fit_transform(model_input[[c]])\n", " #\n", " return model_input\n", " \n", " def fit(self, X_train, y_train):\n", " \"\"\" Uses the same preprocessing logic to fit the model \"\"\"\n", " #\n", " from sklearn.ensemble import RandomForestRegressor\n", " #\n", " processed_model_input = self.preprocess_input(X_train)\n", " rf_model = RandomForestRegressor(**self.params)\n", " rf_model.fit(processed_model_input, y_train)\n", " #\n", " self.rf_model = rf_model\n", " \n", " def predict(self, context, model_input):\n", " \"\"\" This is the main entrance to the model in deployment systems \"\"\"\n", " #\n", " processed_model_input = self.preprocess_input(model_input.copy())\n", " return self.rf_model.predict(processed_model_input)" ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": {}, "inputWidgets": {}, "nuid": "bac0270f-8c69-41a7-bef4-d217a1525f0b", "showTitle": false, "title": "" } }, "source": [ "

Definition of the parameters for the custom scikit-learn model:

" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "cda8a381-475b-44e7-91c9-21c7de85b578", "showTitle": false, "title": "" } }, "outputs": [], "source": [ "params_sklearn = {\n", " \"n_estimators\": 15, \n", " \"max_depth\": 5\n", "}\n", "#\n", "# Designate a path\n", "config_path_sklearn = \"data_sklearn.json\"\n", "#\n", "# Save the results\n", "with open(config_path_sklearn, \"w\") as f:\n", " json.dump(params_sklearn, f)\n", "#\n", "# Generate an artifact object to saved\n", "# All paths to the associated values will be copied over when saving\n", "artifacts_sklearn = {\"config_path\": config_path_sklearn}" ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": {}, "inputWidgets": {}, "nuid": "44f30da5-8646-4c98-812b-3e0e926dde4a", "showTitle": false, "title": "" } }, "source": [ "

Instantiate the scikit-learn custom model:

" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "23e4dd34-2cf3-4d3e-988f-e858078a41ee", "showTitle": false, "title": "" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Out[67]: {'n_estimators': 15, 'max_depth': 5}" ] } ], "source": [ "model_sk = sklearn_model(params_sklearn)\n", "#\n", "model_sk.load_context(config_path=config_path_sklearn) \n", "#\n", "# Confirm the config has loaded\n", "model_sk.config" ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "35096555-48d2-43c4-b2b1-10d53684245a", "showTitle": false, "title": "" } }, "source": [ "

Train the custom scikit-learn model on the training set:

" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "e136346b-8298-4267-9b8e-39f6cedcf04b", "showTitle": false, "title": "" } }, "outputs": [], "source": [ "model_sk.fit(X_train, y_train)" ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": {}, "inputWidgets": {}, "nuid": "a1ac131d-d420-45a5-af3a-80253cdd55cc", "showTitle": false, "title": "" } }, "source": [ "

Verify the custom scikit-learn model can produce predictions on the test set:

" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "88b04fee-e220-42bb-abee-13754e948e73", "showTitle": false, "title": "" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
actual pricespredictions
0559581.063801
122011898.067468
21238987.555592
313041026.659660
4690110610.572961
\n", "
" ] }, "metadata": { "application/vnd.databricks.v1+output": { "addedWidgets": {}, "arguments": {}, "data": "
\n\n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
actual pricespredictions
0559581.063801
122011898.067468
21238987.555592
313041026.659660
4690110610.572961
\n
", "datasetInfos": [], "metadata": {}, "removedWidgets": [], "textData": null, "type": "htmlSandbox" } }, "output_type": "display_data" } ], "source": [ "predictions_sklearn = model_sk.predict(context=None, model_input=X_test)\n", "pd.DataFrame({'actual prices': list(y_test), 'predictions': list(predictions_sklearn)}).head(5)" ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": {}, "inputWidgets": {}, "nuid": "683a4676-1485-40ee-8737-aa7557d36086", "showTitle": false, "title": "" } }, "source": [ "

Optionally, prepare model signature:

" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "1e5bbfd7-479a-4439-8faa-11330fba5c14", "showTitle": false, "title": "" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Out[75]: inputs: \n", " ['carat': double, 'depth': double, 'table': double, 'x': double, 'y': double, 'z': double]\n", "outputs: \n", " [Tensor('float64', (-1,))]" ] } ], "source": [ "signature_sklearn = infer_signature(X_test, predictions_sklearn)\n", "signature_sklearn" ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": {}, "inputWidgets": {}, "nuid": "996f35a2-9a72-484c-a088-b5e467a4d6be", "showTitle": false, "title": "" } }, "source": [ "Generate the conda environment. This can be arbitrarily complex. This is necessary because when we use mlflow.sklearn, we automatically log the appropriate version of sklearn. With a pyfunc, we must manually construct our deployment environment. See more details about it in this video." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "76590fe1-6670-412d-a3b2-33f3c63bd939", "showTitle": false, "title": "" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Out[77]: {'channels': ['defaults'],\n", " 'dependencies': ['python=3.9.5',\n", " 'pip',\n", " {'pip': ['mlflow', 'scikit-learn==0.24.2']}],\n", " 'name': 'sklearn_env'}" ] } ], "source": [ "conda_env_sklearn = {\n", " \"channels\": [\"defaults\"],\n", " \"dependencies\": [\n", " f\"python={version_info.major}.{version_info.minor}.{version_info.micro}\",\n", " \"pip\",\n", " {\"pip\": [\"mlflow\",\n", " f\"scikit-learn=={sklearn.__version__}\"]\n", " },\n", " ],\n", " \"name\": \"sklearn_env\"\n", "}\n", "conda_env_sklearn" ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": {}, "inputWidgets": {}, "nuid": "bac0a6cd-3e4e-4d96-a038-510a14f01f31", "showTitle": false, "title": "" } }, "source": [ "

Log the model with mlflow.pyfunc.log_model, using the parameters defined previously:

" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "dde85070-85b9-408e-ada7-95e873b741fe", "showTitle": false, "title": "" } }, "outputs": [], "source": [ "with mlflow.start_run() as run:\n", " mlflow.pyfunc.log_model(\n", " \"sklearn_RFR\", \n", " python_model=model_sk, \n", " artifacts=artifacts_sklearn,\n", " conda_env=conda_env_sklearn,\n", " signature=signature_sklearn,\n", " input_example=X_test[:3] \n", " )" ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": {}, "inputWidgets": {}, "nuid": "49e76b67-a498-4c5a-8918-2bea2abacf41", "showTitle": false, "title": "" } }, "source": [ "

It is now possible to load the logged model using mlflow.pyfunc.load_model and use it for predictions:

" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "400c1358-9af2-4788-bdbb-16c955a7daa7", "showTitle": false, "title": "" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
actual pricespredictions
0559581.063801
122011898.067468
21238987.555592
313041026.659660
4690110610.572961
\n", "
" ] }, "metadata": { "application/vnd.databricks.v1+output": { "addedWidgets": {}, "arguments": {}, "data": "
\n\n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
actual pricespredictions
0559581.063801
122011898.067468
21238987.555592
313041026.659660
4690110610.572961
\n
", "datasetInfos": [], "metadata": {}, "removedWidgets": [], "textData": null, "type": "htmlSandbox" } }, "output_type": "display_data" } ], "source": [ "mlflow_pyfunc_model_path_sk = f\"runs:/{run.info.run_id}/sklearn_RFR\"\n", "loaded_preprocess_model_sk = mlflow.pyfunc.load_model(mlflow_pyfunc_model_path_sk)\n", "#\n", "y_pred = loaded_preprocess_model_sk.predict(X_test)\n", "#\n", "pd.DataFrame({'actual prices': list(y_test), 'predictions': list(y_pred)}).head(5)" ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": {}, "inputWidgets": {}, "nuid": "3a5cf86f-6ebf-4fe9-bfe3-43f9f8214c4f", "showTitle": false, "title": "" } }, "source": [ "

Let's compute the RMSE for this model:

" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "c36fb8df-812e-413f-a09c-bd23f4523de4", "showTitle": false, "title": "" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "RMSE for custom scikit-learn model: 1372.2569123988917\n" ] } ], "source": [ "print(\"RMSE for custom scikit-learn model: \", mean_squared_error(y_test, y_pred, squared=False))" ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": {}, "inputWidgets": {}, "nuid": "cecdfee5-0da4-4422-8342-5084e4c51968", "showTitle": false, "title": "" } }, "source": [ "

It's also possible to load the custom model as a Spark UDF using mlflow.pyfunc.spark_udf and predict on a Spark DataFrame:

" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "ce08403d-b575-466b-9108-0e4bc2446efb", "showTitle": false, "title": "" } }, "outputs": [ { "data": { "text/html": [ "
carat  depth  table  x     y     z     prediction
 0.24   62.1   56.0  3.97  4.00  2.47    581.0638013699318
 0.58   60.0   57.0  5.44  5.42  3.26   8459.162219485595
 0.40   62.1   55.0  4.76  4.74  2.95   1808.3104272102398
 0.43   60.8   57.0  4.92  4.89  2.98   2573.6626953761706
 1.55   62.3   55.0  7.44  7.37  4.61  14959.410444117686
" ] }, "metadata": { "application/vnd.databricks.v1+output": { "addedWidgets": {}, "aggData": [], "aggError": "", "aggOverflow": false, "aggSchema": [], "aggSeriesLimitReached": false, "aggType": "", "arguments": {}, "columnCustomDisplayInfos": {}, "data": [ [ 0.24, 62.1, 56, 3.97, 4, 2.47, 581.0638013699318 ], [ 0.58, 60, 57, 5.44, 5.42, 3.26, 8459.162219485595 ], [ 0.4, 62.1, 55, 4.76, 4.74, 2.95, 1808.3104272102398 ], [ 0.43, 60.8, 57, 4.92, 4.89, 2.98, 2573.6626953761706 ], [ 1.55, 62.3, 55, 7.44, 7.37, 4.61, 14959.410444117686 ] ], "datasetInfos": [], "dbfsResultPath": null, "isJsonSchema": true, "metadata": {}, "overflow": false, "plotOptions": { "customPlotOptions": {}, "displayType": "table", "pivotAggregation": null, "pivotColumns": null, "xColumns": null, "yColumns": null }, "removedWidgets": [], "schema": [ { "metadata": "{}", "name": "carat", "type": "\"double\"" }, { "metadata": "{}", "name": "depth", "type": "\"double\"" }, { "metadata": "{}", "name": "table", "type": "\"double\"" }, { "metadata": "{}", "name": "x", "type": "\"double\"" }, { "metadata": "{}", "name": "y", "type": "\"double\"" }, { "metadata": "{}", "name": "z", "type": "\"double\"" }, { "metadata": "{}", "name": "prediction", "type": "\"double\"" } ], "type": "table" } }, "output_type": "display_data" } ], "source": [ "sklearn_custom_predict = mlflow.pyfunc.spark_udf(spark, mlflow_pyfunc_model_path_sk)\n", "#\n", "display(spark.createDataFrame(X_test).withColumn('prediction', sklearn_custom_predict(*['carat', 'depth', 'table', 'x', 'y', 'z'])).limit(5))" ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": {}, "inputWidgets": {}, "nuid": "0a027803-ec01-42f9-8828-5b13fc61799e", "showTitle": false, "title": "" } }, "source": [ "
\n", "xgboost
" ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": {}, "inputWidgets": {}, "nuid": "c7fc1115-6c8c-4e8c-a980-3cbb4850a89c", "showTitle": false, "title": "" } }, "source": [ "

Definition of the custom xgboost model:

" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "67fb560f-9d95-463b-88a6-665e1b994cdd", "showTitle": false, "title": "" } }, "outputs": [], "source": [ "class xgboost_regressor(mlflow.pyfunc.PythonModel):\n", "\n", " def __init__(self, params):\n", " \"\"\" Initialize with just the model hyperparameters \"\"\"\n", " #\n", " self.params = params\n", " self.xgb_model = None\n", " self.config = None\n", " \n", " def load_context(self, context=None, config_path=None):\n", " \"\"\" When loading a pyfunc, this method runs automatically with the related\n", " context. This method is designed to perform the same functionality when\n", " run in a notebook or a downstream operation (like a REST endpoint).\n", " If the `context` object is provided, it will load the path to a config from \n", " that object (this happens with `mlflow.pyfunc.load_model()` is called).\n", " If the `config_path` argument is provided instead, it uses this argument\n", " in order to load in the config. \"\"\"\n", " #\n", " if context: # This block executes for server run\n", " config_path = context.artifacts[\"config_path\"]\n", " else: # This block executes for notebook run\n", " pass\n", "\n", " self.config = json.load(open(config_path))\n", " \n", " def preprocess_input(self, model_input):\n", " \"\"\" Return pre-processed model_input \"\"\"\n", " #\n", " # any preprocessing can be done there. For the example purpose, let's here apply a Standard Scaler\n", " from sklearn.preprocessing import StandardScaler\n", " #\n", " for c in list(model_input.columns):\n", " model_input[c] = StandardScaler().fit_transform(model_input[[c]])\n", " #\n", " return model_input\n", " \n", " def fit(self, X_train, y_train):\n", " \"\"\" Uses the same preprocessing logic to fit the model \"\"\"\n", " #\n", " from xgboost import XGBRegressor\n", " #\n", " processed_model_input = self.preprocess_input(X_train)\n", " xgb_model = XGBRegressor(**self.params)\n", " xgb_model.fit(processed_model_input, y_train)\n", " #\n", " self.xgb_model = xgb_model\n", " \n", " def predict(self, context, model_input):\n", " \"\"\" This is the main entrance to the model in deployment systems \"\"\"\n", " #\n", " processed_model_input = self.preprocess_input(model_input.copy())\n", " return self.xgb_model.predict(processed_model_input)" ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": {}, "inputWidgets": {}, "nuid": "140323f2-380a-474d-9471-b5c29a8cceb6", "showTitle": false, "title": "" } }, "source": [ "

Definition of the parameters for the custom xgboost model:

" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "271fbd54-06ee-403f-a5e5-c1ab7530d971", "showTitle": false, "title": "" } }, "outputs": [], "source": [ "params_xgb = {\n", " \"n_estimators\": 1000, \n", " \"max_depth\": 7,\n", " \"eta\": 0.1,\n", " \"subsample\": 0.7,\n", " \"colsample_bytree\": 0.8\n", "}\n", "\n", "# Designate a path\n", "config_path_xgb = \"data_xgb.json\"\n", "\n", "# Save the results\n", "with open(config_path_xgb, \"w\") as f:\n", " json.dump(params_xgb, f)\n", "\n", "# Generate an artifact object to saved\n", "# All paths to the associated values will be copied over when saving\n", "artifacts_xgb = {\"config_path\": config_path_xgb} " ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": {}, "inputWidgets": {}, "nuid": "4f7cdfdd-16c2-469b-98c6-c774297afbc1", "showTitle": false, "title": "" } }, "source": [ "

Instantiate the xgboost custom model:

" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "7ab28086-0719-4df8-a845-df589b5cd5d5", "showTitle": false, "title": "" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Out[91]: {'n_estimators': 1000,\n", " 'max_depth': 7,\n", " 'eta': 0.1,\n", " 'subsample': 0.7,\n", " 'colsample_bytree': 0.8}" ] } ], "source": [ "model_xgb = xgboost_regressor(params_xgb)\n", "#\n", "model_xgb.load_context(config_path=config_path_xgb) \n", "#\n", "# Confirm the config has loaded\n", "model_xgb.config" ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": {}, "inputWidgets": {}, "nuid": "7ddbe15e-6298-439b-957c-cd5cb210bf9b", "showTitle": false, "title": "" } }, "source": [ "

Train the custom xgboost model on the training set:

" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "5efb1de4-19e4-4ffb-afd1-df193368efd7", "showTitle": false, "title": "" } }, "outputs": [], "source": [ "model_xgb.fit(X_train, y_train)" ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": {}, "inputWidgets": {}, "nuid": "e5906970-186c-4790-a258-f1c12cfe98b5", "showTitle": false, "title": "" } }, "source": [ "

Verify the custom xgboost model can produce predictions on the test set:

" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "f0fbaded-da90-4b38-a4db-7778f4349014", "showTitle": false, "title": "" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
actual pricespredictions
0559524.269531
122011795.014160
212381029.636230
313041096.781372
4690110364.128906
\n", "
" ] }, "metadata": { "application/vnd.databricks.v1+output": { "addedWidgets": {}, "arguments": {}, "data": "
\n\n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
actual pricespredictions
0559524.269531
122011795.014160
212381029.636230
313041096.781372
4690110364.128906
\n
", "datasetInfos": [], "metadata": {}, "removedWidgets": [], "textData": null, "type": "htmlSandbox" } }, "output_type": "display_data" } ], "source": [ "predictions_xgb = model_xgb.predict(context=None, model_input=X_test)\n", "pd.DataFrame({'actual prices': list(y_test), 'predictions': list(predictions_xgb)}).head(5)" ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": {}, "inputWidgets": {}, "nuid": "15084673-a76b-4a3c-810c-dde930326e9f", "showTitle": false, "title": "" } }, "source": [ "

Optionally, prepare model signature:

" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "c8cdbf39-9f42-4649-ad43-81e755a56918", "showTitle": false, "title": "" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Out[94]: inputs: \n", " ['carat': double, 'depth': double, 'table': double, 'x': double, 'y': double, 'z': double]\n", "outputs: \n", " [Tensor('float32', (-1,))]" ] } ], "source": [ "signature_xgb = infer_signature(X_test, predictions_xgb)\n", "signature_xgb" ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": {}, "inputWidgets": {}, "nuid": "e35d968c-ff8b-4051-a43a-5ca629868e07", "showTitle": false, "title": "" } }, "source": [ "Generate the conda environment. This can be arbitrarily complex. This is necessary because when we use mlflow.sklearn, we automatically log the appropriate version of sklearn. With a pyfunc, we must manually construct our deployment environment. See more details about it in this video." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "cf9a1aed-1f68-42ea-9ce9-61496ef40a05", "showTitle": false, "title": "" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Out[95]: {'channels': ['defaults'],\n", " 'dependencies': ['python=3.9.5',\n", " 'pip',\n", " {'pip': ['mlflow', 'xgboost==1.6.2']}],\n", " 'name': 'xgboost_env'}" ] } ], "source": [ "conda_env_xgb = {\n", " \"channels\": [\"defaults\"],\n", " \"dependencies\": [\n", " f\"python={version_info.major}.{version_info.minor}.{version_info.micro}\",\n", " \"pip\",\n", " {\"pip\": [\"mlflow\",\n", " f\"xgboost=={xgboost.__version__}\"]\n", " },\n", " ],\n", " \"name\": \"xgboost_env\"\n", "}\n", "conda_env_xgb" ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": {}, "inputWidgets": {}, "nuid": "cdc88282-cb0c-46e5-8ac3-e4c2c8e37ec0", "showTitle": false, "title": "" } }, "source": [ "

Log the model with mlflow.pyfunc.log_model, using the parameters defined previously:

" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "bdea7b0a-ffdf-4c61-9d80-32676c9db9ba", "showTitle": false, "title": "" } }, "outputs": [], "source": [ "with mlflow.start_run() as run:\n", " mlflow.pyfunc.log_model(\n", " \"xgb_regressor\", \n", " python_model=model_xgb, \n", " artifacts=artifacts_xgb,\n", " conda_env=conda_env_xgb,\n", " signature=signature_xgb,\n", " input_example=X_test[:3] \n", " )" ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": {}, "inputWidgets": {}, "nuid": "dff26cac-882a-4a38-b16b-1b36ea8529a2", "showTitle": false, "title": "" } }, "source": [ "

It is now possible to load the logged model using mlflow.pyfunc.load_model and use it for predictions:

" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "5e367849-9034-4355-9405-b7831ecb2912", "showTitle": false, "title": "" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
actual pricespredictions
0559524.269531
122011795.014160
212381029.636230
313041096.781372
4690110364.128906
\n", "
" ] }, "metadata": { "application/vnd.databricks.v1+output": { "addedWidgets": {}, "arguments": {}, "data": "
\n\n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
actual pricespredictions
0559524.269531
122011795.014160
212381029.636230
313041096.781372
4690110364.128906
\n
", "datasetInfos": [], "metadata": {}, "removedWidgets": [], "textData": null, "type": "htmlSandbox" } }, "output_type": "display_data" } ], "source": [ "mlflow_pyfunc_model_path_xgb = f\"runs:/{run.info.run_id}/xgb_regressor\"\n", "loaded_preprocess_model_xgb = mlflow.pyfunc.load_model(mlflow_pyfunc_model_path_xgb)\n", "#\n", "y_pred_xgb = loaded_preprocess_model_xgb.predict(X_test)\n", "#\n", "pd.DataFrame({'actual prices': list(y_test), 'predictions': list(y_pred_xgb)}).head(5)" ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": {}, "inputWidgets": {}, "nuid": "612c7afb-1c93-4bf7-94ec-90131208fc3f", "showTitle": false, "title": "" } }, "source": [ "

Let's compute the RMSE for this model:

" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "4d58d9dd-4ba0-46d9-8330-074a6b787635", "showTitle": false, "title": "" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "RMSE for custom xgboost model: 1457.7130185941312\n" ] } ], "source": [ "print(\"RMSE for custom xgboost model: \", mean_squared_error(y_test, y_pred_xgb, squared=False))" ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": {}, "inputWidgets": {}, "nuid": "1dcd5a9c-4a94-4bcb-839c-4d46906250cd", "showTitle": false, "title": "" } }, "source": [ "

It's also possible to load the custom model as a Spark UDF using mlflow.pyfunc.spark_udf and predict on a Spark DataFrame:

" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "080cb52f-0329-419b-886f-af8cb911f734", "showTitle": false, "title": "" } }, "outputs": [ { "data": { "text/html": [ "
carat  depth  table  x     y     z     prediction
 0.24   62.1   56.0  3.97  4.00  2.47    818.0107421875
 0.58   60.0   57.0  5.44  5.42  3.26   2764.87451171875
 0.40   62.1   55.0  4.76  4.74  2.95   1567.1573486328125
 0.43   60.8   57.0  4.92  4.89  2.98   1833.7303466796875
 1.55   62.3   55.0  7.44  7.37  4.61  11748.9375
" ] }, "metadata": { "application/vnd.databricks.v1+output": { "addedWidgets": {}, "aggData": [], "aggError": "", "aggOverflow": false, "aggSchema": [], "aggSeriesLimitReached": false, "aggType": "", "arguments": {}, "columnCustomDisplayInfos": {}, "data": [ [ 0.24, 62.1, 56, 3.97, 4, 2.47, 818.0107421875 ], [ 0.58, 60, 57, 5.44, 5.42, 3.26, 2764.87451171875 ], [ 0.4, 62.1, 55, 4.76, 4.74, 2.95, 1567.1573486328125 ], [ 0.43, 60.8, 57, 4.92, 4.89, 2.98, 1833.7303466796875 ], [ 1.55, 62.3, 55, 7.44, 7.37, 4.61, 11748.9375 ] ], "datasetInfos": [], "dbfsResultPath": null, "isJsonSchema": true, "metadata": {}, "overflow": false, "plotOptions": { "customPlotOptions": {}, "displayType": "table", "pivotAggregation": null, "pivotColumns": null, "xColumns": null, "yColumns": null }, "removedWidgets": [], "schema": [ { "metadata": "{}", "name": "carat", "type": "\"double\"" }, { "metadata": "{}", "name": "depth", "type": "\"double\"" }, { "metadata": "{}", "name": "table", "type": "\"double\"" }, { "metadata": "{}", "name": "x", "type": "\"double\"" }, { "metadata": "{}", "name": "y", "type": "\"double\"" }, { "metadata": "{}", "name": "z", "type": "\"double\"" }, { "metadata": "{}", "name": "prediction", "type": "\"double\"" } ], "type": "table" } }, "output_type": "display_data" } ], "source": [ "xgboost_custom_predict = mlflow.pyfunc.spark_udf(spark, mlflow_pyfunc_model_path_xgb)\n", "#\n", "display(spark.createDataFrame(X_test).withColumn('prediction', xgboost_custom_predict(*['carat', 'depth', 'table', 'x', 'y', 'z'])).limit(5))" ] } ], "metadata": { "application/vnd.databricks.v1+notebook": { "dashboards": [], "language": "python", "notebookMetadata": { "mostRecentlyExecutedCommandWithImplicitDF": { "commandId": -1, "dataframes": [ "_sqldf" ] }, "pythonIndentUnit": 2 }, "notebookName": "Databricks-ML-professional-S02a-Preprocessing-Logic", "widgets": {} }, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.10" } }, "nbformat": 4, "nbformat_minor": 4 }