{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "application/vnd.databricks.v1+cell": {
     "cellMetadata": {
      "byteLimit": 2048000,
      "rowLimit": 10000
     },
     "inputWidgets": {},
     "nuid": "58fab4bb-231e-48cf-8ed4-fc15a1b22845",
     "showTitle": false,
     "title": ""
    }
   },
   "source": [
    "<h4 style=\"font-variant-caps: small-caps;font-size:35pt;\">Databricks-ML-professional-S02a-Preprocessing-Logic</h4>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "application/vnd.databricks.v1+cell": {
     "cellMetadata": {},
     "inputWidgets": {},
     "nuid": "af7e15d6-d01f-4184-bbfb-2b17f41909d2",
     "showTitle": false,
     "title": ""
    }
   },
   "source": [
    "<div style='background-color:black;border-radius:5px;border-top:1px solid'></div>\n",
    "<br/>\n",
    "<p>This Notebook adds information related to the following requirements:</p><br/>\n",
    "<b>Preprocessing Logic:</b>\n",
    "<ul>\n",
    "<li>Describe an MLflow flavor and the benefits of using MLflow flavors</li>\n",
    "<li>Describe the advantages of using the pyfunc MLflow flavor</li>\n",
    "<li>Describe the process and benefits of including preprocessing logic and context in custom model classes and objects</li>\n",
    "</ul>\n",
    "<br/>\n",
    "<p><b>Download this notebook at format ipynb <a href=\"Databricks-ML-professional-S02a-Preprocessing-Logic.ipynb\">here</a>.</b></p>\n",
    "<br/>\n",
    "<div style='background-color:black;border-radius:5px;border-top:1px solid'></div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "application/vnd.databricks.v1+cell": {
     "cellMetadata": {
      "byteLimit": 2048000,
      "rowLimit": 10000
     },
     "inputWidgets": {},
     "nuid": "2d6aaf81-c559-44bd-bc70-25852c40193d",
     "showTitle": false,
     "title": ""
    }
   },
   "source": [
    "<a id=\"mlflowflavor\"></a>\n",
    "<div style='background-color:rgba(30, 144, 255, 0.1);border-radius:5px;padding:2px;'>\n",
    "<span style=\"font-variant-caps: small-caps;font-weight:700\">1. Describe an MLflow flavor and the benefits of using MLflow flavors</span></div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "application/vnd.databricks.v1+cell": {
     "cellMetadata": {
      "byteLimit": 2048000,
      "rowLimit": 10000
     },
     "inputWidgets": {},
     "nuid": "5f56a473-a96c-4e5e-9819-05c6d6d9f5e9",
     "showTitle": false,
     "title": ""
    }
   },
   "source": [
    "<b>Flavor refers to the library of framework a ML model is built on.</b>\n",
    "<ul>\n",
    "<li>For example the flavor of a model can be - among others -\n",
    "<ul><li><b>scikit-learn</b>: <code>mlflow.sklearn</code></li>\n",
    "    <li><b>Spark ML</b>: <code>mlflow.spark</code></li>\n",
    "    <li><b>Keras</b>: <code>mlflow.keras</code></li>\n",
    "</ul></li>\n",
    "<li><span style=\"text-decoration:underline\">Problem</span>: keeping models in their native flavor can make the deployments more challenging. </li>\n",
    "<li>The idea of flavor is to unify model artifact and api regardless the library used to build the model.</li>\n",
    "<li>This contributes to break siloes and make collaboration easier.</li>\n",
    "</ul>\n",
    "<div style=\"display:block;text-align:center\"><img width=\"500px\" src=\"https://i.ibb.co/z4ZG1wc/mlflow7.png\"/></div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "application/vnd.databricks.v1+cell": {
     "cellMetadata": {
      "byteLimit": 2048000,
      "rowLimit": 10000
     },
     "inputWidgets": {},
     "nuid": "18e681ce-93ed-4c38-814e-6d851bb56281",
     "showTitle": false,
     "title": ""
    }
   },
   "source": [
    "<a id=\"pyfuncmlflow\"></a>\n",
    "<div style='background-color:rgba(30, 144, 255, 0.1);border-radius:5px;padding:2px;'>\n",
    "<span style=\"font-variant-caps: small-caps;font-weight:700\">2. Describe the advantages of using the pyfunc MLflow flavor</span></div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "application/vnd.databricks.v1+cell": {
     "cellMetadata": {
      "byteLimit": 2048000,
      "rowLimit": 10000
     },
     "inputWidgets": {},
     "nuid": "b8964a28-4864-413f-8a84-dba563093362",
     "showTitle": false,
     "title": ""
    }
   },
   "source": [
    "<p><b>The <code>python_function</code> or <code>pyfunc</code> flavor of models gives a generic way of bundling models.</b></p>\n",
    "<ul>\n",
    "<li>Let deploy a python function without worrying about the underlying format of the model</li>\n",
    "<li>MLflow therefore maps any training framework to any deployment using these platform-agnostic representations, massively reducing the complexity of inference.</li>\n",
    "<li>A <code>pyfunc</code> is a generic python model that can define any arbitrary logic, regardless of the libraries used to train it.</li>\n",
    "<li>This object interoperates with any MLflow functionality, especially downstream scoring tools. As such, it's defined as a class with a related directory structure with all of the dependencies.</li>\n",
    "<li> It is then \"just an object\" with a various methods such as a predict method. Since it makes very few assumptions, it can be deployed using MLflow, SageMaker, a Spark UDF, or in any other environment.</li></ul>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "application/vnd.databricks.v1+cell": {
     "cellMetadata": {
      "byteLimit": 2048000,
      "rowLimit": 10000
     },
     "inputWidgets": {},
     "nuid": "b5f6d0da-1d81-4fa0-9770-a9e4d6863534",
     "showTitle": false,
     "title": ""
    }
   },
   "source": [
    "<a id=\"custommodels\"></a>\n",
    "<div style='background-color:rgba(30, 144, 255, 0.1);border-radius:5px;padding:2px;'>\n",
    "<span style=\"font-variant-caps: small-caps;font-weight:700\">3. Describe the process and benefits of including preprocessing logic and context in\n",
    "custom model classes and objects</span>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "application/vnd.databricks.v1+cell": {
     "cellMetadata": {
      "byteLimit": 2048000,
      "rowLimit": 10000
     },
     "inputWidgets": {},
     "nuid": "881d8292-e64d-4ef3-9ed4-7be35a45f83b",
     "showTitle": false,
     "title": ""
    }
   },
   "source": [
    "<div style='border-radius:5px;padding:2px;'><span style=\"font-variant-caps: small-caps;font-weight:700\">Let's illustrate this requirement with an example</span></div>\n",
    "<p>Log two models to mlflow using <code>mlflow.pyfunc.log_model</code>:</p>\n",
    "<ul><li>1 scikit-learn model</li><li>1 xgboost model</li></ul>\n",
    "<p>And use them later the same way for prediction using <code>pyfunc.load_model()</code> function.</p>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "application/vnd.databricks.v1+cell": {
     "cellMetadata": {},
     "inputWidgets": {},
     "nuid": "e1640d48-4fd7-4399-8581-def6b9419fc4",
     "showTitle": false,
     "title": ""
    }
   },
   "source": [
    "<b>Load some libraries</b>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "application/vnd.databricks.v1+cell": {
     "cellMetadata": {
      "byteLimit": 2048000,
      "rowLimit": 10000
     },
     "inputWidgets": {},
     "nuid": "8a2d2e59-7426-4d5f-8d97-3dcff6e5151d",
     "showTitle": false,
     "title": ""
    }
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import seaborn as sns\n",
    "#\n",
    "import sklearn\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.metrics import mean_squared_error\n",
    "#\n",
    "import mlflow\n",
    "from mlflow.models.signature import infer_signature\n",
    "#\n",
    "import logging\n",
    "import json \n",
    "import os\n",
    "from sys import version_info\n",
    "#\n",
    "import xgboost"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "application/vnd.databricks.v1+cell": {
     "cellMetadata": {
      "byteLimit": 2048000,
      "rowLimit": 10000
     },
     "inputWidgets": {},
     "nuid": "85c05c1b-015d-405a-b6be-f8484a985d96",
     "showTitle": false,
     "title": ""
    }
   },
   "outputs": [],
   "source": [
    "logging.getLogger(\"mlflow\").setLevel(logging.FATAL)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "application/vnd.databricks.v1+cell": {
     "cellMetadata": {},
     "inputWidgets": {},
     "nuid": "a7530ffa-de90-4fa2-a8f3-e2864f0d55c1",
     "showTitle": false,
     "title": ""
    }
   },
   "source": [
    "<p>Prepare train and test sets:</p>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "application/vnd.databricks.v1+cell": {
     "cellMetadata": {
      "byteLimit": 2048000,
      "rowLimit": 10000
     },
     "inputWidgets": {},
     "nuid": "22e52c7e-16d0-4038-a862-831fcc9af0d7",
     "showTitle": false,
     "title": ""
    }
   },
   "outputs": [],
   "source": [
    "diamonds_df = sns.load_dataset('diamonds').drop(['cut', 'color', 'clarity'], axis=1)\n",
    "#\n",
    "X_train, X_test, y_train, y_test = train_test_split(diamonds_df.drop([\"price\"], axis=1), diamonds_df[\"price\"], random_state=42)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "application/vnd.databricks.v1+cell": {
     "cellMetadata": {},
     "inputWidgets": {},
     "nuid": "68a8a6ab-9cec-4179-8a52-5ca828c369e9",
     "showTitle": false,
     "title": ""
    }
   },
   "source": [
    "<div style='background-color:rgba(0, 139, 69, 0.2);border-radius:5px;padding:2px;padding-left:10px'>\n",
    "<span style=\"font-variant-caps: small-caps;font-weight:700\">scikit-learn</span></div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "application/vnd.databricks.v1+cell": {
     "cellMetadata": {},
     "inputWidgets": {},
     "nuid": "d3a73529-fad0-4339-b2d1-57e633760440",
     "showTitle": false,
     "title": ""
    }
   },
   "source": [
    "<p>Definition of custom <b>scikit-learn model</b>:</p>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "application/vnd.databricks.v1+cell": {
     "cellMetadata": {
      "byteLimit": 2048000,
      "rowLimit": 10000
     },
     "inputWidgets": {},
     "nuid": "81984040-0ed5-42db-b176-93ee6a54c791",
     "showTitle": false,
     "title": ""
    }
   },
   "outputs": [],
   "source": [
    "class sklearn_model(mlflow.pyfunc.PythonModel):\n",
    "\n",
    "    def __init__(self, params):\n",
    "        \"\"\" Initialize with just the model hyperparameters \"\"\"\n",
    "        #\n",
    "        self.params = params\n",
    "        self.rf_model = None\n",
    "        self.config = None\n",
    "        \n",
    "    def load_context(self, context=None, config_path=None):\n",
    "        \"\"\" When loading a pyfunc, this method runs automatically with the related\n",
    "            context. This method is designed to perform the same functionality when\n",
    "            run in a notebook or a downstream operation (like a REST endpoint).\n",
    "            If the `context` object is provided, it will load the path to a config from \n",
    "            that object (this happens with `mlflow.pyfunc.load_model()` is called).\n",
    "            If the `config_path` argument is provided instead, it uses this argument\n",
    "            in order to load in the config. \"\"\"\n",
    "        #\n",
    "        if context: # This block executes for server run\n",
    "            config_path = context.artifacts[\"config_path\"]\n",
    "        else:       # This block executes for notebook run\n",
    "            pass\n",
    "\n",
    "        self.config = json.load(open(config_path))\n",
    "      \n",
    "    def preprocess_input(self, model_input):\n",
    "        \"\"\" Return pre-processed model_input \"\"\"\n",
    "        #\n",
    "        # any preprocessing can be done there. For the example purpose, let's just apply a Robust Scaler\n",
    "        from sklearn.preprocessing import RobustScaler\n",
    "        #\n",
    "        for c in list(model_input.columns):\n",
    "            model_input[c] = RobustScaler().fit_transform(model_input[[c]])\n",
    "        #\n",
    "        return model_input\n",
    "  \n",
    "    def fit(self, X_train, y_train):\n",
    "        \"\"\" Uses the same preprocessing logic to fit the model \"\"\"\n",
    "        #\n",
    "        from sklearn.ensemble import RandomForestRegressor\n",
    "        #\n",
    "        processed_model_input = self.preprocess_input(X_train)\n",
    "        rf_model = RandomForestRegressor(**self.params)\n",
    "        rf_model.fit(processed_model_input, y_train)\n",
    "        #\n",
    "        self.rf_model = rf_model\n",
    "    \n",
    "    def predict(self, context, model_input):\n",
    "        \"\"\" This is the main entrance to the model in deployment systems \"\"\"\n",
    "        #\n",
    "        processed_model_input = self.preprocess_input(model_input.copy())\n",
    "        return self.rf_model.predict(processed_model_input)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "application/vnd.databricks.v1+cell": {
     "cellMetadata": {},
     "inputWidgets": {},
     "nuid": "bac0270f-8c69-41a7-bef4-d217a1525f0b",
     "showTitle": false,
     "title": ""
    }
   },
   "source": [
    "<p>Definition of the parameters for the custom scikit-learn model:</p>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "application/vnd.databricks.v1+cell": {
     "cellMetadata": {
      "byteLimit": 2048000,
      "rowLimit": 10000
     },
     "inputWidgets": {},
     "nuid": "cda8a381-475b-44e7-91c9-21c7de85b578",
     "showTitle": false,
     "title": ""
    }
   },
   "outputs": [],
   "source": [
    "params_sklearn = {\n",
    "    \"n_estimators\": 15, \n",
    "    \"max_depth\": 5\n",
    "}\n",
    "#\n",
    "# Designate a path\n",
    "config_path_sklearn = \"data_sklearn.json\"\n",
    "#\n",
    "# Save the results\n",
    "with open(config_path_sklearn, \"w\") as f:\n",
    "    json.dump(params_sklearn, f)\n",
    "#\n",
    "# Generate an artifact object to saved\n",
    "# All paths to the associated values will be copied over when saving\n",
    "artifacts_sklearn = {\"config_path\": config_path_sklearn}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "application/vnd.databricks.v1+cell": {
     "cellMetadata": {},
     "inputWidgets": {},
     "nuid": "44f30da5-8646-4c98-812b-3e0e926dde4a",
     "showTitle": false,
     "title": ""
    }
   },
   "source": [
    "<p>Instantiate the scikit-learn custom model:</p>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "application/vnd.databricks.v1+cell": {
     "cellMetadata": {
      "byteLimit": 2048000,
      "rowLimit": 10000
     },
     "inputWidgets": {},
     "nuid": "23e4dd34-2cf3-4d3e-988f-e858078a41ee",
     "showTitle": false,
     "title": ""
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Out[67]: {'n_estimators': 15, 'max_depth': 5}"
     ]
    }
   ],
   "source": [
    "model_sk = sklearn_model(params_sklearn)\n",
    "#\n",
    "model_sk.load_context(config_path=config_path_sklearn) \n",
    "#\n",
    "# Confirm the config has loaded\n",
    "model_sk.config"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "application/vnd.databricks.v1+cell": {
     "cellMetadata": {
      "byteLimit": 2048000,
      "rowLimit": 10000
     },
     "inputWidgets": {},
     "nuid": "35096555-48d2-43c4-b2b1-10d53684245a",
     "showTitle": false,
     "title": ""
    }
   },
   "source": [
    "<p>Train the scikit-learn custom model on training set:</p>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "application/vnd.databricks.v1+cell": {
     "cellMetadata": {
      "byteLimit": 2048000,
      "rowLimit": 10000
     },
     "inputWidgets": {},
     "nuid": "e136346b-8298-4267-9b8e-39f6cedcf04b",
     "showTitle": false,
     "title": ""
    }
   },
   "outputs": [],
   "source": [
    "model_sk.fit(X_train, y_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "application/vnd.databricks.v1+cell": {
     "cellMetadata": {},
     "inputWidgets": {},
     "nuid": "a1ac131d-d420-45a5-af3a-80253cdd55cc",
     "showTitle": false,
     "title": ""
    }
   },
   "source": [
    "<p>Verify there can be predictions on test set using scikit-learn custom model:</p>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "application/vnd.databricks.v1+cell": {
     "cellMetadata": {
      "byteLimit": 2048000,
      "rowLimit": 10000
     },
     "inputWidgets": {},
     "nuid": "88b04fee-e220-42bb-abee-13754e948e73",
     "showTitle": false,
     "title": ""
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>actual prices</th>\n",
       "      <th>predictions</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>559</td>\n",
       "      <td>581.063801</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2201</td>\n",
       "      <td>1898.067468</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1238</td>\n",
       "      <td>987.555592</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1304</td>\n",
       "      <td>1026.659660</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>6901</td>\n",
       "      <td>10610.572961</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ]
     },
     "metadata": {
      "application/vnd.databricks.v1+output": {
       "addedWidgets": {},
       "arguments": {},
       "data": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>actual prices</th>\n      <th>predictions</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>559</td>\n      <td>581.063801</td>\n    </tr>\n    <tr>\n      <th>1</th>\n      <td>2201</td>\n      <td>1898.067468</td>\n    </tr>\n    <tr>\n      <th>2</th>\n      <td>1238</td>\n      <td>987.555592</td>\n    </tr>\n    <tr>\n      <th>3</th>\n      <td>1304</td>\n      <td>1026.659660</td>\n    </tr>\n    <tr>\n      <th>4</th>\n      <td>6901</td>\n      <td>10610.572961</td>\n    </tr>\n  </tbody>\n</table>\n</div>",
       "datasetInfos": [],
       "metadata": {},
       "removedWidgets": [],
       "textData": null,
       "type": "htmlSandbox"
      }
     },
     "output_type": "display_data"
    }
   ],
   "source": [
    "predictions_sklearn = model_sk.predict(context=None, model_input=X_test)\n",
    "pd.DataFrame({'actual prices': list(y_test), 'predictions': list(predictions_sklearn)}).head(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "application/vnd.databricks.v1+cell": {
     "cellMetadata": {},
     "inputWidgets": {},
     "nuid": "683a4676-1485-40ee-8737-aa7557d36086",
     "showTitle": false,
     "title": ""
    }
   },
   "source": [
    "<p>Optionally, prepare model signature:</p>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "application/vnd.databricks.v1+cell": {
     "cellMetadata": {
      "byteLimit": 2048000,
      "rowLimit": 10000
     },
     "inputWidgets": {},
     "nuid": "1e5bbfd7-479a-4439-8faa-11330fba5c14",
     "showTitle": false,
     "title": ""
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Out[75]: inputs: \n",
      "  ['carat': double, 'depth': double, 'table': double, 'x': double, 'y': double, 'z': double]\n",
      "outputs: \n",
      "  [Tensor('float64', (-1,))]"
     ]
    }
   ],
   "source": [
    "signature_sklearn = infer_signature(X_test, predictions_sklearn)\n",
    "signature_sklearn"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "application/vnd.databricks.v1+cell": {
     "cellMetadata": {},
     "inputWidgets": {},
     "nuid": "996f35a2-9a72-484c-a088-b5e467a4d6be",
     "showTitle": false,
     "title": ""
    }
   },
   "source": [
    "<i>Generate the conda environment. This can be arbitrarily complex. This is necessary because when we use <b>mlflow.sklearn</b>, we automatically log the appropriate version of <b>sklearn</b>. With a <b>pyfunc</b>, we must manually construct our deployment environment. See more details about it in <a href=\"https://customer-academy.databricks.com/learn/course/1522/play/9698/model-management-demo\" target=\"_blank\">this video</a>.</i>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "application/vnd.databricks.v1+cell": {
     "cellMetadata": {
      "byteLimit": 2048000,
      "rowLimit": 10000
     },
     "inputWidgets": {},
     "nuid": "76590fe1-6670-412d-a3b2-33f3c63bd939",
     "showTitle": false,
     "title": ""
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Out[77]: {'channels': ['defaults'],\n",
      " 'dependencies': ['python=3.9.5',\n",
      "  'pip',\n",
      "  {'pip': ['mlflow', 'scikit-learn==0.24.2']}],\n",
      " 'name': 'sklearn_env'}"
     ]
    }
   ],
   "source": [
    "conda_env_sklearn = {\n",
    "    \"channels\": [\"defaults\"],\n",
    "    \"dependencies\": [\n",
    "        f\"python={version_info.major}.{version_info.minor}.{version_info.micro}\",\n",
    "        \"pip\",\n",
    "        {\"pip\": [\"mlflow\",\n",
    "                 f\"scikit-learn=={sklearn.__version__}\"]\n",
    "        },\n",
    "    ],\n",
    "    \"name\": \"sklearn_env\"\n",
    "}\n",
    "conda_env_sklearn"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "application/vnd.databricks.v1+cell": {
     "cellMetadata": {},
     "inputWidgets": {},
     "nuid": "bac0a6cd-3e4e-4d96-a038-510a14f01f31",
     "showTitle": false,
     "title": ""
    }
   },
   "source": [
    "<p>Save the model using <code>mlflow.pyfunc.log_model</code> using the parameters defined previously:</p>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "application/vnd.databricks.v1+cell": {
     "cellMetadata": {
      "byteLimit": 2048000,
      "rowLimit": 10000
     },
     "inputWidgets": {},
     "nuid": "dde85070-85b9-408e-ada7-95e873b741fe",
     "showTitle": false,
     "title": ""
    }
   },
   "outputs": [],
   "source": [
    "with mlflow.start_run() as run:\n",
    "    mlflow.pyfunc.log_model(\n",
    "        \"sklearn_RFR\", \n",
    "        python_model=model_sk, \n",
    "        artifacts=artifacts_sklearn,\n",
    "        conda_env=conda_env_sklearn,\n",
    "        signature=signature_sklearn,\n",
    "        input_example=X_test[:3] \n",
    "  )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "application/vnd.databricks.v1+cell": {
     "cellMetadata": {},
     "inputWidgets": {},
     "nuid": "49e76b67-a498-4c5a-8918-2bea2abacf41",
     "showTitle": false,
     "title": ""
    }
   },
   "source": [
    "<p>It is now possible to load the logged model using <code>mlflow.pyfunc.load_model</code> and use it for predictions:</p>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "application/vnd.databricks.v1+cell": {
     "cellMetadata": {
      "byteLimit": 2048000,
      "rowLimit": 10000
     },
     "inputWidgets": {},
     "nuid": "400c1358-9af2-4788-bdbb-16c955a7daa7",
     "showTitle": false,
     "title": ""
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>actual prices</th>\n",
       "      <th>predictions</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>559</td>\n",
       "      <td>581.063801</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2201</td>\n",
       "      <td>1898.067468</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1238</td>\n",
       "      <td>987.555592</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1304</td>\n",
       "      <td>1026.659660</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>6901</td>\n",
       "      <td>10610.572961</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ]
     },
     "metadata": {
      "application/vnd.databricks.v1+output": {
       "addedWidgets": {},
       "arguments": {},
       "data": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>actual prices</th>\n      <th>predictions</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>559</td>\n      <td>581.063801</td>\n    </tr>\n    <tr>\n      <th>1</th>\n      <td>2201</td>\n      <td>1898.067468</td>\n    </tr>\n    <tr>\n      <th>2</th>\n      <td>1238</td>\n      <td>987.555592</td>\n    </tr>\n    <tr>\n      <th>3</th>\n      <td>1304</td>\n      <td>1026.659660</td>\n    </tr>\n    <tr>\n      <th>4</th>\n      <td>6901</td>\n      <td>10610.572961</td>\n    </tr>\n  </tbody>\n</table>\n</div>",
       "datasetInfos": [],
       "metadata": {},
       "removedWidgets": [],
       "textData": null,
       "type": "htmlSandbox"
      }
     },
     "output_type": "display_data"
    }
   ],
   "source": [
    "mlflow_pyfunc_model_path_sk = f\"runs:/{run.info.run_id}/sklearn_RFR\"\n",
    "loaded_preprocess_model_sk = mlflow.pyfunc.load_model(mlflow_pyfunc_model_path_sk)\n",
    "#\n",
    "y_pred = loaded_preprocess_model_sk.predict(X_test)\n",
    "#\n",
    "pd.DataFrame({'actual prices': list(y_test), 'predictions': list(y_pred)}).head(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "application/vnd.databricks.v1+cell": {
     "cellMetadata": {},
     "inputWidgets": {},
     "nuid": "3a5cf86f-6ebf-4fe9-bfe3-43f9f8214c4f",
     "showTitle": false,
     "title": ""
    }
   },
   "source": [
    "<p>Let's score RMSE for this model:</p>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "application/vnd.databricks.v1+cell": {
     "cellMetadata": {
      "byteLimit": 2048000,
      "rowLimit": 10000
     },
     "inputWidgets": {},
     "nuid": "c36fb8df-812e-413f-a09c-bd23f4523de4",
     "showTitle": false,
     "title": ""
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "RMSE for custom scikit-learn model:  1372.2569123988917\n"
     ]
    }
   ],
   "source": [
    "print(\"RMSE for custom scikit-learn model: \", mean_squared_error(y_test, y_pred, squared=False))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "application/vnd.databricks.v1+cell": {
     "cellMetadata": {},
     "inputWidgets": {},
     "nuid": "cecdfee5-0da4-4422-8342-5084e4c51968",
     "showTitle": false,
     "title": ""
    }
   },
   "source": [
    "<p>It's also possible to load the custom model as a <b>Spark UDF</b> using <code>mlflow.pyfunc.spark_udf</code> and predict in a Spark dataframe:</p>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "application/vnd.databricks.v1+cell": {
     "cellMetadata": {
      "byteLimit": 2048000,
      "rowLimit": 10000
     },
     "inputWidgets": {},
     "nuid": "ce08403d-b575-466b-9108-0e4bc2446efb",
     "showTitle": false,
     "title": ""
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<style scoped>\n",
       "  .table-result-container {\n",
       "    max-height: 300px;\n",
       "    overflow: auto;\n",
       "  }\n",
       "  table, th, td {\n",
       "    border: 1px solid black;\n",
       "    border-collapse: collapse;\n",
       "  }\n",
       "  th, td {\n",
       "    padding: 5px;\n",
       "  }\n",
       "  th {\n",
       "    text-align: left;\n",
       "  }\n",
       "</style><div class='table-result-container'><table class='table-result'><thead style='background-color: white'><tr><th>carat</th><th>depth</th><th>table</th><th>x</th><th>y</th><th>z</th><th>prediction</th></tr></thead><tbody><tr><td>0.24</td><td>62.1</td><td>56.0</td><td>3.97</td><td>4.0</td><td>2.47</td><td>581.0638013699318</td></tr><tr><td>0.58</td><td>60.0</td><td>57.0</td><td>5.44</td><td>5.42</td><td>3.26</td><td>8459.162219485595</td></tr><tr><td>0.4</td><td>62.1</td><td>55.0</td><td>4.76</td><td>4.74</td><td>2.95</td><td>1808.3104272102398</td></tr><tr><td>0.43</td><td>60.8</td><td>57.0</td><td>4.92</td><td>4.89</td><td>2.98</td><td>2573.6626953761706</td></tr><tr><td>1.55</td><td>62.3</td><td>55.0</td><td>7.44</td><td>7.37</td><td>4.61</td><td>14959.410444117686</td></tr></tbody></table></div>"
      ]
     },
     "metadata": {
      "application/vnd.databricks.v1+output": {
       "addedWidgets": {},
       "aggData": [],
       "aggError": "",
       "aggOverflow": false,
       "aggSchema": [],
       "aggSeriesLimitReached": false,
       "aggType": "",
       "arguments": {},
       "columnCustomDisplayInfos": {},
       "data": [
        [
         0.24,
         62.1,
         56,
         3.97,
         4,
         2.47,
         581.0638013699318
        ],
        [
         0.58,
         60,
         57,
         5.44,
         5.42,
         3.26,
         8459.162219485595
        ],
        [
         0.4,
         62.1,
         55,
         4.76,
         4.74,
         2.95,
         1808.3104272102398
        ],
        [
         0.43,
         60.8,
         57,
         4.92,
         4.89,
         2.98,
         2573.6626953761706
        ],
        [
         1.55,
         62.3,
         55,
         7.44,
         7.37,
         4.61,
         14959.410444117686
        ]
       ],
       "datasetInfos": [],
       "dbfsResultPath": null,
       "isJsonSchema": true,
       "metadata": {},
       "overflow": false,
       "plotOptions": {
        "customPlotOptions": {},
        "displayType": "table",
        "pivotAggregation": null,
        "pivotColumns": null,
        "xColumns": null,
        "yColumns": null
       },
       "removedWidgets": [],
       "schema": [
        {
         "metadata": "{}",
         "name": "carat",
         "type": "\"double\""
        },
        {
         "metadata": "{}",
         "name": "depth",
         "type": "\"double\""
        },
        {
         "metadata": "{}",
         "name": "table",
         "type": "\"double\""
        },
        {
         "metadata": "{}",
         "name": "x",
         "type": "\"double\""
        },
        {
         "metadata": "{}",
         "name": "y",
         "type": "\"double\""
        },
        {
         "metadata": "{}",
         "name": "z",
         "type": "\"double\""
        },
        {
         "metadata": "{}",
         "name": "prediction",
         "type": "\"double\""
        }
       ],
       "type": "table"
      }
     },
     "output_type": "display_data"
    }
   ],
   "source": [
    "sklearn_custom_predict = mlflow.pyfunc.spark_udf(spark, mlflow_pyfunc_model_path_sk)\n",
    "#\n",
    "display(spark.createDataFrame(X_test).withColumn('prediction', sklearn_custom_predict(*['carat', 'depth', 'table', 'x', 'y', 'z'])).limit(5))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "application/vnd.databricks.v1+cell": {
     "cellMetadata": {},
     "inputWidgets": {},
     "nuid": "0a027803-ec01-42f9-8828-5b13fc61799e",
     "showTitle": false,
     "title": ""
    }
   },
   "source": [
    "<div style='background-color:rgba(0, 139, 69, 0.2);border-radius:5px;padding:2px;padding-left:10px'>\n",
    "<span style=\"font-variant-caps: small-caps;font-weight:700\">xgboost</span></div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "application/vnd.databricks.v1+cell": {
     "cellMetadata": {},
     "inputWidgets": {},
     "nuid": "c7fc1115-6c8c-4e8c-a980-3cbb4850a89c",
     "showTitle": false,
     "title": ""
    }
   },
   "source": [
    "<p>Definition of <b>custom xgboost model</b>:</p>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "application/vnd.databricks.v1+cell": {
     "cellMetadata": {
      "byteLimit": 2048000,
      "rowLimit": 10000
     },
     "inputWidgets": {},
     "nuid": "67fb560f-9d95-463b-88a6-665e1b994cdd",
     "showTitle": false,
     "title": ""
    }
   },
   "outputs": [],
   "source": [
    "class xgboost_regressor(mlflow.pyfunc.PythonModel):\n",
    "\n",
    "    def __init__(self, params):\n",
    "        \"\"\" Initialize with just the model hyperparameters \"\"\"\n",
    "        #\n",
    "        self.params = params\n",
    "        self.xgb_model = None\n",
    "        self.config = None\n",
    "        \n",
    "    def load_context(self, context=None, config_path=None):\n",
    "        \"\"\" When loading a pyfunc, this method runs automatically with the related\n",
    "            context. This method is designed to perform the same functionality when\n",
    "            run in a notebook or a downstream operation (like a REST endpoint).\n",
    "            If the `context` object is provided, it will load the path to a config from \n",
    "            that object (this happens with `mlflow.pyfunc.load_model()` is called).\n",
    "            If the `config_path` argument is provided instead, it uses this argument\n",
    "            in order to load in the config. \"\"\"\n",
    "        #\n",
    "        if context: # This block executes for server run\n",
    "            config_path = context.artifacts[\"config_path\"]\n",
    "        else:       # This block executes for notebook run\n",
    "            pass\n",
    "\n",
    "        self.config = json.load(open(config_path))\n",
    "      \n",
    "    def preprocess_input(self, model_input):\n",
    "        \"\"\" Return pre-processed model_input \"\"\"\n",
    "        #\n",
    "        # any preprocessing can be done there. For the example purpose, let's here apply a Standard Scaler\n",
    "        from sklearn.preprocessing import StandardScaler\n",
    "        #\n",
    "        for c in list(model_input.columns):\n",
    "            model_input[c] = StandardScaler().fit_transform(model_input[[c]])\n",
    "        #\n",
    "        return model_input\n",
    "  \n",
    "    def fit(self, X_train, y_train):\n",
    "        \"\"\" Uses the same preprocessing logic to fit the model \"\"\"\n",
    "        #\n",
    "        from xgboost import XGBRegressor\n",
    "        #\n",
    "        processed_model_input = self.preprocess_input(X_train)\n",
    "        xgb_model = XGBRegressor(**self.params)\n",
    "        xgb_model.fit(processed_model_input, y_train)\n",
    "        #\n",
    "        self.xgb_model = xgb_model\n",
    "    \n",
    "    def predict(self, context, model_input):\n",
    "        \"\"\" This is the main entrance to the model in deployment systems \"\"\"\n",
    "        #\n",
    "        processed_model_input = self.preprocess_input(model_input.copy())\n",
    "        return self.xgb_model.predict(processed_model_input)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "application/vnd.databricks.v1+cell": {
     "cellMetadata": {},
     "inputWidgets": {},
     "nuid": "140323f2-380a-474d-9471-b5c29a8cceb6",
     "showTitle": false,
     "title": ""
    }
   },
   "source": [
    "<p>Definition of the parameters for the custom xgboost model:</p>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "application/vnd.databricks.v1+cell": {
     "cellMetadata": {
      "byteLimit": 2048000,
      "rowLimit": 10000
     },
     "inputWidgets": {},
     "nuid": "271fbd54-06ee-403f-a5e5-c1ab7530d971",
     "showTitle": false,
     "title": ""
    }
   },
   "outputs": [],
   "source": [
    "params_xgb = {\n",
    "    \"n_estimators\": 1000, \n",
    "    \"max_depth\": 7,\n",
    "    \"eta\": 0.1,\n",
    "    \"subsample\": 0.7,\n",
    "    \"colsample_bytree\": 0.8\n",
    "}\n",
    "\n",
    "# Designate a path\n",
    "config_path_xgb = \"data_xgb.json\"\n",
    "\n",
    "# Save the results\n",
    "with open(config_path_xgb, \"w\") as f:\n",
    "    json.dump(params_xgb, f)\n",
    "\n",
    "# Generate an artifact object to saved\n",
    "# All paths to the associated values will be copied over when saving\n",
    "artifacts_xgb = {\"config_path\": config_path_xgb} "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "application/vnd.databricks.v1+cell": {
     "cellMetadata": {},
     "inputWidgets": {},
     "nuid": "4f7cdfdd-16c2-469b-98c6-c774297afbc1",
     "showTitle": false,
     "title": ""
    }
   },
   "source": [
    "<p>Instantiate the xgboost custom model:</p>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "application/vnd.databricks.v1+cell": {
     "cellMetadata": {
      "byteLimit": 2048000,
      "rowLimit": 10000
     },
     "inputWidgets": {},
     "nuid": "7ab28086-0719-4df8-a845-df589b5cd5d5",
     "showTitle": false,
     "title": ""
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Out[91]: {'n_estimators': 1000,\n",
      " 'max_depth': 7,\n",
      " 'eta': 0.1,\n",
      " 'subsample': 0.7,\n",
      " 'colsample_bytree': 0.8}"
     ]
    }
   ],
   "source": [
    "model_xgb = xgboost_regressor(params_xgb)\n",
    "#\n",
    "model_xgb.load_context(config_path=config_path_xgb) \n",
    "#\n",
    "# Confirm the config has loaded\n",
    "model_xgb.config"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "application/vnd.databricks.v1+cell": {
     "cellMetadata": {},
     "inputWidgets": {},
     "nuid": "7ddbe15e-6298-439b-957c-cd5cb210bf9b",
     "showTitle": false,
     "title": ""
    }
   },
   "source": [
    "<p>Train the xgboost custom model on training set:</p>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "application/vnd.databricks.v1+cell": {
     "cellMetadata": {
      "byteLimit": 2048000,
      "rowLimit": 10000
     },
     "inputWidgets": {},
     "nuid": "5efb1de4-19e4-4ffb-afd1-df193368efd7",
     "showTitle": false,
     "title": ""
    }
   },
   "outputs": [],
   "source": [
    "model_xgb.fit(X_train, y_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "application/vnd.databricks.v1+cell": {
     "cellMetadata": {},
     "inputWidgets": {},
     "nuid": "e5906970-186c-4790-a258-f1c12cfe98b5",
     "showTitle": false,
     "title": ""
    }
   },
   "source": [
    "<p>Verify there can be predictions on test set using xgboost custom model:</p>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "application/vnd.databricks.v1+cell": {
     "cellMetadata": {
      "byteLimit": 2048000,
      "rowLimit": 10000
     },
     "inputWidgets": {},
     "nuid": "f0fbaded-da90-4b38-a4db-7778f4349014",
     "showTitle": false,
     "title": ""
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>actual prices</th>\n",
       "      <th>predictions</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>559</td>\n",
       "      <td>524.269531</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2201</td>\n",
       "      <td>1795.014160</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1238</td>\n",
       "      <td>1029.636230</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1304</td>\n",
       "      <td>1096.781372</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>6901</td>\n",
       "      <td>10364.128906</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ]
     },
     "metadata": {
      "application/vnd.databricks.v1+output": {
       "addedWidgets": {},
       "arguments": {},
       "data": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>actual prices</th>\n      <th>predictions</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>559</td>\n      <td>524.269531</td>\n    </tr>\n    <tr>\n      <th>1</th>\n      <td>2201</td>\n      <td>1795.014160</td>\n    </tr>\n    <tr>\n      <th>2</th>\n      <td>1238</td>\n      <td>1029.636230</td>\n    </tr>\n    <tr>\n      <th>3</th>\n      <td>1304</td>\n      <td>1096.781372</td>\n    </tr>\n    <tr>\n      <th>4</th>\n      <td>6901</td>\n      <td>10364.128906</td>\n    </tr>\n  </tbody>\n</table>\n</div>",
       "datasetInfos": [],
       "metadata": {},
       "removedWidgets": [],
       "textData": null,
       "type": "htmlSandbox"
      }
     },
     "output_type": "display_data"
    }
   ],
   "source": [
    "predictions_xgb = model_xgb.predict(context=None, model_input=X_test)\n",
    "pd.DataFrame({'actual prices': list(y_test), 'predictions': list(predictions_xgb)}).head(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "application/vnd.databricks.v1+cell": {
     "cellMetadata": {},
     "inputWidgets": {},
     "nuid": "15084673-a76b-4a3c-810c-dde930326e9f",
     "showTitle": false,
     "title": ""
    }
   },
   "source": [
    "<p>Optionally, prepare model signature:</p>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "application/vnd.databricks.v1+cell": {
     "cellMetadata": {
      "byteLimit": 2048000,
      "rowLimit": 10000
     },
     "inputWidgets": {},
     "nuid": "c8cdbf39-9f42-4649-ad43-81e755a56918",
     "showTitle": false,
     "title": ""
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Out[94]: inputs: \n",
      "  ['carat': double, 'depth': double, 'table': double, 'x': double, 'y': double, 'z': double]\n",
      "outputs: \n",
      "  [Tensor('float32', (-1,))]"
     ]
    }
   ],
   "source": [
    "signature_xgb = infer_signature(X_test, predictions_xgb)\n",
    "signature_xgb"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "application/vnd.databricks.v1+cell": {
     "cellMetadata": {},
     "inputWidgets": {},
     "nuid": "e35d968c-ff8b-4051-a43a-5ca629868e07",
     "showTitle": false,
     "title": ""
    }
   },
   "source": [
    "<i>Generate the conda environment. This can be arbitrarily complex. This is necessary because when we use <b>mlflow.sklearn</b>, we automatically log the appropriate version of <b>sklearn</b>. With a <b>pyfunc</b>, we must manually construct our deployment environment. See more details about it in <a href=\"https://customer-academy.databricks.com/learn/course/1522/play/9698/model-management-demo\" target=\"_blank\">this video</a>.</i>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "application/vnd.databricks.v1+cell": {
     "cellMetadata": {
      "byteLimit": 2048000,
      "rowLimit": 10000
     },
     "inputWidgets": {},
     "nuid": "cf9a1aed-1f68-42ea-9ce9-61496ef40a05",
     "showTitle": false,
     "title": ""
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Out[95]: {'channels': ['defaults'],\n",
      " 'dependencies': ['python=3.9.5',\n",
      "  'pip',\n",
      "  {'pip': ['mlflow', 'xgboost==1.6.2']}],\n",
      " 'name': 'xgboost_env'}"
     ]
    }
   ],
   "source": [
    "conda_env_xgb = {\n",
    "    \"channels\": [\"defaults\"],\n",
    "    \"dependencies\": [\n",
    "        f\"python={version_info.major}.{version_info.minor}.{version_info.micro}\",\n",
    "        \"pip\",\n",
    "        {\"pip\": [\"mlflow\",\n",
    "                 f\"xgboost=={xgboost.__version__}\"]\n",
    "        },\n",
    "    ],\n",
    "    \"name\": \"xgboost_env\"\n",
    "}\n",
    "conda_env_xgb"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "application/vnd.databricks.v1+cell": {
     "cellMetadata": {},
     "inputWidgets": {},
     "nuid": "cdc88282-cb0c-46e5-8ac3-e4c2c8e37ec0",
     "showTitle": false,
     "title": ""
    }
   },
   "source": [
    "<p>Save the model using <code>mlflow.pyfunc.log_model</code> using the parameters defined previously:</p>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "application/vnd.databricks.v1+cell": {
     "cellMetadata": {
      "byteLimit": 2048000,
      "rowLimit": 10000
     },
     "inputWidgets": {},
     "nuid": "bdea7b0a-ffdf-4c61-9d80-32676c9db9ba",
     "showTitle": false,
     "title": ""
    }
   },
   "outputs": [],
   "source": [
    "with mlflow.start_run() as run:\n",
    "    mlflow.pyfunc.log_model(\n",
    "        \"xgb_regressor\", \n",
    "        python_model=model_xgb, \n",
    "        artifacts=artifacts_xgb,\n",
    "        conda_env=conda_env_xgb,\n",
    "        signature=signature_xgb,\n",
    "        input_example=X_test[:3] \n",
    "  )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "application/vnd.databricks.v1+cell": {
     "cellMetadata": {},
     "inputWidgets": {},
     "nuid": "dff26cac-882a-4a38-b16b-1b36ea8529a2",
     "showTitle": false,
     "title": ""
    }
   },
   "source": [
    "<p>It is now possible to load the logged model using <code>mlflow.pyfunc.load_model</code> and use it for predictions:</p>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "application/vnd.databricks.v1+cell": {
     "cellMetadata": {
      "byteLimit": 2048000,
      "rowLimit": 10000
     },
     "inputWidgets": {},
     "nuid": "5e367849-9034-4355-9405-b7831ecb2912",
     "showTitle": false,
     "title": ""
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>actual prices</th>\n",
       "      <th>predictions</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>559</td>\n",
       "      <td>524.269531</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2201</td>\n",
       "      <td>1795.014160</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1238</td>\n",
       "      <td>1029.636230</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1304</td>\n",
       "      <td>1096.781372</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>6901</td>\n",
       "      <td>10364.128906</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ]
     },
     "metadata": {
      "application/vnd.databricks.v1+output": {
       "addedWidgets": {},
       "arguments": {},
       "data": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>actual prices</th>\n      <th>predictions</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>559</td>\n      <td>524.269531</td>\n    </tr>\n    <tr>\n      <th>1</th>\n      <td>2201</td>\n      <td>1795.014160</td>\n    </tr>\n    <tr>\n      <th>2</th>\n      <td>1238</td>\n      <td>1029.636230</td>\n    </tr>\n    <tr>\n      <th>3</th>\n      <td>1304</td>\n      <td>1096.781372</td>\n    </tr>\n    <tr>\n      <th>4</th>\n      <td>6901</td>\n      <td>10364.128906</td>\n    </tr>\n  </tbody>\n</table>\n</div>",
       "datasetInfos": [],
       "metadata": {},
       "removedWidgets": [],
       "textData": null,
       "type": "htmlSandbox"
      }
     },
     "output_type": "display_data"
    }
   ],
   "source": [
    "mlflow_pyfunc_model_path_xgb = f\"runs:/{run.info.run_id}/xgb_regressor\"\n",
    "loaded_preprocess_model_xgb = mlflow.pyfunc.load_model(mlflow_pyfunc_model_path_xgb)\n",
    "#\n",
    "y_pred_xgb = loaded_preprocess_model_xgb.predict(X_test)\n",
    "#\n",
    "pd.DataFrame({'actual prices': list(y_test), 'predictions': list(y_pred_xgb)}).head(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "application/vnd.databricks.v1+cell": {
     "cellMetadata": {},
     "inputWidgets": {},
     "nuid": "612c7afb-1c93-4bf7-94ec-90131208fc3f",
     "showTitle": false,
     "title": ""
    }
   },
   "source": [
    "<p>Let's score RMSE for this model:</p>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "application/vnd.databricks.v1+cell": {
     "cellMetadata": {
      "byteLimit": 2048000,
      "rowLimit": 10000
     },
     "inputWidgets": {},
     "nuid": "4d58d9dd-4ba0-46d9-8330-074a6b787635",
     "showTitle": false,
     "title": ""
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "RMSE for custom xgboost model:  1457.7130185941312\n"
     ]
    }
   ],
   "source": [
    "print(\"RMSE for custom xgboost model: \", mean_squared_error(y_test, y_pred_xgb, squared=False))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "application/vnd.databricks.v1+cell": {
     "cellMetadata": {},
     "inputWidgets": {},
     "nuid": "1dcd5a9c-4a94-4bcb-839c-4d46906250cd",
     "showTitle": false,
     "title": ""
    }
   },
   "source": [
    "<p>It's also possible to load the custom model as a <b>Spark UDF</b> using <code>mlflow.pyfunc.spark_udf</code> and predict in a Spark dataframe:</p>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "application/vnd.databricks.v1+cell": {
     "cellMetadata": {
      "byteLimit": 2048000,
      "rowLimit": 10000
     },
     "inputWidgets": {},
     "nuid": "080cb52f-0329-419b-886f-af8cb911f734",
     "showTitle": false,
     "title": ""
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<style scoped>\n",
       "  .table-result-container {\n",
       "    max-height: 300px;\n",
       "    overflow: auto;\n",
       "  }\n",
       "  table, th, td {\n",
       "    border: 1px solid black;\n",
       "    border-collapse: collapse;\n",
       "  }\n",
       "  th, td {\n",
       "    padding: 5px;\n",
       "  }\n",
       "  th {\n",
       "    text-align: left;\n",
       "  }\n",
       "</style><div class='table-result-container'><table class='table-result'><thead style='background-color: white'><tr><th>carat</th><th>depth</th><th>table</th><th>x</th><th>y</th><th>z</th><th>prediction</th></tr></thead><tbody><tr><td>0.24</td><td>62.1</td><td>56.0</td><td>3.97</td><td>4.0</td><td>2.47</td><td>818.0107421875</td></tr><tr><td>0.58</td><td>60.0</td><td>57.0</td><td>5.44</td><td>5.42</td><td>3.26</td><td>2764.87451171875</td></tr><tr><td>0.4</td><td>62.1</td><td>55.0</td><td>4.76</td><td>4.74</td><td>2.95</td><td>1567.1573486328125</td></tr><tr><td>0.43</td><td>60.8</td><td>57.0</td><td>4.92</td><td>4.89</td><td>2.98</td><td>1833.7303466796875</td></tr><tr><td>1.55</td><td>62.3</td><td>55.0</td><td>7.44</td><td>7.37</td><td>4.61</td><td>11748.9375</td></tr></tbody></table></div>"
      ]
     },
     "metadata": {
      "application/vnd.databricks.v1+output": {
       "addedWidgets": {},
       "aggData": [],
       "aggError": "",
       "aggOverflow": false,
       "aggSchema": [],
       "aggSeriesLimitReached": false,
       "aggType": "",
       "arguments": {},
       "columnCustomDisplayInfos": {},
       "data": [
        [
         0.24,
         62.1,
         56,
         3.97,
         4,
         2.47,
         818.0107421875
        ],
        [
         0.58,
         60,
         57,
         5.44,
         5.42,
         3.26,
         2764.87451171875
        ],
        [
         0.4,
         62.1,
         55,
         4.76,
         4.74,
         2.95,
         1567.1573486328125
        ],
        [
         0.43,
         60.8,
         57,
         4.92,
         4.89,
         2.98,
         1833.7303466796875
        ],
        [
         1.55,
         62.3,
         55,
         7.44,
         7.37,
         4.61,
         11748.9375
        ]
       ],
       "datasetInfos": [],
       "dbfsResultPath": null,
       "isJsonSchema": true,
       "metadata": {},
       "overflow": false,
       "plotOptions": {
        "customPlotOptions": {},
        "displayType": "table",
        "pivotAggregation": null,
        "pivotColumns": null,
        "xColumns": null,
        "yColumns": null
       },
       "removedWidgets": [],
       "schema": [
        {
         "metadata": "{}",
         "name": "carat",
         "type": "\"double\""
        },
        {
         "metadata": "{}",
         "name": "depth",
         "type": "\"double\""
        },
        {
         "metadata": "{}",
         "name": "table",
         "type": "\"double\""
        },
        {
         "metadata": "{}",
         "name": "x",
         "type": "\"double\""
        },
        {
         "metadata": "{}",
         "name": "y",
         "type": "\"double\""
        },
        {
         "metadata": "{}",
         "name": "z",
         "type": "\"double\""
        },
        {
         "metadata": "{}",
         "name": "prediction",
         "type": "\"double\""
        }
       ],
       "type": "table"
      }
     },
     "output_type": "display_data"
    }
   ],
   "source": [
    "xgboost_custom_predict = mlflow.pyfunc.spark_udf(spark, mlflow_pyfunc_model_path_xgb)\n",
    "#\n",
    "display(spark.createDataFrame(X_test).withColumn('prediction', xgboost_custom_predict(*['carat', 'depth', 'table', 'x', 'y', 'z'])).limit(5))"
   ]
  }
 ],
 "metadata": {
  "application/vnd.databricks.v1+notebook": {
   "dashboards": [],
   "language": "python",
   "notebookMetadata": {
    "mostRecentlyExecutedCommandWithImplicitDF": {
     "commandId": -1,
     "dataframes": [
      "_sqldf"
     ]
    },
    "pythonIndentUnit": 2
   },
   "notebookName": "Databricks-ML-professional-S02a-Preprocessing-Logic",
   "widgets": {}
  },
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}