{ "cells": [ { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "58fab4bb-231e-48cf-8ed4-fc15a1b22845", "showTitle": false, "title": "" } }, "source": [ "
This notebook adds information related to the following requirements:
Download this notebook in ipynb format here.
total_bill | tip | sex | smoker | day | time | size |
---|---|---|---|---|---|---|
16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
21.01 | 3.5 | Male | No | Sun | Dinner | 3 |
23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
24.59 | 3.61 | Female | No | Sun | Dinner | 4 |
Some transformations are applied to prepare the dataset for training an ML model.
column name | comment |
---|---|
tip | Target to predict. Contains numeric values |
total_bill | Numeric column to keep as is |
sex | Contains Female and Male, converted to 0 and 1 |
smoker | Contains Yes and No, converted to 0 and 1 |
time | Contains Dinner and Lunch, converted to 0 and 1 |
day | Categorical column to one-hot encode |
size | Categorical column to one-hot encode |
 | total_bill | tip | sex | smoker | time | day | size |
---|---|---|---|---|---|---|---|
0 | 8.77 | 2.00 | 0 | 0 | 1 | Sun | 2 |
1 | 9.55 | 1.45 | 0 | 0 | 1 | Sat | 2 |
2 | 9.94 | 1.56 | 0 | 0 | 1 | Sun | 2 |
3 | 10.27 | 1.71 | 0 | 0 | 1 | Sun | 2 |
4 | 10.29 | 2.60 | 1 | 0 | 1 | Sun | 2 |
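
The transformations described above can be sketched in plain Python. This is only an illustration on two rows taken from the raw sample: the notebook's own pipeline relies on Spark's StringIndexer and OneHotEncoder, whereas the helper encode_rows, the BINARY_MAPS dictionary and the choice of which label becomes 0 or 1 are assumptions made for this example. Unlike Spark's encoder with dropLast enabled, this sketch keeps one indicator per category.

```python
# Plain-Python sketch of the column transformations (not the Spark pipeline).
# ASSUMPTION: which label maps to 0 and which to 1 is illustrative only.
BINARY_MAPS = {
    'sex': {'Male': 0, 'Female': 1},
    'smoker': {'No': 0, 'Yes': 1},
    'time': {'Lunch': 0, 'Dinner': 1},
}
ONE_HOT_COLUMNS = ['day', 'size']


def encode_rows(rows):
    '''Binary-encode sex/smoker/time and one-hot encode day/size.'''
    # Collect the distinct categories of each one-hot column, in sorted order.
    categories = {
        col: sorted({row[col] for row in rows}) for col in ONE_HOT_COLUMNS
    }
    encoded = []
    for row in rows:
        # Numeric columns are kept as is.
        out = {'tip': row['tip'], 'total_bill': row['total_bill']}
        for col, mapping in BINARY_MAPS.items():
            out[col] = mapping[row[col]]
        # One indicator column per observed category value.
        for col, values in categories.items():
            for value in values:
                out[f'{col}_{value}'] = int(row[col] == value)
        encoded.append(out)
    return encoded


sample = [
    {'total_bill': 16.99, 'tip': 1.01, 'sex': 'Female', 'smoker': 'No',
     'day': 'Sun', 'time': 'Dinner', 'size': 2},
    {'total_bill': 10.34, 'tip': 1.66, 'sex': 'Male', 'smoker': 'No',
     'day': 'Sun', 'time': 'Dinner', 'size': 3},
]
encoded_sample = encode_rows(sample)
print(encoded_sample[0])
```
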
MLflow supports nested runs, so each model can be logged as a child run of a single parent run:
\n", "" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "e5839d28-4117-400d-9a8c-d7fa5fbd0665", "showTitle": false, "title": "" } }, "outputs": [], "source": [ "with mlflow.start_run(run_name=\"tips_evaluation\") as run_parent:\n", " #\n", " # loop on the three regression models\n", " for regression_model in [glr, lrm, fmr]:\n", " #\n", " # get model name\n", " model_name = regression_model.__str__().split(\"_\")[0]\n", " #\n", " # Nest mlflow logging\n", " with mlflow.start_run(run_name=model_name, nested=True) as run:\n", " #\n", " # define pipeline stages according to model\n", " stages = [string_indexer, ohe, vec_assembler, regression_model]\n", " #\n", " # set pipeline\n", " pipeline = Pipeline(stages=stages)\n", " #\n", " # fit pipeline to train set\n", " model = pipeline.fit(train_df)\n", " #\n", " # log model to mlflow\n", " mlflow.spark.log_model(model, model_name, signature=signature, input_example=input_example)\n", " #\n", " # predict test set\n", " pred_df = model.transform(test_df)\n", " #\n", " # evaluate prediction\n", " rmse = evaluator.evaluate(pred_df)\n", " #\n", " # log evaluation to mlflow\n", " mlflow.log_metric(\"rmse\", rmse)" ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "f04a8cf6-a501-4e11-a7af-66b9b9bd6744", "showTitle": false, "title": "" } }, "source": [ "autolog()
and train a simple model. This automatically logs everything it can for each library in use.
Now let's fit and evaluate a model:
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "2e9b4b6e-18be-4c01-ac94-ddfb7263b97b", "showTitle": false, "title": "" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Out[12]: 1.3054572176798678" ] } ], "source": [ "# fit pipeline to train set\n", "model_lrm_autolog = pipeline.fit(train_df)\n", "#\n", "# predict test set\n", "pred_df = model_lrm_autolog.transform(test_df)\n", "#\n", "# evaluate\n", "evaluator.evaluate(pred_df)" ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "431483d6-48ce-4392-87c2-95dabcfd87c8", "showTitle": false, "title": "" } }, "source": [ "After that, in MLflow UI, we can see the many parameters that have been logged.
\n", "Alternatively, we can retrieve and inspect the logged parameters for the latest run programmatically:
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "eb984d77-72af-4a8b-8ae7-14188488963a", "showTitle": false, "title": "" } }, "outputs": [ { "data": { "text/html": [ "run_id | experiment_id | status | artifact_uri | start_time | end_time | metrics.rmse_test_df | metrics.rmse | metrics.rmse_unknown_dataset | params.FMRegressor.fitIntercept | params.FMRegressor.maxIter | params.OneHotEncoder.outputCol | params.OneHotEncoder.outputCols | params.stages | params.StringIndexer.stringOrderType | params.StringIndexer.inputCols | params.FMRegressor.tol | params.StringIndexer.outputCols | params.StringIndexer.handleInvalid | params.FMRegressor.solver | params.FMRegressor.factorSize | params.OneHotEncoder.handleInvalid | params.OneHotEncoder.inputCols | params.StringIndexer.outputCol | params.FMRegressor.fitLinear | params.FMRegressor.miniBatchFraction | params.VectorAssembler.handleInvalid | params.VectorAssembler.inputCols | params.FMRegressor.predictionCol | params.FMRegressor.regParam | params.FMRegressor.labelCol | params.FMRegressor.featuresCol | params.FMRegressor.initStd | params.FMRegressor.stepSize | params.OneHotEncoder.dropLast | params.FMRegressor.seed | params.VectorAssembler.outputCol | params.maxIter | params.LinearRegression.maxIter | params.LinearRegression.standardization | params.LinearRegression.tol | params.LinearRegression.solver | params.LinearRegression.elasticNetParam | params.LinearRegression.maxBlockSizeInMB | params.LinearRegression.featuresCol | params.LinearRegression.labelCol | params.LinearRegression.fitIntercept | params.LinearRegression.aggregationDepth | params.LinearRegression.loss | params.LinearRegression.predictionCol | params.LinearRegression.epsilon | params.LinearRegression.regParam | tags.mlflow.databricks.cluster.id | tags.mlflow.databricks.cluster.libraries.error | 
tags.mlflow.databricks.notebookRevisionID | tags.mlflow.databricks.workspaceID | tags.mlflow.databricks.notebook.commandID | tags.mlflow.source.type | tags.mlflow.databricks.webappURL | tags.mlflow.runName | tags.estimator_class | tags.mlflow.autologging | tags.mlflow.databricks.notebookID | tags.estimator_name | tags.mlflow.parentRunId | tags.mlflow.rootRunId | tags.mlflow.log-model.history |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
08b972964438470595f2ba9ba6aa9d40 | 3541968995997190 | FINISHED | dbfs:/databricks/mlflow-tracking/3541968995997190/08b972964438470595f2ba9ba6aa9d40/artifacts | 2023-11-22T16:55:45.974+0000 | 2023-11-22T16:55:56.280+0000 | 1.3054572176798678 | null | null | True | 100 | OneHotEncoder_c95dbc53b9cc__output | ['size_ohe', 'day_ohe'] | ['StringIndexer', 'OneHotEncoder', 'VectorAssembler', 'FMRegressor'] | frequencyDesc | ['size', 'day'] | 1e-06 | ['size_index', 'day_index'] | skip | adamW | 8 | error | ['size_index', 'day_index'] | StringIndexer_c3caebc64717__output | True | 1.0 | error | ['size_ohe', 'day_ohe', 'total_bill', 'sex', 'smoker', 'time'] | prediction | 0.0 | tip | features | 0.01 | 0.001 | True | -2921654334123211668 | features | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | 1027-081006-5cgi5kuh | This message class grpc_shaded.com.databricks.api.proto.managedLibraries.ClusterStatus DID NOT match any methods in the stub class grpc_shaded.com.databricks.api.proto.cluster.ClusterServiceGrpc$ClusterServiceBlockingStub | 1700672156611 | 3607579860940718 | 7308506017976005609_4719503818729863905_49d8e6f9a405484da1266fc91cafd976 | NOTEBOOK | https://eastus-c3.azuredatabricks.net | colorful-snake-723 | pyspark.ml.pipeline.Pipeline | pyspark.ml | 3541968995997190 | Pipeline | null | null | null |
autolog()
to log everything.
HyperOpt:
\n", "Let's define the hyperparameter search spaces:
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "8edfe198-1d5f-46fc-af5e-e9cd10bcc14f", "showTitle": false, "title": "" } }, "outputs": [], "source": [ "search_spaces = {\"maxIter\": hp.quniform(\"maxIter\", 1, 100, 1),\n", " \"regParam\": hp.uniform(\"regParam\", 0.1, 10),\n", " \"elasticNetParam\": hp.uniform(\"elasticNetParam\", 0, 1)}" ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "e1a9966d-c2da-46bb-a251-06213916ec80", "showTitle": false, "title": "" } }, "source": [ "Finally let's run the hyperparameter tuning with HyperOpt:
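\n", "Because we are using a model from MLlib, we pass an instance of the Trials class as the trials parameter of the fmin function. MLlib training is already distributed across the cluster, which is why Trials is used here rather than SparkTrials.
See also this page or this video to learn more about HyperOpt.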
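
To make the search space defined earlier more concrete, here is a plain-Python stand-in that draws candidates the way hp.quniform and hp.uniform are documented to behave. It is not HyperOpt itself (fmin chooses candidates with its own suggestion algorithm), and the helpers quniform and sample_candidate as well as the seed are hypothetical names introduced for this sketch. One practical point it illustrates: quniform yields floats, so maxIter has to be cast to int before it is handed to an estimator.

```python
import random


def quniform(rng, low, high, q):
    # Mirrors the documented hp.quniform rule: round(uniform(low, high) / q) * q
    return round(rng.uniform(low, high) / q) * q


def sample_candidate(rng):
    '''Draw one candidate from the same ranges as search_spaces.'''
    return {
        # Kept as a float on purpose, since HyperOpt returns a float here.
        'maxIter': float(quniform(rng, 1, 100, 1)),
        'regParam': rng.uniform(0.1, 10),
        'elasticNetParam': rng.uniform(0, 1),
    }


rng = random.Random(42)  # hypothetical seed, only for repeatability
candidates = [sample_candidate(rng) for _ in range(5)]
for candidate in candidates:
    # Cast before passing the value to an integer parameter such as maxIter.
    candidate['maxIter'] = int(candidate['maxIter'])
    print(candidate)
```

In an objective function passed to fmin, the same int cast would typically be applied to the sampled maxIter before building the estimator.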
" ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "3d035eb8-cc74-47d6-aa7d-469b39fcb013", "showTitle": false, "title": "" } }, "source": [ "\n", "Looks like logging SHAP - SHapley Additive exPlanations - works with scikit-learn. So let's quickly train a model with scikit-learn library. For simplicity, let's keep day
and time features out.
Here is an example of logging SHAP explanations to MLflow:
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "9574a53e-48ab-429d-9315-a730e3c45bf4", "showTitle": false, "title": "" } }, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "56a5641835084ae89614b0a460224285", "version_major": 2, "version_minor": 0 }, "text/plain": [ " 0%| | 0/81 [00:00, ?it/s]" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "with mlflow.start_run(run_name=\"shap_tips\"):\n", " mlflow.shap.log_explanation(fitted_rfr_model.predict, pd_df_X_test)" ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "c190cf02-6531-4f36-bb1c-396e69864287", "showTitle": false, "title": "" } }, "source": [ "Here is an example of logging figure to mlflow:
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "5e0431f5-6c8f-4c5a-a639-d60e8c780bbe", "showTitle": false, "title": "" } }, "outputs": [ { "data": { "image/png": "\n" }, "metadata": { "application/vnd.databricks.v1+output": { "addedWidgets": {}, "arguments": {}, "data": "\n", "datasetInfos": [], "metadata": {}, "removedWidgets": [], "type": "image" } }, "output_type": "display_data" } ], "source": [ "with mlflow.start_run(run_name=\"figure_tips\"):\n", " #\n", " # Generate feature importance plot thanks to feature_importances_ attribute of the RandomForestRegressor model\n", " feature_importances = pd.Series(fitted_rfr_model.feature_importances_, index=pd_df_X_train.columns)\n", " fig, ax = plt.subplots()\n", " feature_importances.plot.bar(ax=ax)\n", " ax.set_title(\"Feature importances using MDI\")\n", " ax.set_ylabel(\"Mean decrease in impurity\")\n", " #\n", " # Log figure to mlflow\n", " mlflow.log_figure(fig, \"feature_importance_rf.png\")" ] } ], "metadata": { "application/vnd.databricks.v1+notebook": { "dashboards": [], "language": "python", "notebookMetadata": { "mostRecentlyExecutedCommandWithImplicitDF": { "commandId": 1774797690553258, "dataframes": [ "_sqldf" ] }, "pythonIndentUnit": 2 }, "notebookName": "Databricks-ML-professional-S01c-Advanced-Experiment-Tracking", "widgets": {} }, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.10" } }, "nbformat": 4, "nbformat_minor": 4 }