{ "cells": [ { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "58fab4bb-231e-48cf-8ed4-fc15a1b22845", "showTitle": false, "title": "" } }, "source": [ "
This notebook adds information related to the following requirements:
\n", "Summary statistic monitoring is a straightforward approach to detect numeric feature drift. The idea is to calculate summary statistics (such as mean, standard deviation, minimum, maximum, etc.) for each numeric feature in the training data and then compare these statistics with the summary statistics of the incoming data in the production environment. Deviations from the expected summary statistics can indicate feature drift.
" ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "18e681ce-93ed-4c38-814e-6d851bb56281", "showTitle": false, "title": "" } }, "source": [ "\n", "Tests offer more robust solutions for monitoring numeric feature drift than simple summary statistics. Instead of relying solely on mean or standard deviation, statistical tests provide a formalized way to assess the significance of differences in feature distributions. For instance, the Kolmogorov-Smirnov test or the Anderson-Darling test can compare the cumulative distribution functions of training and production data. These tests consider the entire distribution, making them sensitive to subtle shifts. Additionally, the Cramér-von Mises test can evaluate differences in distribution shapes, offering a more nuanced analysis. Implementing these tests allows for a systematic and statistical approach to detect numeric feature drift, enhancing the model's adaptability to evolving data patterns in a production environment.
" ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": {}, "inputWidgets": {}, "nuid": "f5bc2486-22b4-4463-a8f7-9cacc347db73", "showTitle": false, "title": "" } }, "source": [ "\n", "Tests provide robust solutions for monitoring categorical feature drift compared to simple summary statistics. Rather than relying solely on mode or unique values, statistical tests offer a more formalized approach. For instance, the chi-squared test assesses the independence of observed and expected categorical distributions, indicating if there are significant deviations. This test is particularly valuable when dealing with multiple categories. Another option is the G-test, which is an extension of the chi-squared test and is suitable for smaller sample sizes. By employing these statistical tests, it becomes possible to systematically identify shifts in categorical feature distributions between training and production data, allowing for a more nuanced and reliable detection of drift in real-world scenarios.
" ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": {}, "inputWidgets": {}, "nuid": "f958033b-36f3-41c4-b525-7eeb4e168a42", "showTitle": false, "title": "" } }, "source": [ "\n", "The Jenson-Shannon Divergence (JSD) and Kolmogorov-Smirnov (KS) test are both methods for detecting numerical drift, but they operate on different principles.
\n", "Jenson-Shannon Divergence (JSD):\n", "JSD measures the similarity between two probability distributions by computing the divergence between their probability mass functions. In drift detection, JSD can quantify the difference in probability distributions of numeric features between training and production data. It considers the entire distribution, providing a comprehensive analysis. However, it requires a smooth distribution and may be sensitive to outliers.
\n", "\n", "Jensen Shannon (JS) distance is more appropriate for drift detection on a large dataset since it meaures the distance between two probability distributions and it is smoothed and normalized. When log base 2 is used for the distance calculation, the JS statistic is bounded between 0 and 1:\n", "\n", "
Kolmogorov-Smirnov Test (KS):\n", "The KS test assesses the similarity of two cumulative distribution functions (CDFs) and is sensitive to differences anywhere in the distribution. It calculates the maximum vertical distance between the CDFs, providing a simple and non-parametric measure. KS is less affected by outliers but might be influenced by sample size.
This test determines whether two samples come from the same distribution.
\n", "In summary, JSD is more comprehensive and suitable for smooth distributions, while KS is robust, especially against outliers, but its sensitivity to sample size should be considered. The choice depends on the specific characteristics of the data and the desired balance between sensitivity and robustness.
" ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": {}, "inputWidgets": {}, "nuid": "78c6589c-d9e1-4dcd-b564-0fa83dd87d3b", "showTitle": false, "title": "" } }, "source": [ "\n", "A chi-square test would be useful in scenarios involving categorical data and the need to assess the independence or association between two categorical variables. One prominent example is in medical research when investigating the relationship between smoking status (categories: smoker, non-smoker) and the incidence of a specific health outcome (categories: presence, absence).
\n", "\n", "Consider a clinical study aiming to understand whether there is a significant association between smoking habits and the development of a particular respiratory condition. Researchers collect data on a sample of individuals, categorizing them based on smoking status and the presence or absence of the respiratory condition. By applying the chi-square test, they can analyze the observed and expected frequencies in a contingency table, determining whether any observed associations are statistically significant or if they could have occurred by chance.
\n", "\n", "The chi-square test provides a valuable statistical tool in such scenarios, helping researchers draw conclusions about the independence of variables and contributing insights into potential causal relationships, ultimately informing public health strategies or medical interventions.
" ] } ], "metadata": { "application/vnd.databricks.v1+notebook": { "dashboards": [], "language": "python", "notebookMetadata": { "mostRecentlyExecutedCommandWithImplicitDF": { "commandId": 1158789969180638, "dataframes": [ "_sqldf" ] }, "pythonIndentUnit": 2 }, "notebookName": "Databricks-ML-professional-S04b-Drift-Tests-and-Monitoring", "widgets": {} }, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.10" } }, "nbformat": 4, "nbformat_minor": 4 }