Databricks-ML-professional-S04b-Drift-Tests-and-Monitoring
This Notebook adds information related to the following requirements:
Drift Tests and Monitoring:
- Describe summary statistic monitoring as a simple solution for numeric feature drift
- Describe mode, unique values, and missing values as simple solutions for categorical feature drift
- Describe tests as more robust monitoring solutions for numeric feature drift than simple summary statistics
- Describe tests as more robust monitoring solutions for categorical feature drift than simple summary statistics
- Compare and contrast Jensen-Shannon divergence and Kolmogorov-Smirnov tests for numerical drift detection
- Identify a scenario in which a chi-square test would be useful
Summary statistic monitoring is a straightforward approach to detect numeric feature drift. The idea is to calculate summary statistics (such as mean, standard deviation, minimum, maximum, etc.) for each numeric feature in the training data and then compare these statistics with the summary statistics of the incoming data in the production environment. Deviations from the expected summary statistics can indicate feature drift.
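The comparison described above can be sketched in a few lines. This is a minimal illustration using only the Python standard library; the function names and the 20% relative tolerance `rel_tol` are assumptions chosen for the example, not values prescribed by any monitoring tool.

```python
import statistics

def summarize(values):
    """Summary statistics for one numeric feature."""
    return {
        "mean": statistics.fmean(values),
        "stdev": statistics.stdev(values),
        "min": min(values),
        "max": max(values),
    }

def drifted_stats(reference, current, rel_tol=0.2):
    """Return the statistics whose relative change exceeds rel_tol.

    rel_tol is a hypothetical threshold; tune it per feature.
    """
    ref, cur = summarize(reference), summarize(current)
    flagged = {}
    for name, ref_value in ref.items():
        scale = abs(ref_value) or 1.0  # avoid dividing by zero
        if abs(cur[name] - ref_value) / scale > rel_tol:
            flagged[name] = (ref_value, cur[name])
    return flagged

# Illustrative values only
train = [10.0, 12.0, 11.0, 13.0, 12.0]
prod_ok = [11.0, 12.0, 10.0, 13.0, 12.0]   # same values, different order
prod_shifted = [v * 2 for v in train]      # every value doubled

alerts_ok = drifted_stats(train, prod_ok)
alerts_shifted = drifted_stats(train, prod_shifted)
```

Here `alerts_ok` is empty, while `alerts_shifted` flags all four statistics. A fixed tolerance is easy to reason about, but it cannot tell a meaningful change from sampling noise, which is the gap that statistical tests fill.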
- Mode Monitoring:
  - Definition: The mode of a categorical feature is the value that appears most frequently.
  - Implementation: During the training phase, identify the mode of each categorical feature in the training dataset. In the production environment, monitor the mode of each categorical feature in the incoming data. If the mode changes, the distribution of categories may have shifted, suggesting possible drift.
- Unique Values Monitoring:
  - Definition: The set of unique values in a categorical feature represents the different categories present.
  - Implementation: Calculate the unique values of each categorical feature in the training data. Continuously monitor the unique values of each categorical feature in the production data. If new, unexpected categories appear or if existing categories disappear, it may indicate drift.
- Missing Values Monitoring:
  - Definition: The missing-value rate is the share of records in which a feature has no value. Changes in this rate can also indicate drift.
  - Implementation: Record the percentage of missing values for each categorical feature during training. Monitor the percentage of missing values for each categorical feature in the production data. Drift may be present if the rate of missing values changes significantly.
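The three checks above can be combined into one small profile comparison. This is a sketch under stated assumptions: missing values are represented as `None`, and the function names and the `missing_tol` threshold of 5 percentage points are hypothetical choices for illustration.

```python
from collections import Counter

def categorical_profile(values):
    """Mode, unique categories, and missing-value rate of one feature."""
    present = [v for v in values if v is not None]
    counts = Counter(present)
    return {
        "mode": counts.most_common(1)[0][0] if counts else None,
        "unique": set(counts),
        "missing_rate": (len(values) - len(present)) / len(values),
    }

def categorical_drift(reference, current, missing_tol=0.05):
    """Compare a production profile against the training profile."""
    ref = categorical_profile(reference)
    cur = categorical_profile(current)
    return {
        "mode_changed": ref["mode"] != cur["mode"],
        "new_categories": cur["unique"] - ref["unique"],
        "dropped_categories": ref["unique"] - cur["unique"],
        # missing_tol is an assumed threshold on the absolute rate change
        "missing_rate_changed":
            abs(cur["missing_rate"] - ref["missing_rate"]) > missing_tol,
    }

# Illustrative values only
train = ["a"] * 6 + ["b"] * 3 + [None]
prod = ["b"] * 5 + ["c"] * 2 + [None] * 3

report = categorical_drift(train, prod)
```

In this example the mode moves from `a` to `b`, category `c` is new, category `a` has disappeared, and the missing-value rate rises from 10% to 30%, so every check fires.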
Statistical tests are more robust than simple summary statistics for monitoring numeric feature drift. Two samples can share the same mean and standard deviation and still have different shapes, and a test provides a formal way to judge whether a difference between distributions is statistically significant. The Kolmogorov-Smirnov test compares the empirical cumulative distribution functions (CDFs) of the training and production data through their largest gap. The Cramér-von Mises test aggregates the squared difference between the CDFs over the whole range, and the Anderson-Darling test does the same while giving more weight to the tails. Because these tests consider the entire distribution, they can catch shifts that summary statistics miss, giving a systematic signal for when a model in production needs to be investigated or retrained.
Statistical tests are likewise more robust than simple summary statistics for monitoring categorical feature drift. Rather than relying solely on the mode or the set of unique values, a test compares the full set of category frequencies. The chi-squared test compares the category counts observed in production with the counts expected from the training distribution and indicates whether the deviations are larger than chance would explain. It is particularly valuable for features with many categories. Another option is the G-test, a likelihood-ratio alternative that is closely related to the chi-squared test. When expected counts are small, both tests rely on an approximation that becomes unreliable, and an exact test such as Fisher's exact test is usually preferred. Employing these tests makes it possible to systematically identify shifts in categorical feature distributions between training and production data, for more reliable drift detection in real-world scenarios.
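As a rough sketch of the chi-squared approach, the example below computes a goodness-of-fit statistic for one categorical feature. It assumes the training proportions can be treated as the fixed expected distribution, and it compares the statistic with standard 5% critical values instead of computing a p-value, to stay within the standard library. The function name is hypothetical. In practice a library routine such as `scipy.stats.chisquare` returns the p-value directly.

```python
from collections import Counter

# Chi-squared critical values at alpha = 0.05, by degrees of freedom
CHI2_CRITICAL_05 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488, 5: 11.070}

def chi_square_gof(reference, current):
    """Chi-squared statistic of production counts vs. training proportions."""
    ref_counts = Counter(reference)
    cur_counts = Counter(current)
    unseen = set(cur_counts) - set(ref_counts)
    if unseen:
        # Expected count would be zero; the unique-values check covers this.
        raise ValueError(f"categories not seen in training: {unseen}")
    n_ref, n_cur = len(reference), len(current)
    stat = 0.0
    for category, ref_count in ref_counts.items():
        expected = n_cur * ref_count / n_ref
        observed = cur_counts.get(category, 0)
        stat += (observed - expected) ** 2 / expected
    dof = len(ref_counts) - 1
    return stat, dof

# Illustrative values only
train = ["a"] * 50 + ["b"] * 30 + ["c"] * 20
prod_same = ["a"] * 25 + ["b"] * 15 + ["c"] * 10     # same proportions
prod_shifted = ["a"] * 10 + ["b"] * 15 + ["c"] * 25  # proportions changed

stat_same, dof = chi_square_gof(train, prod_same)
stat_shifted, _ = chi_square_gof(train, prod_shifted)
drift_detected = stat_shifted > CHI2_CRITICAL_05[dof]
```

With three categories there are two degrees of freedom, so a statistic above 5.991 rejects the hypothesis of unchanged proportions at the 5% level. The shifted sample gives a statistic of 31.5.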
The Jensen-Shannon divergence (JSD) and the Kolmogorov-Smirnov (KS) test are both methods for detecting numerical drift, but they operate on different principles.
Jensen-Shannon Divergence (JSD): JSD measures the similarity between two probability distributions. It is a symmetric, smoothed variant of the Kullback-Leibler divergence in which each distribution is compared with the average of the two. In drift detection, JSD can quantify the difference between the distributions of a numeric feature in training and production data. It considers the entire distribution, providing a comprehensive analysis. However, a continuous feature must first be binned (or its density estimated), so the result depends on the binning choice, and outliers that stretch the bin range can affect it.
Jensen-Shannon (JS) distance, the square root of the JS divergence, is more appropriate for drift detection on a large dataset since it measures the distance between two probability distributions and is smoothed and normalized. When log base 2 is used for the distance calculation, the JS statistic is bounded between 0 and 1:
- 0 means the distributions are identical
- 1 means the distributions have no similarity
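The bounds above can be checked with a small sketch. It assumes equal-width bins over a shared range, which is only one of several possible binning choices, and the helper names are hypothetical. The computation itself follows the standard definition, and `scipy.spatial.distance.jensenshannon(p, q, base=2)` returns the same distance.

```python
import math

def to_probabilities(values, lo, hi, bins):
    """Equal-width histogram over [lo, hi], normalized to sum to 1."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        idx = min(int((v - lo) / width), bins - 1)  # clamp the top edge
        counts[idx] += 1
    return [c / len(values) for c in counts]

def js_distance(p, q):
    """Jensen-Shannon distance with log base 2 (bounded in [0, 1])."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]

    def kl(a, b):
        # Terms with a_i == 0 contribute nothing to the KL divergence.
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    divergence = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    return math.sqrt(max(divergence, 0.0))

# Illustrative values only
train = list(range(100))
prod_far = [x + 1000 for x in train]  # no overlap with training values

lo, hi = min(train + prod_far), max(train + prod_far)
p_train = to_probabilities(train, lo, hi, bins=10)
p_far = to_probabilities(prod_far, lo, hi, bins=10)

js_same = js_distance(p_train, p_train)
js_far = js_distance(p_train, p_far)
```

Identical histograms give a distance of 0 and histograms with no shared bins give a distance of 1. Unlike a test, JS distance yields no p-value, so the alerting threshold has to be chosen by the practitioner.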
Kolmogorov-Smirnov Test (KS): The KS test assesses the similarity of two cumulative distribution functions (CDFs) and is sensitive to differences anywhere in the distribution. Its statistic is the maximum vertical distance between the two empirical CDFs, which makes it a simple, non-parametric measure that requires no binning. KS is less affected by outliers but is influenced by sample size: with very large samples, even tiny and practically irrelevant differences produce small p-values.
This test determines whether or not two different samples come from the same distribution.
- Returns a higher KS statistic the larger the gap between the two empirical CDFs, that is, the stronger the evidence that the samples come from different distributions
- Returns a lower p-value the stronger the statistical significance of that difference
- In practice, we need a threshold for the p-value below which we consider it unlikely enough that the samples came from the same distribution. Usually this threshold, or alpha level, is 0.05.
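To make the statistic and the alpha threshold concrete, here is a pure-Python sketch of the two-sample KS test. The p-value uses the asymptotic Kolmogorov series, which is an approximation that is adequate for reasonably large samples. The function names are hypothetical. In practice `scipy.stats.ks_2samp` performs the same test and handles small samples more carefully.

```python
import math

def ks_statistic(sample_a, sample_b):
    """Largest vertical gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    n, m = len(a), len(b)
    i = j = 0
    d = 0.0
    while i < n and j < m:
        x = min(a[i], b[j])
        # Advance both pointers past every value <= x (handles ties).
        while i < n and a[i] <= x:
            i += 1
        while j < m and b[j] <= x:
            j += 1
        d = max(d, abs(i / n - j / m))
    return d

def ks_pvalue_asymptotic(d, n, m, terms=100):
    """Approximate two-sided p-value from the Kolmogorov distribution."""
    lam = math.sqrt(n * m / (n + m)) * d
    if lam < 0.2:
        return 1.0  # the series converges slowly here; p is close to 1
    total = sum((-1) ** (k - 1) * math.exp(-2 * k * k * lam * lam)
                for k in range(1, terms + 1))
    return min(1.0, max(0.0, 2 * total))

ALPHA = 0.05

# Illustrative values only
train = list(range(100))
prod_same = list(range(100))
prod_shifted = [x + 50 for x in train]  # shifted by half the range

d_same = ks_statistic(train, prod_same)
d_shifted = ks_statistic(train, prod_shifted)
p_shifted = ks_pvalue_asymptotic(d_shifted, len(train), len(prod_shifted))
drift_detected = p_shifted < ALPHA
```

The identical samples give a statistic of 0, while the shifted sample gives a statistic of 0.5 and a p-value far below 0.05, so the hypothesis that both samples come from the same distribution is rejected.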
In summary, JSD produces a bounded, symmetric distance score that suits large datasets but depends on how numeric values are binned, while KS is a non-parametric test that yields a p-value and is robust against outliers, but becomes overly sensitive as sample size grows. The choice depends on the specific characteristics of the data and the desired balance between sensitivity and robustness.
A chi-square test would be useful in scenarios involving categorical data and the need to assess the independence or association between two categorical variables. One prominent example is in medical research when investigating the relationship between smoking status (categories: smoker, non-smoker) and the incidence of a specific health outcome (categories: presence, absence).
Consider a clinical study aiming to understand whether there is a significant association between smoking habits and the development of a particular respiratory condition. Researchers collect data on a sample of individuals, categorizing them based on smoking status and the presence or absence of the respiratory condition. By applying the chi-square test, they can analyze the observed and expected frequencies in a contingency table, determining whether any observed associations are statistically significant or if they could have occurred by chance.
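The chi-square test is a valuable statistical tool in such scenarios. It helps researchers decide whether two categorical variables are independent, but it establishes association rather than causation, so any causal interpretation needs further support from the study design. Its findings can still inform public health strategies or medical interventions. In drift monitoring the same logic applies, with the dataset (training or production) as one variable and the feature's category as the other.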
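The contingency-table calculation for a study like this can be sketched as follows. The counts are made up purely for illustration and do not come from any real study, and the function names are hypothetical. The p-value shortcut is valid only for one degree of freedom (a 2x2 table), and no continuity correction is applied, so the result differs slightly from `scipy.stats.chi2_contingency`, which applies Yates' correction to 2x2 tables by default.

```python
import math

def chi_square_independence(table):
    """Chi-squared statistic and degrees of freedom for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the two variables were independent
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof

def p_value_one_dof(stat):
    """Upper-tail probability of a chi-squared variable with 1 dof."""
    return math.erfc(math.sqrt(stat / 2))

# Hypothetical counts: rows = smoker, non-smoker;
# columns = condition present, condition absent
observed = [
    [30, 70],
    [10, 90],
]

stat, dof = chi_square_independence(observed)
p_value = p_value_one_dof(stat)
associated = p_value < 0.05
```

For these made-up counts the expected values under independence are 20 and 80 in each row, the statistic is 12.5, and the p-value is well below 0.05, so independence between smoking status and the condition would be rejected.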