{ "cells": [ { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "58fab4bb-231e-48cf-8ed4-fc15a1b22845", "showTitle": false, "title": "" } }, "source": [ "

Databricks-ML-professional-S01a-Data-management

" ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": {}, "inputWidgets": {}, "nuid": "0d1a80dc-78e8-4e48-b2eb-04848f9ee522", "showTitle": false, "title": "" } }, "source": [ "
\n", "
\n", "

This Notebook adds information related to the following requirements:


\n", "Data Management:\n", "\n", "
\n", "

Download this notebook at format ipynb here.

\n", "
\n", "
" ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "b5f6d0da-1d81-4fa0-9770-a9e4d6863534", "showTitle": false, "title": "" } }, "source": [ "
\n", "1. Import libraries
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "8a2d2e59-7426-4d5f-8d97-3dcff6e5151d", "showTitle": false, "title": "" } }, "outputs": [], "source": [ "import pandas as pd\n", "import seaborn as sns\n", "#\n", "from pyspark.sql.functions import *\n", "#\n", "from databricks.feature_store import FeatureStoreClient, feature_table" ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "aa08db2c-a856-4c86-81fe-9a8b7322cd6a", "showTitle": false, "title": "" } }, "source": [ "
\n", "2. Load dataset, convert to Spark DataFrame
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "5b64ff08-1603-4d0c-bc4e-19c0094c3b9c", "showTitle": false, "title": "" } }, "outputs": [ { "data": { "text/html": [ "
pickupdropoffpassengersdistancefaretiptollstotalcolorpaymentpickup_zonedropoff_zonepickup_boroughdropoff_borough
2019-03-23 20:21:092019-03-23 20:27:2411.67.02.150.012.95yellowcredit cardLenox Hill WestUN/Turtle Bay SouthManhattanManhattan
2019-03-04 16:11:552019-03-04 16:19:0010.795.00.00.09.3yellowcashUpper West Side SouthUpper West Side SouthManhattanManhattan
2019-03-27 17:53:012019-03-27 18:00:2511.377.52.360.014.16yellowcredit cardAlphabet CityWest VillageManhattanManhattan
2019-03-10 01:23:592019-03-10 01:49:5117.727.06.150.036.95yellowcredit cardHudson SqYorkville WestManhattanManhattan
2019-03-30 13:27:422019-03-30 13:37:1432.169.01.10.013.4yellowcredit cardMidtown EastYorkville WestManhattanManhattan
" ] }, "metadata": { "application/vnd.databricks.v1+output": { "addedWidgets": {}, "aggData": [], "aggError": "", "aggOverflow": false, "aggSchema": [], "aggSeriesLimitReached": false, "aggType": "", "arguments": {}, "columnCustomDisplayInfos": {}, "data": [ [ "2019-03-23 20:21:09", "2019-03-23 20:27:24", 1, 1.6, 7, 2.15, 0, 12.95, "yellow", "credit card", "Lenox Hill West", "UN/Turtle Bay South", "Manhattan", "Manhattan" ], [ "2019-03-04 16:11:55", "2019-03-04 16:19:00", 1, 0.79, 5, 0, 0, 9.3, "yellow", "cash", "Upper West Side South", "Upper West Side South", "Manhattan", "Manhattan" ], [ "2019-03-27 17:53:01", "2019-03-27 18:00:25", 1, 1.37, 7.5, 2.36, 0, 14.16, "yellow", "credit card", "Alphabet City", "West Village", "Manhattan", "Manhattan" ], [ "2019-03-10 01:23:59", "2019-03-10 01:49:51", 1, 7.7, 27, 6.15, 0, 36.95, "yellow", "credit card", "Hudson Sq", "Yorkville West", "Manhattan", "Manhattan" ], [ "2019-03-30 13:27:42", "2019-03-30 13:37:14", 3, 2.16, 9, 1.1, 0, 13.4, "yellow", "credit card", "Midtown East", "Yorkville West", "Manhattan", "Manhattan" ] ], "datasetInfos": [], "dbfsResultPath": null, "isJsonSchema": true, "metadata": {}, "overflow": false, "plotOptions": { "customPlotOptions": {}, "displayType": "table", "pivotAggregation": null, "pivotColumns": null, "xColumns": null, "yColumns": null }, "removedWidgets": [], "schema": [ { "metadata": "{}", "name": "pickup", "type": "\"string\"" }, { "metadata": "{}", "name": "dropoff", "type": "\"string\"" }, { "metadata": "{}", "name": "passengers", "type": "\"long\"" }, { "metadata": "{}", "name": "distance", "type": "\"double\"" }, { "metadata": "{}", "name": "fare", "type": "\"double\"" }, { "metadata": "{}", "name": "tip", "type": "\"double\"" }, { "metadata": "{}", "name": "tolls", "type": "\"double\"" }, { "metadata": "{}", "name": "total", "type": "\"double\"" }, { "metadata": "{}", "name": "color", "type": "\"string\"" }, { "metadata": "{}", "name": "payment", "type": "\"string\"" }, { "metadata": "{}", "name": "pickup_zone", "type": "\"string\"" }, { "metadata": "{}", "name": "dropoff_zone", "type": "\"string\"" }, { "metadata": "{}", "name": "pickup_borough", "type": "\"string\"" }, { "metadata": "{}", "name": "dropoff_borough", "type": "\"string\"" } ], "type": "table" } }, "output_type": "display_data" } ], "source": [ "taxis_df = sns.load_dataset(\"taxis\")\n", "#\n", "taxis_sdf = spark.createDataFrame(taxis_df)\n", "#\n", "display(taxis_sdf.limit(5))" ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "2b595b34-0633-4f66-9ca0-6067f4cc0716", "showTitle": false, "title": "" } }, "source": [ "\n", "
\n", "3. Read and write a Delta table
" ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": {}, "inputWidgets": {}, "nuid": "453316e6-0dc3-41b0-9730-27c39ed9bdf1", "showTitle": false, "title": "" } }, "source": [ "

Dataframes are saved by default as managed delta tables when the location is not specified:

" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "92c6fbbf-0a08-4fee-8ad7-abdf5a0f9ea4", "showTitle": false, "title": "" } }, "outputs": [], "source": [ "(taxis_sdf.write\n", " .mode(\"overwrite\")\n", " .option(\"overwriteSchema\", \"True\")\n", " .format(\"delta\")\n", " .saveAsTable(\"taxis_sdf\"))" ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": {}, "inputWidgets": {}, "nuid": "bd1a26b8-58ce-421a-82a1-420786c4b143", "showTitle": false, "title": "" } }, "source": [ "

When location is specified, they are external delta tables:

" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "c0a1d573-d054-48bb-864a-fb9eab2efaa3", "showTitle": false, "title": "" } }, "outputs": [], "source": [ "(taxis_sdf.write\n", " .mode(\"overwrite\")\n", " .option(\"overwriteSchema\", \"True\")\n", " .format(\"delta\")\n", " .save(\"/temp/taxis_sdf\"))" ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": {}, "inputWidgets": {}, "nuid": "82818b27-26e1-47fe-8130-9512f505564b", "showTitle": false, "title": "" } }, "source": [ "

Managed delta tables can be listed with the following command:

" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "2373152d-1734-43b0-9f48-db06fd29b65a", "showTitle": false, "title": "" } }, "outputs": [ { "data": { "text/html": [ "
databasetableNameisTemporary
defaultamsterdam_airbnb_dffalse
defaultcsv_studfalse
defaultdiamondsfalse
defaultdiamonds_df_not_partitionedfalse
defaultdiamonds_df_partitionedfalse
" ] }, "metadata": { "application/vnd.databricks.v1+output": { "addedWidgets": {}, "aggData": [], "aggError": "", "aggOverflow": false, "aggSchema": [], "aggSeriesLimitReached": false, "aggType": "", "arguments": {}, "columnCustomDisplayInfos": {}, "data": [ [ "default", "amsterdam_airbnb_df", false ], [ "default", "csv_stud", false ], [ "default", "diamonds", false ], [ "default", "diamonds_df_not_partitioned", false ], [ "default", "diamonds_df_partitioned", false ] ], "datasetInfos": [], "dbfsResultPath": null, "isJsonSchema": true, "metadata": {}, "overflow": false, "plotOptions": { "customPlotOptions": {}, "displayType": "table", "pivotAggregation": null, "pivotColumns": null, "xColumns": null, "yColumns": null }, "removedWidgets": [], "schema": [ { "metadata": "{}", "name": "database", "type": "\"string\"" }, { "metadata": "{}", "name": "tableName", "type": "\"string\"" }, { "metadata": "{}", "name": "isTemporary", "type": "\"boolean\"" } ], "type": "table" } }, "output_type": "display_data" } ], "source": [ "display(spark.sql(\"show tables\").limit(5))" ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "3a885b19-f56f-4d8e-9536-9100cdccdd40", "showTitle": false, "title": "" } }, "source": [ "

Delta tables can be read this way:

" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "26108474-ec39-4c7c-9e04-d91d52ebc434", "showTitle": false, "title": "" } }, "outputs": [], "source": [ "taxis_sdf = spark.table(\"taxis_sdf\") # read managed delta table\n", "taxi_sdf_ext = spark.read.load(\"/temp/taxis_sdf\") # read external delta table" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "34ff3eea-821d-4915-8de5-c2ef8da0144c", "showTitle": false, "title": "" } }, "outputs": [ { "data": { "text/html": [ "
pickupdropoffpassengersdistancefaretiptollstotalcolorpaymentpickup_zonedropoff_zonepickup_boroughdropoff_borough
2019-03-23 20:21:092019-03-23 20:27:2411.67.02.150.012.95yellowcredit cardLenox Hill WestUN/Turtle Bay SouthManhattanManhattan
2019-03-04 16:11:552019-03-04 16:19:0010.795.00.00.09.3yellowcashUpper West Side SouthUpper West Side SouthManhattanManhattan
2019-03-27 17:53:012019-03-27 18:00:2511.377.52.360.014.16yellowcredit cardAlphabet CityWest VillageManhattanManhattan
2019-03-10 01:23:592019-03-10 01:49:5117.727.06.150.036.95yellowcredit cardHudson SqYorkville WestManhattanManhattan
2019-03-30 13:27:422019-03-30 13:37:1432.169.01.10.013.4yellowcredit cardMidtown EastYorkville WestManhattanManhattan
" ] }, "metadata": { "application/vnd.databricks.v1+output": { "addedWidgets": {}, "aggData": [], "aggError": "", "aggOverflow": false, "aggSchema": [], "aggSeriesLimitReached": false, "aggType": "", "arguments": {}, "columnCustomDisplayInfos": {}, "data": [ [ "2019-03-23 20:21:09", "2019-03-23 20:27:24", 1, 1.6, 7, 2.15, 0, 12.95, "yellow", "credit card", "Lenox Hill West", "UN/Turtle Bay South", "Manhattan", "Manhattan" ], [ "2019-03-04 16:11:55", "2019-03-04 16:19:00", 1, 0.79, 5, 0, 0, 9.3, "yellow", "cash", "Upper West Side South", "Upper West Side South", "Manhattan", "Manhattan" ], [ "2019-03-27 17:53:01", "2019-03-27 18:00:25", 1, 1.37, 7.5, 2.36, 0, 14.16, "yellow", "credit card", "Alphabet City", "West Village", "Manhattan", "Manhattan" ], [ "2019-03-10 01:23:59", "2019-03-10 01:49:51", 1, 7.7, 27, 6.15, 0, 36.95, "yellow", "credit card", "Hudson Sq", "Yorkville West", "Manhattan", "Manhattan" ], [ "2019-03-30 13:27:42", "2019-03-30 13:37:14", 3, 2.16, 9, 1.1, 0, 13.4, "yellow", "credit card", "Midtown East", "Yorkville West", "Manhattan", "Manhattan" ] ], "datasetInfos": [], "dbfsResultPath": null, "isJsonSchema": true, "metadata": {}, "overflow": false, "plotOptions": { "customPlotOptions": {}, "displayType": "table", "pivotAggregation": null, "pivotColumns": null, "xColumns": null, "yColumns": null }, "removedWidgets": [], "schema": [ { "metadata": "{}", "name": "pickup", "type": "\"string\"" }, { "metadata": "{}", "name": "dropoff", "type": "\"string\"" }, { "metadata": "{}", "name": "passengers", "type": "\"long\"" }, { "metadata": "{}", "name": "distance", "type": "\"double\"" }, { "metadata": "{}", "name": "fare", "type": "\"double\"" }, { "metadata": "{}", "name": "tip", "type": "\"double\"" }, { "metadata": "{}", "name": "tolls", "type": "\"double\"" }, { "metadata": "{}", "name": "total", "type": "\"double\"" }, { "metadata": "{}", "name": "color", "type": "\"string\"" }, { "metadata": "{}", "name": "payment", "type": "\"string\"" }, { "metadata": "{}", "name": "pickup_zone", "type": "\"string\"" }, { "metadata": "{}", "name": "dropoff_zone", "type": "\"string\"" }, { "metadata": "{}", "name": "pickup_borough", "type": "\"string\"" }, { "metadata": "{}", "name": "dropoff_borough", "type": "\"string\"" } ], "type": "table" } }, "output_type": "display_data" } ], "source": [ "display(taxi_sdf_ext.limit(5))" ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "50a866b2-29e2-4dad-8c70-0ed707a3202d", "showTitle": false, "title": "" } }, "source": [ "\n", "
\n", "4. View Delta table history and load a previous version of a Delta table
" ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": {}, "inputWidgets": {}, "nuid": "cd7d2a99-f5bb-496f-9247-3376fb9c49bc", "showTitle": false, "title": "" } }, "source": [ "

Let's do a small modification to the table and save it again:

" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "767c2752-5c38-4578-b314-d8046855e60d", "showTitle": false, "title": "" } }, "outputs": [], "source": [ "# remove nulls for payment column\n", "taxis_sdf = taxis_sdf.filter(\"payment is not null\")\n", "#\n", "# save the change to managed delta table\n", "(taxis_sdf.write\n", " .mode(\"overwrite\")\n", " .option(\"overwriteSchema\", \"True\")\n", " .format(\"delta\")\n", " .saveAsTable(\"taxis_sdf\"))\n", "#\n", "# save the change to external delta table\n", "(taxis_sdf.write\n", " .mode(\"overwrite\")\n", " .option(\"overwriteSchema\", \"True\")\n", " .format(\"delta\")\n", " .save(\"/temp/taxis_sdf\"))" ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": {}, "inputWidgets": {}, "nuid": "47f64edc-5bc2-40e3-bf8c-9dbd6a5431f8", "showTitle": false, "title": "" } }, "source": [ "

See changes in managed table:

" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "implicitDf": true, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "dff5f4fd-eb40-4863-a205-4d7b322476a9", "showTitle": false, "title": "" } }, "outputs": [], "source": [ "spark.sql(\"DESCRIBE HISTORY taxis_sdf\")" ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": {}, "inputWidgets": {}, "nuid": "2633c9ea-fde4-4c51-bec7-743f315852b4", "showTitle": false, "title": "" } }, "source": [ "

See changes in external table:

" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "implicitDf": true, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "9c6fc2d6-73b3-4826-b96b-b2f6b1414b3c", "showTitle": false, "title": "" } }, "outputs": [], "source": [ "spark.sql(\"DESCRIBE HISTORY '/temp/taxis_sdf/'\")" ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": {}, "inputWidgets": {}, "nuid": "b504e861-437e-4e16-a25e-eda23b7da284", "showTitle": false, "title": "" } }, "source": [ "

The following command lets load a previous version. We see that loading version 0 of table get null values for column payment back:

" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "7b6853f6-1fd8-41af-bd52-1808086d415e", "showTitle": false, "title": "" } }, "outputs": [ { "data": { "text/html": [ "
pickupdropoffpassengersdistancefaretiptollstotalcolorpaymentpickup_zonedropoff_zonepickup_boroughdropoff_borough
2019-03-08 02:56:382019-03-08 03:07:2412.410.50.00.014.3yellownullMurray HillWest VillageManhattanManhattan
2019-03-02 19:01:362019-03-02 19:08:4601.47.00.00.010.3yellownullUpper East Side SouthMurray HillManhattanManhattan
2019-03-23 11:07:112019-03-23 11:32:4610.019.00.00.022.3yellownullWest VillageUpper East Side SouthManhattanManhattan
2019-03-15 00:10:382019-03-15 00:21:3913.112.50.00.016.3yellownullWest Chelsea/Hudson YardsUpper West Side SouthManhattanManhattan
2019-03-30 11:38:202019-03-30 11:40:4510.23.50.00.06.8yellownullUpper West Side NorthUpper West Side NorthManhattanManhattan
" ] }, "metadata": { "application/vnd.databricks.v1+output": { "addedWidgets": {}, "aggData": [], "aggError": "", "aggOverflow": false, "aggSchema": [], "aggSeriesLimitReached": false, "aggType": "", "arguments": {}, "columnCustomDisplayInfos": {}, "data": [ [ "2019-03-08 02:56:38", "2019-03-08 03:07:24", 1, 2.4, 10.5, 0, 0, 14.3, "yellow", null, "Murray Hill", "West Village", "Manhattan", "Manhattan" ], [ "2019-03-02 19:01:36", "2019-03-02 19:08:46", 0, 1.4, 7, 0, 0, 10.3, "yellow", null, "Upper East Side South", "Murray Hill", "Manhattan", "Manhattan" ], [ "2019-03-23 11:07:11", "2019-03-23 11:32:46", 1, 0, 19, 0, 0, 22.3, "yellow", null, "West Village", "Upper East Side South", "Manhattan", "Manhattan" ], [ "2019-03-15 00:10:38", "2019-03-15 00:21:39", 1, 3.1, 12.5, 0, 0, 16.3, "yellow", null, "West Chelsea/Hudson Yards", "Upper West Side South", "Manhattan", "Manhattan" ], [ "2019-03-30 11:38:20", "2019-03-30 11:40:45", 1, 0.2, 3.5, 0, 0, 6.8, "yellow", null, "Upper West Side North", "Upper West Side North", "Manhattan", "Manhattan" ] ], "datasetInfos": [], "dbfsResultPath": null, "isJsonSchema": true, "metadata": {}, "overflow": false, "plotOptions": { "customPlotOptions": {}, "displayType": "table", "pivotAggregation": null, "pivotColumns": null, "xColumns": null, "yColumns": null }, "removedWidgets": [], "schema": [ { "metadata": "{}", "name": "pickup", "type": "\"string\"" }, { "metadata": "{}", "name": "dropoff", "type": "\"string\"" }, { "metadata": "{}", "name": "passengers", "type": "\"long\"" }, { "metadata": "{}", "name": "distance", "type": "\"double\"" }, { "metadata": "{}", "name": "fare", "type": "\"double\"" }, { "metadata": "{}", "name": "tip", "type": "\"double\"" }, { "metadata": "{}", "name": "tolls", "type": "\"double\"" }, { "metadata": "{}", "name": "total", "type": "\"double\"" }, { "metadata": "{}", "name": "color", "type": "\"string\"" }, { "metadata": "{}", "name": "payment", "type": "\"string\"" }, { "metadata": "{}", "name": "pickup_zone", "type": "\"string\"" }, { "metadata": "{}", "name": "dropoff_zone", "type": "\"string\"" }, { "metadata": "{}", "name": "pickup_borough", "type": "\"string\"" }, { "metadata": "{}", "name": "dropoff_borough", "type": "\"string\"" } ], "type": "table" } }, "output_type": "display_data" }, { "data": { "text/html": [ "
pickupdropoffpassengersdistancefaretiptollstotalcolorpaymentpickup_zonedropoff_zonepickup_boroughdropoff_borough
2019-03-08 02:56:382019-03-08 03:07:2412.410.50.00.014.3yellownullMurray HillWest VillageManhattanManhattan
2019-03-02 19:01:362019-03-02 19:08:4601.47.00.00.010.3yellownullUpper East Side SouthMurray HillManhattanManhattan
2019-03-23 11:07:112019-03-23 11:32:4610.019.00.00.022.3yellownullWest VillageUpper East Side SouthManhattanManhattan
2019-03-15 00:10:382019-03-15 00:21:3913.112.50.00.016.3yellownullWest Chelsea/Hudson YardsUpper West Side SouthManhattanManhattan
2019-03-30 11:38:202019-03-30 11:40:4510.23.50.00.06.8yellownullUpper West Side NorthUpper West Side NorthManhattanManhattan
" ] }, "metadata": { "application/vnd.databricks.v1+output": { "addedWidgets": {}, "aggData": [], "aggError": "", "aggOverflow": false, "aggSchema": [], "aggSeriesLimitReached": false, "aggType": "", "arguments": {}, "columnCustomDisplayInfos": {}, "data": [ [ "2019-03-08 02:56:38", "2019-03-08 03:07:24", 1, 2.4, 10.5, 0, 0, 14.3, "yellow", null, "Murray Hill", "West Village", "Manhattan", "Manhattan" ], [ "2019-03-02 19:01:36", "2019-03-02 19:08:46", 0, 1.4, 7, 0, 0, 10.3, "yellow", null, "Upper East Side South", "Murray Hill", "Manhattan", "Manhattan" ], [ "2019-03-23 11:07:11", "2019-03-23 11:32:46", 1, 0, 19, 0, 0, 22.3, "yellow", null, "West Village", "Upper East Side South", "Manhattan", "Manhattan" ], [ "2019-03-15 00:10:38", "2019-03-15 00:21:39", 1, 3.1, 12.5, 0, 0, 16.3, "yellow", null, "West Chelsea/Hudson Yards", "Upper West Side South", "Manhattan", "Manhattan" ], [ "2019-03-30 11:38:20", "2019-03-30 11:40:45", 1, 0.2, 3.5, 0, 0, 6.8, "yellow", null, "Upper West Side North", "Upper West Side North", "Manhattan", "Manhattan" ] ], "datasetInfos": [], "dbfsResultPath": null, "isJsonSchema": true, "metadata": {}, "overflow": false, "plotOptions": { "customPlotOptions": {}, "displayType": "table", "pivotAggregation": null, "pivotColumns": null, "xColumns": null, "yColumns": null }, "removedWidgets": [], "schema": [ { "metadata": "{}", "name": "pickup", "type": "\"string\"" }, { "metadata": "{}", "name": "dropoff", "type": "\"string\"" }, { "metadata": "{}", "name": "passengers", "type": "\"long\"" }, { "metadata": "{}", "name": "distance", "type": "\"double\"" }, { "metadata": "{}", "name": "fare", "type": "\"double\"" }, { "metadata": "{}", "name": "tip", "type": "\"double\"" }, { "metadata": "{}", "name": "tolls", "type": "\"double\"" }, { "metadata": "{}", "name": "total", "type": "\"double\"" }, { "metadata": "{}", "name": "color", "type": "\"string\"" }, { "metadata": "{}", "name": "payment", "type": "\"string\"" }, { "metadata": "{}", "name": "pickup_zone", "type": "\"string\"" }, { "metadata": "{}", "name": "dropoff_zone", "type": "\"string\"" }, { "metadata": "{}", "name": "pickup_borough", "type": "\"string\"" }, { "metadata": "{}", "name": "dropoff_borough", "type": "\"string\"" } ], "type": "table" } }, "output_type": "display_data" } ], "source": [ "display(spark.read\n", " .format(\"delta\")\n", " .option(\"versionAsOf\", 0)\n", " .table(\"taxis_sdf\")\n", " .filter(\"payment is null\")\n", " .limit(5))\n", "#\n", "# Or for an external delta table\n", "display(spark.read\n", " .format(\"delta\")\n", " .option(\"versionAsOf\", 0)\n", " .load(\"/temp/taxis_sdf\")\n", " .filter(\"payment is null\")\n", " .limit(5))" ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "79ed4d6b-d326-4bd9-a5e3-518cec35431e", "showTitle": false, "title": "" } }, "source": [ "\n", "
\n", "5. Create, overwrite, merge and read Feature Store tables in machine learning workflows
" ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": {}, "inputWidgets": {}, "nuid": "9a29266d-e588-4b2d-b9bd-8d47b71113a2", "showTitle": false, "title": "" } }, "source": [ "

Let's add an index to the table:

" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "c306d83e-f531-43a7-9204-47251bdf3691", "showTitle": false, "title": "" } }, "outputs": [], "source": [ "taxis_sdf = taxis_sdf.withColumn(\"id\", monotonically_increasing_id())" ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": {}, "inputWidgets": {}, "nuid": "4bad54d4-0ad6-40aa-ae13-691ca37de84f", "showTitle": false, "title": "" } }, "source": [ "

A Catalog and schema must exist. Then if one them is missing, it is necessary to create it/them. For the example here, let's use the default existing Catalog hive_metastore and create a Schema in it:

" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "450bbaf3-2e7a-4a1c-8821-abb5ee620018", "showTitle": false, "title": "" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Out[77]: DataFrame[]" ] } ], "source": [ "# Create a catalog\n", "spark.sql(\"USE CATALOG hive_metastore\")\n", "spark.sql(\"CREATE SCHEMA IF NOT EXISTS seaborn_db\")" ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": {}, "inputWidgets": {}, "nuid": "a8c633fd-c3a5-4a7c-92e7-852bc5c9a401", "showTitle": false, "title": "" } }, "source": [ "

The following command is to create and populate a feature store table in schema seaborn_df from hive_metastore Catalog:

" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "40c131f0-f163-4e6e-9994-6d5f7e40deae", "showTitle": false, "title": "" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2023/11/22 16:27:14 WARNING databricks.feature_store._compute_client._compute_client: The feature table \"hive_metastore.seaborn_db.taxidataset\" already exists. Use \"FeatureStoreClient.write_table\" API to write to the feature table.\n" ] } ], "source": [ "fs = FeatureStoreClient()\n", "#\n", "customer_feature_table = fs.create_table(\n", " name='hive_metastore.seaborn_db.taxidataset', # format is ..\n", " primary_keys='id', # required\n", " schema=taxis_sdf.schema, # either schema or df parameter is required - if df is provided, the feature store table will be automatically populated with the data, otherwise only the structure of the table will be created, it will need to be populated later\n", " df = taxis_sdf, # either schema or df parameter is required - if df is provided, the feature store table will be automatically populated with the data, otherwise only the structure of the table will be created, it will need to be populated later\n", " description='Seaborn taxi dataset features'\n", ")" ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": {}, "inputWidgets": {}, "nuid": "4e46c464-6b96-43c3-bc10-62bc5b3b8f25", "showTitle": false, "title": "" } }, "source": [ "

The previously created table in Feature Store can be deleted this way:

" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "b6db8d70-d3b1-452c-9761-381b397e50bf", "showTitle": false, "title": "" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2023/11/22 16:27:16 WARNING databricks.feature_store._compute_client._compute_client: Deleting a feature table can lead to unexpected failures in upstream producers and downstream consumers (models, endpoints, and scheduled jobs).\n" ] } ], "source": [ "fs.drop_table(name='seaborn_db.taxidataset')" ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": {}, "inputWidgets": {}, "nuid": "6bd6a1c4-6664-4dc9-bf3c-1d430241c1bf", "showTitle": false, "title": "" } }, "source": [ "

Alternatively, an empty table structure can be created in the Feature store to be populated later. The command below only creates the structure of the table in Feature Store:

" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "971f5cc3-aed0-451e-bff0-bc670bd31b35", "showTitle": false, "title": "" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2023/11/22 16:27:21 INFO databricks.feature_store._compute_client._compute_client: Created feature table 'hive_metastore.seaborn_db.taxidataset'.\n" ] } ], "source": [ "customer_feature_table = fs.create_table(\n", " name='hive_metastore.seaborn_db.taxidataset', # format is ..\n", " primary_keys='id', # required\n", " schema=taxis_sdf.schema, # here only schema is provided, the feature table is created empty\n", " description='Seaborn taxi dataset features'\n", ")" ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": {}, "inputWidgets": {}, "nuid": "bf70b9b2-855e-4cfe-8ae7-00565238c904", "showTitle": false, "title": "" } }, "source": [ "

Next step is to write data in it:

" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "4611f26d-5325-4178-8531-34cda6e03145", "showTitle": false, "title": "" } }, "outputs": [], "source": [ "fs.write_table(\n", " df=taxis_sdf,\n", " name='seaborn_db.taxidataset',\n", " mode='merge' # mode = 'overwrite' could also be used in this particular case\n", ")" ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": {}, "inputWidgets": {}, "nuid": "34d0abe2-65cc-466d-9b39-95a695272449", "showTitle": false, "title": "" } }, "source": [ "

Finally, feature store table can be read this way:

" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "76b1d93a-94a0-4cac-b546-57e00ee6ce51", "showTitle": false, "title": "" } }, "outputs": [ { "data": { "text/html": [ "
pickupdropoffpassengersdistancefaretiptollstotalcolorpaymentpickup_zonedropoff_zonepickup_boroughdropoff_boroughid
2019-03-17 13:16:132019-03-17 13:40:3212.917.00.00.020.3yellowcashClinton EastLenox Hill EastManhattanManhattan0
2019-03-22 08:43:392019-03-22 08:51:4910.556.51.960.011.76yellowcredit cardUnion SqMurray HillManhattanManhattan1
2019-03-02 21:42:072019-03-02 22:01:1512.513.53.450.020.75yellowcredit cardSoHoMurray HillManhattanManhattan2
2019-03-13 21:49:002019-03-13 22:02:0412.6811.01.00.015.8yellowcredit cardEast ChelseaAlphabet CityManhattanManhattan3
2019-03-03 12:55:382019-03-03 13:01:0510.855.50.00.08.8yellowcashMidtown CenterMurray HillManhattanManhattan4
" ] }, "metadata": { "application/vnd.databricks.v1+output": { "addedWidgets": {}, "aggData": [], "aggError": "", "aggOverflow": false, "aggSchema": [], "aggSeriesLimitReached": false, "aggType": "", "arguments": {}, "columnCustomDisplayInfos": {}, "data": [ [ "2019-03-17 13:16:13", "2019-03-17 13:40:32", 1, 2.9, 17, 0, 0, 20.3, "yellow", "cash", "Clinton East", "Lenox Hill East", "Manhattan", "Manhattan", 0 ], [ "2019-03-22 08:43:39", "2019-03-22 08:51:49", 1, 0.55, 6.5, 1.96, 0, 11.76, "yellow", "credit card", "Union Sq", "Murray Hill", "Manhattan", "Manhattan", 1 ], [ "2019-03-02 21:42:07", "2019-03-02 22:01:15", 1, 2.5, 13.5, 3.45, 0, 20.75, "yellow", "credit card", "SoHo", "Murray Hill", "Manhattan", "Manhattan", 2 ], [ "2019-03-13 21:49:00", "2019-03-13 22:02:04", 1, 2.68, 11, 1, 0, 15.8, "yellow", "credit card", "East Chelsea", "Alphabet City", "Manhattan", "Manhattan", 3 ], [ "2019-03-03 12:55:38", "2019-03-03 13:01:05", 1, 0.85, 5.5, 0, 0, 8.8, "yellow", "cash", "Midtown Center", "Murray Hill", "Manhattan", "Manhattan", 4 ] ], "datasetInfos": [], "dbfsResultPath": null, "isJsonSchema": true, "metadata": {}, "overflow": false, "plotOptions": { "customPlotOptions": {}, "displayType": "table", "pivotAggregation": null, "pivotColumns": null, "xColumns": null, "yColumns": null }, "removedWidgets": [], "schema": [ { "metadata": "{}", "name": "pickup", "type": "\"string\"" }, { "metadata": "{}", "name": "dropoff", "type": "\"string\"" }, { "metadata": "{}", "name": "passengers", "type": "\"long\"" }, { "metadata": "{}", "name": "distance", "type": "\"double\"" }, { "metadata": "{}", "name": "fare", "type": "\"double\"" }, { "metadata": "{}", "name": "tip", "type": "\"double\"" }, { "metadata": "{}", "name": "tolls", "type": "\"double\"" }, { "metadata": "{}", "name": "total", "type": "\"double\"" }, { "metadata": "{}", "name": "color", "type": "\"string\"" }, { "metadata": "{}", "name": "payment", "type": "\"string\"" }, { "metadata": "{}", "name": "pickup_zone", "type": "\"string\"" }, { "metadata": "{}", "name": "dropoff_zone", "type": "\"string\"" }, { "metadata": "{}", "name": "pickup_borough", "type": "\"string\"" }, { "metadata": "{}", "name": "dropoff_borough", "type": "\"string\"" }, { "metadata": "{}", "name": "id", "type": "\"long\"" } ], "type": "table" } }, "output_type": "display_data" } ], "source": [ "display(fs.read_table(name='seaborn_db.taxidataset').limit(5))" ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": {}, "inputWidgets": {}, "nuid": "890328ac-571f-47ca-9a4a-cbbdbe12f402", "showTitle": false, "title": "" } }, "source": [ "" ] } ], "metadata": { "application/vnd.databricks.v1+notebook": { "dashboards": [], "language": "python", "notebookMetadata": { "mostRecentlyExecutedCommandWithImplicitDF": { "commandId": 886449356054794, "dataframes": [ "_sqldf" ] }, "pythonIndentUnit": 2 }, "notebookName": "Databricks-ML-professional-S01a-Data-management", "widgets": {} }, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.10" } }, "nbformat": 4, "nbformat_minor": 4 }