{ "cells": [ { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "58fab4bb-231e-48cf-8ed4-fc15a1b22845", "showTitle": false, "title": "" } }, "source": [ "
This Notebook adds information related to the following requirements:
Download this notebook at format ipynb here.
\n", "pickup | dropoff | passengers | distance | fare | tip | tolls | total | color | payment | pickup_zone | dropoff_zone | pickup_borough | dropoff_borough |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2019-03-23 20:21:09 | 2019-03-23 20:27:24 | 1 | 1.6 | 7.0 | 2.15 | 0.0 | 12.95 | yellow | credit card | Lenox Hill West | UN/Turtle Bay South | Manhattan | Manhattan |
2019-03-04 16:11:55 | 2019-03-04 16:19:00 | 1 | 0.79 | 5.0 | 0.0 | 0.0 | 9.3 | yellow | cash | Upper West Side South | Upper West Side South | Manhattan | Manhattan |
2019-03-27 17:53:01 | 2019-03-27 18:00:25 | 1 | 1.37 | 7.5 | 2.36 | 0.0 | 14.16 | yellow | credit card | Alphabet City | West Village | Manhattan | Manhattan |
2019-03-10 01:23:59 | 2019-03-10 01:49:51 | 1 | 7.7 | 27.0 | 6.15 | 0.0 | 36.95 | yellow | credit card | Hudson Sq | Yorkville West | Manhattan | Manhattan |
2019-03-30 13:27:42 | 2019-03-30 13:37:14 | 3 | 2.16 | 9.0 | 1.1 | 0.0 | 13.4 | yellow | credit card | Midtown East | Yorkville West | Manhattan | Manhattan |
Dataframes are saved by default as managed delta tables when the location is not specified:
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "92c6fbbf-0a08-4fee-8ad7-abdf5a0f9ea4", "showTitle": false, "title": "" } }, "outputs": [], "source": [ "(taxis_sdf.write\n", " .mode(\"overwrite\")\n", " .option(\"overwriteSchema\", \"True\")\n", " .format(\"delta\")\n", " .saveAsTable(\"taxis_sdf\"))" ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": {}, "inputWidgets": {}, "nuid": "bd1a26b8-58ce-421a-82a1-420786c4b143", "showTitle": false, "title": "" } }, "source": [ "When location is specified, they are external delta tables:
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "c0a1d573-d054-48bb-864a-fb9eab2efaa3", "showTitle": false, "title": "" } }, "outputs": [], "source": [ "(taxis_sdf.write\n", " .mode(\"overwrite\")\n", " .option(\"overwriteSchema\", \"True\")\n", " .format(\"delta\")\n", " .save(\"/temp/taxis_sdf\"))" ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": {}, "inputWidgets": {}, "nuid": "82818b27-26e1-47fe-8130-9512f505564b", "showTitle": false, "title": "" } }, "source": [ "Managed delta tables can be listed with the following command:
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "2373152d-1734-43b0-9f48-db06fd29b65a", "showTitle": false, "title": "" } }, "outputs": [ { "data": { "text/html": [ "database | tableName | isTemporary |
---|---|---|
default | amsterdam_airbnb_df | false |
default | csv_stud | false |
default | diamonds | false |
default | diamonds_df_not_partitioned | false |
default | diamonds_df_partitioned | false |
Delta tables can be read this way:
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "26108474-ec39-4c7c-9e04-d91d52ebc434", "showTitle": false, "title": "" } }, "outputs": [], "source": [ "taxis_sdf = spark.table(\"taxis_sdf\") # read managed delta table\n", "taxi_sdf_ext = spark.read.load(\"/temp/taxis_sdf\") # read external delta table" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "34ff3eea-821d-4915-8de5-c2ef8da0144c", "showTitle": false, "title": "" } }, "outputs": [ { "data": { "text/html": [ "pickup | dropoff | passengers | distance | fare | tip | tolls | total | color | payment | pickup_zone | dropoff_zone | pickup_borough | dropoff_borough |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2019-03-23 20:21:09 | 2019-03-23 20:27:24 | 1 | 1.6 | 7.0 | 2.15 | 0.0 | 12.95 | yellow | credit card | Lenox Hill West | UN/Turtle Bay South | Manhattan | Manhattan |
2019-03-04 16:11:55 | 2019-03-04 16:19:00 | 1 | 0.79 | 5.0 | 0.0 | 0.0 | 9.3 | yellow | cash | Upper West Side South | Upper West Side South | Manhattan | Manhattan |
2019-03-27 17:53:01 | 2019-03-27 18:00:25 | 1 | 1.37 | 7.5 | 2.36 | 0.0 | 14.16 | yellow | credit card | Alphabet City | West Village | Manhattan | Manhattan |
2019-03-10 01:23:59 | 2019-03-10 01:49:51 | 1 | 7.7 | 27.0 | 6.15 | 0.0 | 36.95 | yellow | credit card | Hudson Sq | Yorkville West | Manhattan | Manhattan |
2019-03-30 13:27:42 | 2019-03-30 13:37:14 | 3 | 2.16 | 9.0 | 1.1 | 0.0 | 13.4 | yellow | credit card | Midtown East | Yorkville West | Manhattan | Manhattan |
Let's do a small modification to the table and save it again:
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "767c2752-5c38-4578-b314-d8046855e60d", "showTitle": false, "title": "" } }, "outputs": [], "source": [ "# remove nulls for payment column\n", "taxis_sdf = taxis_sdf.filter(\"payment is not null\")\n", "#\n", "# save the change to managed delta table\n", "(taxis_sdf.write\n", " .mode(\"overwrite\")\n", " .option(\"overwriteSchema\", \"True\")\n", " .format(\"delta\")\n", " .saveAsTable(\"taxis_sdf\"))\n", "#\n", "# save the change to external delta table\n", "(taxis_sdf.write\n", " .mode(\"overwrite\")\n", " .option(\"overwriteSchema\", \"True\")\n", " .format(\"delta\")\n", " .save(\"/temp/taxis_sdf\"))" ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": {}, "inputWidgets": {}, "nuid": "47f64edc-5bc2-40e3-bf8c-9dbd6a5431f8", "showTitle": false, "title": "" } }, "source": [ "See changes in managed table:
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "implicitDf": true, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "dff5f4fd-eb40-4863-a205-4d7b322476a9", "showTitle": false, "title": "" } }, "outputs": [], "source": [ "spark.sql(\"DESCRIBE HISTORY taxis_sdf\")" ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": {}, "inputWidgets": {}, "nuid": "2633c9ea-fde4-4c51-bec7-743f315852b4", "showTitle": false, "title": "" } }, "source": [ "See changes in external table:
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "implicitDf": true, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "9c6fc2d6-73b3-4826-b96b-b2f6b1414b3c", "showTitle": false, "title": "" } }, "outputs": [], "source": [ "spark.sql(\"DESCRIBE HISTORY '/temp/taxis_sdf/'\")" ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": {}, "inputWidgets": {}, "nuid": "b504e861-437e-4e16-a25e-eda23b7da284", "showTitle": false, "title": "" } }, "source": [ "The following command lets load a previous version. We see that loading version 0
of table get null values for column payment
back:
pickup | dropoff | passengers | distance | fare | tip | tolls | total | color | payment | pickup_zone | dropoff_zone | pickup_borough | dropoff_borough |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2019-03-08 02:56:38 | 2019-03-08 03:07:24 | 1 | 2.4 | 10.5 | 0.0 | 0.0 | 14.3 | yellow | null | Murray Hill | West Village | Manhattan | Manhattan |
2019-03-02 19:01:36 | 2019-03-02 19:08:46 | 0 | 1.4 | 7.0 | 0.0 | 0.0 | 10.3 | yellow | null | Upper East Side South | Murray Hill | Manhattan | Manhattan |
2019-03-23 11:07:11 | 2019-03-23 11:32:46 | 1 | 0.0 | 19.0 | 0.0 | 0.0 | 22.3 | yellow | null | West Village | Upper East Side South | Manhattan | Manhattan |
2019-03-15 00:10:38 | 2019-03-15 00:21:39 | 1 | 3.1 | 12.5 | 0.0 | 0.0 | 16.3 | yellow | null | West Chelsea/Hudson Yards | Upper West Side South | Manhattan | Manhattan |
2019-03-30 11:38:20 | 2019-03-30 11:40:45 | 1 | 0.2 | 3.5 | 0.0 | 0.0 | 6.8 | yellow | null | Upper West Side North | Upper West Side North | Manhattan | Manhattan |
pickup | dropoff | passengers | distance | fare | tip | tolls | total | color | payment | pickup_zone | dropoff_zone | pickup_borough | dropoff_borough |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2019-03-08 02:56:38 | 2019-03-08 03:07:24 | 1 | 2.4 | 10.5 | 0.0 | 0.0 | 14.3 | yellow | null | Murray Hill | West Village | Manhattan | Manhattan |
2019-03-02 19:01:36 | 2019-03-02 19:08:46 | 0 | 1.4 | 7.0 | 0.0 | 0.0 | 10.3 | yellow | null | Upper East Side South | Murray Hill | Manhattan | Manhattan |
2019-03-23 11:07:11 | 2019-03-23 11:32:46 | 1 | 0.0 | 19.0 | 0.0 | 0.0 | 22.3 | yellow | null | West Village | Upper East Side South | Manhattan | Manhattan |
2019-03-15 00:10:38 | 2019-03-15 00:21:39 | 1 | 3.1 | 12.5 | 0.0 | 0.0 | 16.3 | yellow | null | West Chelsea/Hudson Yards | Upper West Side South | Manhattan | Manhattan |
2019-03-30 11:38:20 | 2019-03-30 11:40:45 | 1 | 0.2 | 3.5 | 0.0 | 0.0 | 6.8 | yellow | null | Upper West Side North | Upper West Side North | Manhattan | Manhattan |
Let's add an index to the table:
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "c306d83e-f531-43a7-9204-47251bdf3691", "showTitle": false, "title": "" } }, "outputs": [], "source": [ "taxis_sdf = taxis_sdf.withColumn(\"id\", monotonically_increasing_id())" ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": {}, "inputWidgets": {}, "nuid": "4bad54d4-0ad6-40aa-ae13-691ca37de84f", "showTitle": false, "title": "" } }, "source": [ "A Catalog and schema must exist. Then if one them is missing, it is necessary to create it/them. For the example here, let's use the default existing Catalog hive_metastore
and create a Schema in it:
The following command is to create and populate a feature store table in schema seaborn_df
from hive_metastore
Catalog:
The previously created table in Feature Store can be deleted this way:
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "b6db8d70-d3b1-452c-9761-381b397e50bf", "showTitle": false, "title": "" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2023/11/22 16:27:16 WARNING databricks.feature_store._compute_client._compute_client: Deleting a feature table can lead to unexpected failures in upstream producers and downstream consumers (models, endpoints, and scheduled jobs).\n" ] } ], "source": [ "fs.drop_table(name='seaborn_db.taxidataset')" ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": {}, "inputWidgets": {}, "nuid": "6bd6a1c4-6664-4dc9-bf3c-1d430241c1bf", "showTitle": false, "title": "" } }, "source": [ "Alternatively, an empty table structure can be created in the Feature store to be populated later. The command below only creates the structure of the table in Feature Store:
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "971f5cc3-aed0-451e-bff0-bc670bd31b35", "showTitle": false, "title": "" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2023/11/22 16:27:21 INFO databricks.feature_store._compute_client._compute_client: Created feature table 'hive_metastore.seaborn_db.taxidataset'.\n" ] } ], "source": [ "customer_feature_table = fs.create_table(\n", " name='hive_metastore.seaborn_db.taxidataset', # format isNext step is to write data in it:
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "4611f26d-5325-4178-8531-34cda6e03145", "showTitle": false, "title": "" } }, "outputs": [], "source": [ "fs.write_table(\n", " df=taxis_sdf,\n", " name='seaborn_db.taxidataset',\n", " mode='merge' # mode = 'overwrite' could also be used in this particular case\n", ")" ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": {}, "inputWidgets": {}, "nuid": "34d0abe2-65cc-466d-9b39-95a695272449", "showTitle": false, "title": "" } }, "source": [ "Finally, feature store table can be read this way:
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "76b1d93a-94a0-4cac-b546-57e00ee6ce51", "showTitle": false, "title": "" } }, "outputs": [ { "data": { "text/html": [ "pickup | dropoff | passengers | distance | fare | tip | tolls | total | color | payment | pickup_zone | dropoff_zone | pickup_borough | dropoff_borough | id |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2019-03-17 13:16:13 | 2019-03-17 13:40:32 | 1 | 2.9 | 17.0 | 0.0 | 0.0 | 20.3 | yellow | cash | Clinton East | Lenox Hill East | Manhattan | Manhattan | 0 |
2019-03-22 08:43:39 | 2019-03-22 08:51:49 | 1 | 0.55 | 6.5 | 1.96 | 0.0 | 11.76 | yellow | credit card | Union Sq | Murray Hill | Manhattan | Manhattan | 1 |
2019-03-02 21:42:07 | 2019-03-02 22:01:15 | 1 | 2.5 | 13.5 | 3.45 | 0.0 | 20.75 | yellow | credit card | SoHo | Murray Hill | Manhattan | Manhattan | 2 |
2019-03-13 21:49:00 | 2019-03-13 22:02:04 | 1 | 2.68 | 11.0 | 1.0 | 0.0 | 15.8 | yellow | credit card | East Chelsea | Alphabet City | Manhattan | Manhattan | 3 |
2019-03-03 12:55:38 | 2019-03-03 13:01:05 | 1 | 0.85 | 5.5 | 0.0 | 0.0 | 8.8 | yellow | cash | Midtown Center | Murray Hill | Manhattan | Manhattan | 4 |