(upbeat music) - [Tutor] Databricks provides
a unified open platform for all your data. It empowers data
scientists, data engineers, and data analysts with a simple, collaborative environment to run interactive and scheduled data analysis workloads. Databricks is from the original creators of some of the world's most popular open source projects: Apache Spark, Delta
Lake, MLflow, and Koalas. It builds on these technologies to deliver a true Lakehouse architecture, combining the best of data
lakes and data warehouses for a fast, scalable, and reliable data platform. Built for the Cloud, your data is stored in low-cost Cloud object stores, such as AWS S3 and Azure Data Lake Storage, with performant access
enabled through caching, optimized data layout,
and other techniques. (upbeat music) To work with your data, you can launch clusters
with hundreds of machines, each with a mixture of CPUs and GPUs needed for your analysis. If you're on a large data team, policies can define limits on cluster sizes and configuration.
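As a rough illustration of what launching such a cluster can look like outside the UI, here is a minimal sketch using the Databricks REST API. The workspace URL, token, runtime version, and node type are placeholders, not the demo's actual setup.

```python
# Minimal sketch (not the exact demo setup): create an autoscaling cluster
# through the Databricks REST API. DATABRICKS_HOST, DATABRICKS_TOKEN, the
# runtime version, and the node type are placeholders for your own values.
import os
import requests

host = os.environ["DATABRICKS_HOST"]    # e.g. https://<workspace>.cloud.databricks.com
token = os.environ["DATABRICKS_TOKEN"]

cluster_spec = {
    "cluster_name": "analysis-cluster",
    "spark_version": "13.3.x-scala2.12",   # a Databricks Runtime (an ML Runtime also works)
    "node_type_id": "i3.xlarge",            # CPU nodes; pick a GPU node type for deep learning
    "autoscale": {"min_workers": 2, "max_workers": 390},
    # "policy_id": "...",                   # optionally attach a cluster policy to enforce limits
}

resp = requests.post(
    f"{host}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {token}"},
    json=cluster_spec,
)
resp.raise_for_status()
print("Created cluster:", resp.json()["cluster_id"])
```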
There is a Databricks runtime for data engineers and data scientists, as well as a runtime optimized for machine learning workloads. (upbeat music) See how easy it is to create a cluster with up to 390 workers. In the data science workspace, you can create collaborative
notebooks using Python, SQL, Scala, or R. (upbeat music) Just like you can share your Google Docs with individual colleagues and groups of colleagues, you can also share these notebooks. Plus, built-in commenting tied to your code helps you exchange ideas and updates with your colleagues. (upbeat music) In addition to using notebooks for exploratory data
analysis as you see here, many Databricks users love
the powerful integration with machine learning
frameworks like MLflow. Here, we're training a
model and testing it. But we can also look up at the
top here and see the MLflow experiment tracking, which records the
previous experiment runs, and you can see important metrics like their accuracy. Now, MLflow is just one of the
integrations that Databricks provides with popular
frameworks for machine learning and data science. Databricks also supports
a variety of other open source libraries that are popular in the community.
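To give a sense of what that experiment tracking looks like in code, here is a minimal MLflow sketch. The dataset, model, and run name are hypothetical stand-ins for the demo's model; the point is that parameters and metrics logged this way appear as runs in the experiment UI.

```python
# Minimal sketch of MLflow experiment tracking on a hypothetical dataset.
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

with mlflow.start_run(run_name="baseline-model"):
    n_estimators = 200
    model = RandomForestClassifier(n_estimators=n_estimators, random_state=42)
    model.fit(X_train, y_train)

    accuracy = accuracy_score(y_test, model.predict(X_test))

    # Logged parameters and metrics show up in the MLflow experiment UI,
    # so each run's accuracy can be compared against previous runs.
    mlflow.log_param("n_estimators", n_estimators)
    mlflow.log_metric("accuracy", accuracy)
    mlflow.sklearn.log_model(model, "model")
```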
Want to know more about what data your colleagues have shared with you? Take a look at the data tab, where you can see individual tables with their schema and sample data. Importantly, you also see the history of operations performed on each table, recorded in the transaction log. Now, why does history matter? Well, it's important for compliance and security audits in many industries, but it also lets you explore your data along another dimension: time.
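For a sense of how that table history surfaces in a notebook, here is a small sketch using Delta Lake's DESCRIBE HISTORY command. The table name is hypothetical, and `spark` is the SparkSession that Databricks notebooks provide automatically.

```python
# Minimal sketch: inspect a Delta table's operation history from a notebook.
# "loan_risk_scores" is a hypothetical table name; `spark` is the SparkSession
# already available in Databricks notebooks.
history_df = spark.sql("DESCRIBE HISTORY loan_risk_scores")

# Each row corresponds to a commit in the Delta transaction log: version number,
# timestamp, user, and operation (WRITE, MERGE, OPTIMIZE, ...), which is the
# information compliance and security audits rely on.
history_df.select("version", "timestamp", "userName", "operation").show(truncate=False)
```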
Let's see how by opening up this SQL analytics interface. The SQL analytics interface
gives us the ability to create visualizations and dashboards as well as query our Lakehouse
with performance comparable to, or exceeding, that of traditional data warehouses. We achieve this level of
performance, reliability, schema enforcement, and scale through advances in
Delta Lake and Delta Engine. Delta Lake is an open-format storage layer built on top of Parquet that adds ACID transactions to your Cloud data lake.
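To make that concrete, here is a minimal sketch of writing a DataFrame as a Delta table from a notebook. The table name and columns are hypothetical; the point is that every write becomes an ACID-committed version in the transaction log.

```python
# Minimal sketch: write a small DataFrame as a Delta table.
# The table name and columns are hypothetical; `spark` is the notebook SparkSession.
scores_v0 = spark.createDataFrame(
    [("CA", 62), ("TX", 55)],
    ["state", "risk_score"],
)

# Each write is an atomic commit recorded in the Delta transaction log,
# which is what later enables DESCRIBE HISTORY and time travel queries.
scores_v0.write.format("delta").mode("overwrite").saveAsTable("loan_risk_scores")
```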
Let's show you how the transaction log enables Delta Lake Time Travel. Here, we're looking at a series of loan risk scores based on where a property is located. When we originally created this dataset, in version zero, we didn't have any data for Iowa; we didn't have any loan applications there. But as time went on and we reached version 40, you can see that Iowa is populated with a loan risk score of eight, probably reflecting that the middle of the country sees a bit less in the way of natural disasters. Now let's show you the SQL
that powers these queries. Here you can see that we have simply added a version number to our SQL query to indicate which version of the data we're querying. This uses the Delta Lake Time Travel feature to find the data as it existed at a particular point in time.
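Here is a minimal sketch of what such queries can look like from a notebook, using Delta Lake's VERSION AS OF (and TIMESTAMP AS OF) syntax. The table name, columns, and timestamp are hypothetical placeholders for the demo's loan risk table.

```python
# Minimal sketch: query a Delta table at earlier versions with time travel.
# "loan_risk_scores" is a hypothetical table name; `spark` is the notebook SparkSession.

# Version 0: the table as originally created (no Iowa rows yet).
v0 = spark.sql("SELECT state, risk_score FROM loan_risk_scores VERSION AS OF 0")

# Version 40: a later snapshot, after more loan applications were loaded.
v40 = spark.sql("SELECT state, risk_score FROM loan_risk_scores VERSION AS OF 40")

# You can also travel by timestamp instead of version number.
snapshot = spark.sql("SELECT * FROM loan_risk_scores TIMESTAMP AS OF '2021-01-01'")

v0.show()
v40.show()
```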
Well, I hope you've seen how simple and powerful Databricks can be for your entire data team. Whether they're data analysts, data engineers, or data scientists, they can collaborate to do their data plus AI work on Databricks. Learn more at databricks.com.