Databricks


Snippet from Wikipedia: Databricks

Databricks, Inc. is a global data, analytics and artificial intelligence company founded by the original creators of Apache Spark.

The company provides a cloud-based platform to help enterprises build, scale, and govern data and AI, including generative AI and other machine learning models.

Databricks pioneered the data lakehouse, a data and AI platform that combines the capabilities of a data warehouse with a data lake, allowing organizations to manage and use both structured and unstructured data for traditional business analytics and AI workloads.

Databricks acquired MosaicML for $1.4 billion in June 2023, its largest acquisition.

In November 2023, Databricks unveiled the Databricks Data Intelligence Platform, a new offering that combines the unification benefits of the lakehouse with MosaicML’s Generative AI technology to enable customers to better understand and use their own proprietary data.

The company develops Delta Lake, an open-source project to bring reliability to data lakes for machine learning and other data science use cases.

AN OPEN AND UNIFIED DATA ANALYTICS PLATFORM FOR DATA ENGINEERING, MACHINE LEARNING, AND ANALYTICS

From the original creators of Apache Spark, Delta Lake, MLflow, and Koalas

Select a platform:

DATABRICKS PLATFORM - FREE TRIAL

For businesses

  • Collaborative environment for Data teams to build solutions together
  • Unlimited clusters that can scale to any size, processing data in your own account
  • Job scheduler to execute jobs for production pipelines
  • Fully collaborative notebooks with multi-language support, dashboards, REST APIs
  • Native integration with the most popular ML frameworks (scikit-learn, TensorFlow, Keras, …), Apache Spark™, Delta Lake, and MLflow
  • Advanced security, role-based access controls, and audit logs
  • Single Sign-On support
  • Integration with BI tools such as Tableau, Qlik, and Looker
  • 14-day full feature trial (excludes cloud charges)

CHOOSE YOUR CLOUD

Please note that Azure Databricks is provided by Microsoft Azure and is subject to Microsoft's terms. By clicking the “AWS” or “Google Cloud” button to get started, you agree to the Databricks Terms of Service.

COMMUNITY EDITION

For students and educational institutions

  • Single Spark cluster limited to 15 GB of memory, with no worker nodes
  • Basic notebooks without collaboration
  • Limited to a maximum of 3 users
  • Public environment to share your work

By clicking “Get Started” for the Community Edition, you agree to the Databricks Community Edition Terms of Service.

https://databricks.com/try-platform


Welcome to Databricks Community Edition!

Databricks Community Edition provides you with access to a free Spark micro-cluster as well as a cluster manager and a notebook environment - ideal for developers, data scientists, data engineers and other IT professionals to get started with Spark.

We need you to verify your email address by clicking on this link. You will then be redirected to Databricks Community Edition!

Get started by visiting: https://community.cloud.databricks.com/login.html

If you have any questions, please contact [email protected].

- The Databricks Team


Instructor: Adam Breindel

LinkedIn: https://www.linkedin.com/in/adbreind

Email: [email protected]

Twitter: @adbreind

  • 20+ years building systems for startups and large enterprises
  • 10+ years teaching data, ML, front- and back-end technology
  • Fun large-scale data projects…
      - Streaming neural net + decision tree fraud scoring
      - Realtime & offline analytics for banking
      - Music synchronization and licensing for networked jukeboxes
  • Industries
      - Finance, Insurance
      - Travel, Media / Entertainment
      - Energy, Government

Create a Databricks account

  • Sign up for free Community Edition now at https://databricks.com/try-databricks
  • Use Firefox, Chrome or Safari

Getting Started

These steps are illustrated on subsequent pages; this is the summary:

1. Copy the courseware link or prepare to type it ☺: https://materials.s3.amazonaws.com/2021/oreilly0708/spark.dbc

2. Import that file into your Databricks account per the instructions on the following slides.

3. Create a cluster: choose Databricks Runtime 8.2 (illustrated in the following slides).

Setup with Databricks

1. Log in to Databricks.

2. Import notebooks for today: choose URL, type or paste in today's notebook URL (https://materials.s3.amazonaws.com/2021/oreilly0708/spark.dbc), then click Import.

3. Find your notebook(s): click Workspace; your notebooks are listed there.

4. Create a cluster: choose Runtime 8.2 (Scala 2.12, Spark 3.1.1).

5. All set: let's go!

https://learning.oreilly.com/live-events/spark-31-first-steps/0636920371533/0636920061945/

https://on24static.akamaized.net/event/33/46/13/4/rt/1/documents/resourceList1631717069306/setup.pdf


Create Cluster

New Cluster — 0 Workers: 0 GB Memory, 0 Cores, 0 DBU; 1 Driver: 15.3 GB Memory, 2 Cores, 1 DBU (UI | JSON)

  • Cluster Name: Buddha
  • Databricks Runtime Version: Runtime 8.3 (Scala 2.12, Spark 3.1.1)
  • Note: Databricks Runtime 8.x and later use Delta Lake as the default table format.
  • Instance: Free 15 GB Memory

As a Community Edition user, your cluster will automatically terminate after an idle period of two hours. For more configuration options, please upgrade your Databricks subscription.
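The cluster form can also be expressed as a JSON-style spec in the format used by the Databricks Clusters API. The sketch below (as a Python dict) is an assumption matching the Community Edition settings shown above, not output copied from the UI:

```python
# Illustrative cluster spec in the Clusters-API field naming; the values
# mirror the Community Edition form above (driver-only, Runtime 8.3,
# two-hour idle auto-termination). Field values are assumptions.
cluster_spec = {
    "cluster_name": "Buddha",
    "spark_version": "8.3.x-scala2.12",  # Runtime 8.3 (Scala 2.12, Spark 3.1.1)
    "num_workers": 0,                    # Community Edition: no worker nodes
    "autotermination_minutes": 120,      # terminates after two idle hours
}
```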


Import Notebooks

Import from: File URL

https://materials.s3.amazonaws.com/2021/oreilly0708/spark.dbc

Accepted formats: .dbc, .scala, .py, .sql, .r, .ipynb, .Rmd, .html, .zip
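Besides the Import dialog, a `.dbc` archive can be pushed into a workspace programmatically via the Workspace Import REST API (`POST /api/2.0/workspace/import`), which takes the file content base64-encoded. A hedged sketch, where the host, token, and target path are placeholders for your own workspace:

```python
import base64

def build_import_payload(dbc_bytes, workspace_path):
    """Build the JSON body for the Workspace Import API.

    The API expects the archive content base64-encoded and a format
    string matching the accepted formats listed above (here: DBC).
    """
    return {
        "path": workspace_path,
        "format": "DBC",
        "content": base64.b64encode(dbc_bytes).decode("ascii"),
        "overwrite": True,
    }

def import_notebooks(host, token, dbc_bytes, workspace_path):
    # Requires the `requests` package and a real workspace URL + token.
    import requests
    resp = requests.post(
        f"{host}/api/2.0/workspace/import",
        headers={"Authorization": f"Bearer {token}"},
        json=build_import_payload(dbc_bytes, workspace_path),
    )
    resp.raise_for_status()
```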


This notebook is not attached to a cluster. Would you like to launch a new Spark cluster to continue working?

Automatically launch and attach to clusters without prompting


First Steps with Apache Spark 3.1

First Steps with Apache Spark 3.1

Class Logistics and Operations


Topics


%fs
ls /databricks-datasets/amazon/

Output (columns: path, name, size; only the path column is shown here):

  • dbfs:/databricks-datasets/amazon/README.md
  • dbfs:/databricks-datasets/amazon/data20K/
  • dbfs:/databricks-datasets/amazon/test4K/
  • dbfs:/databricks-datasets/amazon/users/
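The `%fs` magic is a notebook shorthand; roughly the same listing can be produced in a Python cell with `dbutils.fs.ls`. This is a sketch: `dbutils` is injected only inside a Databricks notebook, and the formatting helper is a hypothetical name added here for illustration:

```python
def format_listing(entries):
    """Render (path, name, size) tuples like the %fs ls table above."""
    return [f"{path}  {name}  {size}" for path, name, size in entries]

def list_amazon_datasets():
    # `dbutils` exists only in the Databricks notebook runtime; each
    # FileInfo returned by dbutils.fs.ls has path, name, and size fields.
    files = dbutils.fs.ls("/databricks-datasets/amazon/")  # noqa: F821
    return format_listing((f.path, f.name, f.size) for f in files)
```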

Where is this Spark data source? Currently, Amazon S3.

In this version of Databricks, “DBFS” (a wrapper over S3, similar to EMRFS) is the default Spark filesystem.

Other common defaults include HDFS, “local”, another cloud-based object store like Azure Blob Storage, or a Kubernetes-friendly storage layer like MinIO.
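In practice, Spark dispatches to a filesystem implementation based on the URI scheme of the path, falling back to the configured default filesystem (DBFS here) for scheme-less paths. A small sketch with hypothetical example paths:

```python
from urllib.parse import urlparse

# Hypothetical paths illustrating the schemes mentioned above: DBFS,
# S3 (via s3a), HDFS, and the local filesystem.
examples = [
    "dbfs:/databricks-datasets/amazon/data20K",
    "s3a://my-bucket/amazon/data20K",
    "hdfs://namenode:8020/data/amazon",
    "file:/tmp/amazon",
]

def fs_scheme(path, default="dbfs"):
    """Return the filesystem scheme a path would dispatch on; paths
    with no scheme fall back to the default filesystem."""
    scheme = urlparse(path).scheme
    return scheme or default
```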


%sql SELECT * FROM parquet.`/databricks-datasets/amazon/data20K`



Spark Jobs: Job 2 — View (Stages: 1/1)

df: org.apache.spark.sql.DataFrame = [rating: double, review: string]

  • rating: double
  • review: string
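The `%sql` cell above can be reproduced from Python (PySpark). A sketch, assuming a running Spark session such as the one a Databricks cluster provides; the helper for building the `parquet.` table reference is a hypothetical name added for illustration:

```python
def parquet_ref(path):
    """Build the parquet.`<path>` table reference used in Spark SQL."""
    return f"parquet.`{path}`"

def load_reviews(path="/databricks-datasets/amazon/data20K"):
    # Requires a Spark runtime (e.g. an attached Databricks cluster);
    # returns a DataFrame with the [rating: double, review: string] schema
    # shown above.
    from pyspark.sql import SparkSession
    spark = SparkSession.builder.getOrCreate()
    return spark.sql(f"SELECT * FROM {parquet_ref(path)}")
```

`spark.read.parquet(path)` would load the same data; the `spark.sql` form mirrors the `%sql` cell directly.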


databricks.txt · Last modified: 2021/10/07 12:41 by 127.0.0.1