Recap: Databricks Workshop on Apache Spark

Update, July 19, 2016: Databricks has since posted their video of the night’s presentation, which we’ve now embedded below. — DSWG


On July 6th, Code for San Francisco invited everyone to its data science-focused civic hack night at the MS Reactor, where Databricks gave a talk and workshop on their platform. CfSF members also had the option of simply hanging out or working on their team projects.

It was a fun and informative time, with a number of the brigade’s Data Science Working Group members leaving inspired to use Databricks in their collaborative and/or “big data” projects.

Databricks is an SF-based data analysis platform built on Apache Spark, combined with Databricks’ own infrastructure innovations. The platform provisions Amazon EC2 clusters on the back end, directly from the platform’s user interface, and offers Python/R notebooks and other collaborative coding tools on the front end.

You can access a free ‘community edition’ of their platform here: databricks.com/try

The evening began with greetings and pot pies. Then, at 7pm or so, Databricks members Sameer Farooqui and Jules Damji gave their introductions, a brief history of Spark, and a summary of the goals and schedule for the workshop. Spark itself has had an interesting history: it began in 2010 as a distributed computing framework project at UC Berkeley and has since evolved into today’s Spark 2.0. The goals of the evening were:

  • Discuss the architecture of their platform
  • Teach the platform using example data from SF Open Data

We used real data from data.sfgov.org as example data during the workshop. Sameer had prepared the talk around the City’s fire incident data, since the 4th of July weekend had just passed and he thought such data would be interesting. Indeed it was, especially when invoking Spark optimization cut our data processing times from ~20-30 seconds to ~0.10-0.15 seconds per instance!

The workshop was recorded, by the way, so, if you are interested in learning Spark on Databricks’ platform, definitely check it out! (see below).

If you just want to read along you can also check out their ‘read along doc’:
bit.ly/sfopenreadalong

Getting Started Steps:

  • Read the Databricks Spark Readme
  • Create a free account on their platform
  • Create a Spark 2.0 (RC2) Cluster
  • Download the ‘.dbc’ file (contains the data and readme) here: bit.ly/sfopenlabs
  • Import that file into the platform’s workspace.
  • Take note of the ‘datasets_mount’ and ‘Fire Incidents Exploration – RunMe’ files.
  • ‘Attach’ the ‘datasets_mount’ notebook to the cluster. (There is a button that probably says ‘detached’; click it and select the cluster you created.)
  • ‘Run All’ in the ‘datasets_mount’ notebook to create some folders with the example data.
  • ‘Attach’ the ‘Fire Incidents Exploration – RunMe’ notebook to the cluster as well.

Other items from that evening:

Jennifer Wong collected signatures to open up important data from city government.

Jennifer of WeVote also mentioned that her group is hosting an iOS DevCamp July 22nd through 24th, which includes a civic hack track. Women receive 50% off admission.

Quote from the night:

    “Where do good ideas come from? When people collaborate.”