Apache Spark Experiments

I’m in the process of learning Apache Spark, both for processing and transforming large data sets and for machine learning. As I dig into different facets of Spark, I’m compiling notes and experiments in a series of Jupyter notebooks.

I’ve published these notebooks to a GitHub repo, spark-experiments. Right now it contains some basic experiments and some Spark SQL experiments. I’ll be adding more as I go.

Rather than setting up Jupyter, Spark, and everything else locally, I found an existing Docker image, pyspark-notebook, that contains everything I need, including matplotlib to visualize the data as I get further along. If you have Docker installed, you start the container with a single command and you’re off and running. See the spark-experiments installation instructions for details.

Initially, I was going to create my own sample data sets for the experiments. I’m mostly interested in learning the operations and the process rather than running against a large data set on a cluster of servers, so a small data set is fine. But then I hit on the idea of using publicly available data sets instead, such as those from data.cms.gov. Maybe we’ll turn up something interesting, and it’ll be more real-worldish.
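To give a flavor of what a small Spark SQL experiment can look like, here is a minimal sketch rather than code taken from the notebooks. It assumes Spark 2.0 or later (for SparkSession) running in local mode. The data set, the column names (provider, state, payment), and both function names are made up for illustration and do not come from data.cms.gov.

```python
import csv

# Hypothetical sample rows for illustration; not real data.cms.gov data.
SAMPLE_ROWS = [
    {"provider": "A", "state": "CA", "payment": 120.0},
    {"provider": "B", "state": "CA", "payment": 80.0},
    {"provider": "C", "state": "NY", "payment": 200.0},
]


def write_sample_csv(path):
    """Write the small sample data set to `path` as a CSV with a header row."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["provider", "state", "payment"])
        writer.writeheader()
        writer.writerows(SAMPLE_ROWS)
    return path


def total_payment_by_state(csv_path):
    """Load the CSV into a DataFrame and aggregate it with a Spark SQL query."""
    # Imported here so write_sample_csv also works where Spark is not installed.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master("local[*]")  # run on the local machine, no cluster needed
        .appName("spark-sql-sketch")
        .getOrCreate()
    )
    try:
        # inferSchema lets Spark read the payment column as a number.
        df = spark.read.csv(str(csv_path), header=True, inferSchema=True)
        # Register the DataFrame under a name that SQL queries can refer to.
        df.createOrReplaceTempView("payments")
        rows = spark.sql(
            "SELECT state, SUM(payment) AS total "
            "FROM payments GROUP BY state ORDER BY state"
        ).collect()
        return [(row["state"], row["total"]) for row in rows]
    finally:
        spark.stop()
```

In a notebook cell, calling total_payment_by_state(write_sample_csv("payments.csv")) should return one (state, total) pair per state. Swapping in a CSV downloaded from a public source would only mean changing the path and the query.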
