Scaling Machine Learning: How to Train a Very Large Model Using Spark

We often use libraries like Pandas and Scikit-Learn to preprocess data and train our machine learning models for personal projects or competitions on platforms like Kaggle.

However, when we deal with big data in the real world, we need an approach that can leverage many CPUs or GPUs for data processing and for training machine learning models. That’s where Apache Spark comes in.

In this article, we will discuss Apache Spark and its hands-on implementation using the Python-compatible PySpark.

*Looking for the Colab Notebook for this post? Find it right here.*

What is Apache Spark, and what are its benefits?

Apache Spark is a distributed, general-purpose computing framework. It is open-source software originally developed in 2009 at the AMPLab at the University of California, Berkeley.

After its initial release in 2010, Spark grew in popularity across many industries, and it has since been developed significantly by a worldwide developer community. Spark supports development APIs in many programming languages, such as Scala, Java, Python, and R.

Apache Spark’s primary purpose was to address the limitations of Hadoop MapReduce. Spark reads data into memory, performs necessary operations, and writes results back—this allows for fast processing time, as opposed to MapReduce where each iteration requires disk read and write. Spark also uses in-memory caching for data reuse that makes it much faster than MapReduce. 

Some benefits of Apache Spark are:

  • It is fast and can process and query data of any size
  • It is developer-friendly due to the support provided in many programming languages like Java, Python, Scala, and R
  • It can handle multiple workloads like machine learning (Spark MLlib), interactive queries (Spark SQL), graph processing (Spark GraphX), and real-time analytics (Spark Streaming)

Different ML and deep learning frameworks built on Spark

There are many machine learning and deep learning frameworks developed on top of Spark; in this article, we focus on MLlib, Spark’s built-in machine learning library.

Machine learning using Spark MLlib

MLlib is a machine learning library included in the Spark framework. It was developed to do machine learning at scale with ease. Below are some tools provided as part of MLlib:

  1. Machine learning algorithms: regression, classification, clustering, collaborative filtering, etc.
  2. Featurization: feature selection, extraction, dimensionality reduction, transformation, etc.
  3. Pipelines: construction, evaluation, and tuning of machine learning pipelines
  4. Persistence: saving and loading models and pipelines

In this section, we will build a machine learning model using PySpark (the Python API for Spark) and MLlib on a sample dataset provided by Spark. We will use the Google Colab platform, which is similar to Jupyter notebooks, for coding and developing the model, as it is free to use and easy to set up. For real big data processing and modeling, one can use platforms like Databricks, AWS EMR, GCP Dataproc, etc.

The dataset under consideration may look small for a discussion of big data, but the code we develop here works seamlessly with large datasets hosted on S3, HDFS, Redshift, Cassandra, Couchbase, etc. We use this dataset to explain how PySpark, MLlib, and some basic concepts work.

The code for this tutorial, with a detailed explanation, can be found here.

Is machine learning engineering the right career for you?

Knowing machine learning and deep learning concepts is important—but not enough to get you hired. According to hiring managers, most job seekers lack the engineering skills to perform the job. This is why more than 50% of Springboard's Machine Learning Career Track curriculum is focused on production engineering skills. In this course, you'll design a machine learning/deep learning system, build a prototype, and deploy a running application that can be accessed via API or web service. No other bootcamp does this.

Our machine learning training will teach you linear and logistic regression, anomaly detection, and cleaning and transforming data. We’ll also teach you the most in-demand ML models and algorithms you’ll need to know to succeed. For each model, you will first learn how it works conceptually, then the applied mathematics necessary to implement it, and finally how to train and test it.

Find out if you're eligible for Springboard's Machine Learning Career Track.

Ready to learn more?

Browse our Career Tracks and find the perfect fit