The following is a quickstart for running Flask on Spark.
Most of the example tutorials I have found are for running a batch of Spark jobs on a Spark cluster and returning a result. I was interested in long-running tasks, and in seeing whether I could build a web app that ran on Spark. I thought it would be possible, but I didn’t think it would be this easy. Please note that the same procedure will work for lots of Python scripts, and I am interested to see what else I can load into Spark.
Prerequisites:
- Java runtime
- Python 3 (it will probably run in Python 2 with minor changes)
- Apache Spark (instructions below)
If you haven’t installed Spark then grab the latest build from
https://spark.apache.org/downloads.html and untar it in a directory somewhere.
This doesn’t need to be anywhere special and I just used ~/Downloads/
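For reference, extracting the build used in this post looks like this (assuming the tarball landed in ~/Downloads):
cd ~/Downloads
tar -xzf spark-2.3.0-bin-hadoop2.7.tgz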
Change into the root of the extracted Spark directory:
cd ~/Downloads/spark-2.3.0-bin-hadoop2.7/
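A quick sanity check that the download works (spark-submit ships in bin/):
./bin/spark-submit --version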
Create the following file, start_standalone.sh, to start Spark. This is optional but it helps me. Set JAVA_HOME correctly for your system; the trick below works for me.
#! /bin/sh
#
# start_standalone.sh
#

# Derive JAVA_HOME from the java binary on the PATH.
export JAVA_HOME=$(dirname $(dirname $(readlink -f $(which java))))

# Start a master, then attach a single worker to it.
./sbin/start-master.sh
./sbin/start-slave.sh spark://$(hostname -s):7077
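Make the script executable before running it:
chmod +x start_standalone.sh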
You should also make sure you are running the same version of Python when you start Spark as when you run spark-submit. The best way to do that is to quickly create a virtualenv and activate it. Install Flask in there while we are at it.
python3 -m venv env
source env/bin/activate
pip install flask
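If Spark still picks up a different interpreter, the standard PYSPARK_PYTHON environment variable tells spark-submit which Python to use; pointing it at the venv keeps everything consistent:
export PYSPARK_PYTHON="$(pwd)/env/bin/python"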
Now we can start our cluster (of one):
./start_standalone.sh
You should now be able to access the Spark UI at http://localhost:8080/ and see that it has 1 worker attached.
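If the UI doesn’t come up, a quick way to confirm the daemons are alive is jps (which ships with the JDK); it should list a Master and a Worker process, something like this (PIDs will differ):
$ jps
12345 Master
12346 Worker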
Next we write our Python Flask script. I have created two routes, one of which calculates Pi using the example from the Spark source code at examples/src/main/python/pi.py, with a few tweaks to remove the command-line arguments and change them to GET parameters.
#! /usr/bin/env python
# -*- coding: utf-8 -*-
# vim:fenc=utf-8
#
"""
Flask on Spark example.

Run with:
    ./bin/spark-submit --master spark://$(hostname -s):7077 exampleweb.py
"""
import sys
# Make the pyspark package importable when running from the Spark root.
sys.path.append('./python')

from random import random
from operator import add

from flask import Flask, request
from pyspark.sql import SparkSession

app = Flask(__name__)

spark = SparkSession\
    .builder\
    .appName("Flark - Flask on Spark")\
    .getOrCreate()


@app.route("/")
def hello():
    return "Hello World! There is a spark example at <a href=\"/pi?partitions=1\">/pi</a>"


@app.route("/pi")
def pi():
    try:
        partitions = int(request.args.get('partitions', '1'))
    except ValueError as e:
        return str(e), 400
    # Cap the partition count so a stray request can't hog the cluster.
    partitions = min(partitions, 4)
    n = 1000000 * partitions

    def f(_):
        # Sample a point in the 2x2 square; count a hit if it lands
        # inside the unit circle. The hit ratio approximates pi/4.
        x = random() * 2 - 1
        y = random() * 2 - 1
        return 1 if x ** 2 + y ** 2 <= 1 else 0

    count = spark.sparkContext.parallelize(range(1, n + 1), partitions).map(f).reduce(add)
    return "Pi is roughly %f" % (4.0 * count / n)


if __name__ == "__main__":
    app.run()
Now we can start our Flask application by submitting it to Spark.
./bin/spark-submit --master spark://$(hostname -s):7077 exampleweb.py
And then access it at http://localhost:5000/ and http://localhost:5000/pi?partitions=1
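A quick smoke test from another terminal (the digits after 3.14 will vary from run to run):
curl "http://localhost:5000/pi?partitions=2"
Pi is roughly 3.141597
When you are finished, the matching stop scripts shut the cluster down:
./sbin/stop-slave.sh
./sbin/stop-master.sh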