Yanfei Kang
yanfeikang@buaa.edu.cn
School of Economics and Management
Beihang University
http://yanfei.site
Apache Spark is a unified analytics engine for large-scale data processing.
Write applications quickly in Java, Scala, Python, R, and SQL.
Spark offers over 80 high-level operators that make it easy to build parallel apps. And you can use it interactively from the Scala, Python, R, and SQL shells.
Spark is also easier to use than Hadoop: the classic word-count example takes about 50 lines of MapReduce code in Hadoop but only a couple of lines in Spark. Because Spark keeps intermediate results in memory rather than writing them to disk between stages, it is also much faster and more efficient than Hadoop MapReduce.
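For illustration, a word count in PySpark can be written in a few lines. This is only a sketch: the input and output paths are placeholders, and `spark` is assumed to be an existing SparkSession (as in the pyspark shell).
text = spark.sparkContext.textFile("hdfs:///path/to/input.txt")
counts = (text.flatMap(lambda line: line.split())          # split lines into words
              .map(lambda word: (word, 1))                  # pair each word with a count of 1
              .reduceByKey(lambda a, b: a + b))             # sum the counts per word
counts.saveAsTextFile("hdfs:///path/to/output")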
Combine SQL, streaming, and complex analytics.
Advanced Analytics: Spark supports not only ‘map’ and ‘reduce’ operations; it also supports SQL queries, streaming data, machine learning (ML), and graph algorithms.
Spark powers a stack of libraries including SQL and DataFrames, MLlib for machine learning, GraphX, and Spark Streaming. You can combine these libraries seamlessly in the same application.
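As a sketch of how these libraries combine in one application (the data and names below are made up for illustration), a DataFrame can be queried with SQL and the result fed straight into MLlib:
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.appName("combined-demo").getOrCreate()

# SQL / DataFrames: build a small DataFrame and query it with SQL
df = spark.createDataFrame([(1.0, 2.0), (2.0, 4.1), (3.0, 6.2)], ["x", "y"])
df.createOrReplaceTempView("points")
train = spark.sql("SELECT x, y FROM points WHERE x > 0")

# MLlib: fit a linear regression on the SQL result in the same application
features = VectorAssembler(inputCols=["x"], outputCol="features").transform(train)
model = LinearRegression(featuresCol="features", labelCol="y").fit(features)
print(model.coefficients)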
Spark runs on Hadoop, Apache Mesos, Kubernetes, standalone, or in the cloud. It can access diverse data sources.
You can run Spark using its standalone cluster mode, on EC2, on Hadoop YARN, on Mesos, or on Kubernetes. Access data in HDFS, Alluxio, Apache Cassandra, Apache HBase, Apache Hive, and hundreds of other data sources.
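For example, once a SparkSession with Hive support is created, reading from HDFS and querying Hive use the same DataFrame API (the path and table name here are hypothetical):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sources-demo").enableHiveSupport().getOrCreate()

# a CSV file stored in HDFS (hypothetical path)
hdfs_df = spark.read.csv("hdfs:///data/sales.csv", header=True, inferSchema=True)

# an existing Hive table (hypothetical name)
hive_df = spark.sql("SELECT * FROM default.sales LIMIT 10")
hive_df.show()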
Spark batch: spark-submit <SparkApp.py>
SparkR: sparkR
PySpark: pyspark
Spark-shell: spark-shell
spark-submit
spark-submit \
--class <main-class> \
--master <master-url> \
--deploy-mode <deploy-mode> \
--conf <key>=<value> \
... # other options
<application-jar> \
[application-arguments]
PYSPARK_PYTHON=python3.6 spark-submit \
--master yarn \
--executor-memory 20G \
--num-executors 50 \
examples/src/main/python/pi.py \
1000
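The pi.py example above estimates π by Monte Carlo simulation; the following is a simplified sketch of the same computation (not the bundled script itself):
import random
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pi-sketch").getOrCreate()
partitions = 1000                  # the `1000` argument passed above
n = 100000 * partitions

def inside(_):
    # sample a point in the unit square; count it if it falls inside the quarter circle
    x, y = random.random(), random.random()
    return 1 if x * x + y * y <= 1 else 0

count = spark.sparkContext.parallelize(range(n), partitions).map(inside).reduce(lambda a, b: a + b)
print("Pi is roughly %f" % (4.0 * count / n))
spark.stop()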
Since version 1.4, Spark has also provided an experimental R API (only the DataFrame API is included).
To run Spark interactively in an R interpreter, use sparkR.
Example applications are also provided in R. For example,
spark-submit examples/src/main/r/dataframe.R
It is also possible to launch the PySpark shell. Set the PYSPARK_PYTHON
environment variable to select the appropriate Python interpreter when running the pyspark
command:
PYSPARK_PYTHON=python3.6 pyspark
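Inside the shell, a SparkContext (sc) and a SparkSession (spark) are already created, so you can start experimenting right away, for example:
rdd = sc.parallelize(range(10))
rdd.map(lambda x: x * x).sum()      # sum of squares 0..9 -> 285

spark.range(5).toDF("n").show()     # a small DataFrame with one column "n"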
You can use Spark as a Python module, but PySpark is not on sys.path by default.
That does not mean it cannot be used as a regular library:
you can either symlink pyspark into your site-packages or add pyspark to sys.path at runtime. findspark does the latter.
To initialize PySpark, call findspark.init() with the path to your Spark installation:
import findspark
findspark.init('/usr/lib/spark-current/')
# Then you could import the `pyspark` module
import pyspark
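After that, a SparkSession can be created as usual, e.g. a local session for quick tests:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local[*]") \
    .appName("findspark-demo") \
    .getOrCreate()

spark.range(3).show()
spark.stop()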