PySpark is the Python API for Spark, released by the Apache Spark community to support Python with Spark. Spark is a popular open-source framework for fast, distributed data processing and supports several languages, including Scala, Python, Java, and R. With PySpark, you can work with RDDs directly from Python.
class pyspark.SparkContext(master=None, appName=None, sparkHome=None, pyFiles=None, environment=None, batchSize=0, serializer=PickleSerializer(), conf=None, gateway=None, jsc=None, profiler_cls=<class 'pyspark.profiler.BasicProfiler'>):
Main entry point for Spark functionality. A SparkContext represents the connection to a Spark cluster, and can be used to create RDDs and broadcast variables on that cluster.
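As an illustration, a SparkContext is usually configured through a SparkConf object. This is a minimal local-mode sketch; the master URL and application name are placeholder values:
from pyspark import SparkConf, SparkContext

# Placeholder configuration: run locally using all available cores
conf = SparkConf().setMaster("local[*]").setAppName("sparkcontext-demo")
sc = SparkContext(conf=conf)

# Basic attributes of the running context
print(sc.master, sc.appName)

# Stop the context so it does not clash with the one created later in this section
sc.stop()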
class pyspark.sql.SparkSession(sparkContext, jsparkSession=None): The entry point to programming Spark with the Dataset and DataFrame API.
A SparkSession can be used to create DataFrames, register DataFrames as tables, execute SQL over tables, cache tables, and read Parquet files. To create a SparkSession, use the builder pattern shown below.
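A minimal sketch of the builder pattern (the local[*] master URL and the app name are placeholder values):
from pyspark.sql import SparkSession

# getOrCreate() returns the existing session if one is already active,
# otherwise it builds a new one with this configuration
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("pyspark-notes") \
    .getOrCreate()
Alternatively, a SparkSession can be wrapped around an existing SparkContext, as in the snippet below: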
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession

# Create entry points to Spark
try:
    # Stop the SparkContext if one is already running
    sc.stop()
except NameError:
    # No SparkContext has been created yet
    pass
finally:
    # Create a SparkContext and wrap it in a SparkSession
    sc = SparkContext()
    spark = SparkSession(sparkContext=sc)
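As a quick sanity check (assuming the session above was created successfully), you can print the Spark version and run a trivial DataFrame job:
# Version of the running Spark session
print(spark.version)

# A trivial job to confirm the session is usable
spark.range(3).show()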
RDD (Resilient Distributed Dataset): an RDD is a dataset divided into logical partitions, which may be computed on different nodes of the cluster.
parallelize(c, numSlices=None): Distribute a local Python collection to form an RDD.
collect(): retrieves all the elements of the RDD and returns them to the driver as a list.
x = [1, 2, 3, 4]

# parallelize() is a SparkContext method that creates an RDD from a local collection
rdd = sc.parallelize(x)
display(
    rdd,
    rdd.collect()
)
print(type(rdd))
import numpy as np
rdd1 = sc.parallelize(np.arange(0, 30, 2))
display(
rdd1.collect()
)
glom(): Return an RDD created by coalescing all elements within each partition into a list.
# Create an RDD with 5 partitions
# np.arange(0, 30, 2) gives the values 0, 2, ..., 28 (30 is excluded)
rdd2 = sc.parallelize(np.arange(0, 30, 2), 5)
display(
rdd2.collect(),
rdd2.glom().collect(),
)
We can see all the elements grouped into five partitions.
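To confirm the partition count without listing every element, getNumPartitions() can be used (a small addition to the example above):
# Number of partitions in rdd2 (5, as requested in parallelize)
print(rdd2.getNumPartitions())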
# Create an RDD with 2 partitions
rdd3 = sc.parallelize(np.arange(0, 30, 5), 2)
rdd3.glom().collect()
rdd4 = sc.parallelize(np.arange(0, 30), 2)
rdd4.glom().collect()
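When numSlices is omitted, parallelize falls back to the context's default parallelism. The sketch below is illustrative (rdd5 is a hypothetical name), and the actual value depends on your local or cluster configuration:
# Default number of partitions used when numSlices is not given
print(sc.defaultParallelism)

rdd5 = sc.parallelize(np.arange(0, 30))
print(rdd5.getNumPartitions())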