PySpark is the Python API for Spark, released by the Apache Spark community to support Python with Spark. Spark is a popular open-source framework for fast, distributed data processing and supports several languages, including Scala, Python, Java, and R. With PySpark, you can work with RDDs directly from Python.
class pyspark.SparkContext(master=None, appName=None, sparkHome=None, pyFiles=None, environment=None, batchSize=0, serializer=PickleSerializer(), conf=None, gateway=None, jsc=None, profiler_cls=<class 'pyspark.profiler.BasicProfiler'>):
Main entry point for Spark functionality. A SparkContext represents the connection to a Spark cluster, and can be used to create RDDs and broadcast variables on that cluster.
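As an illustration, a SparkContext is usually configured through a SparkConf object. This is a minimal local-mode sketch; the master URL and application name are placeholder values:
from pyspark import SparkConf, SparkContext

# Placeholder configuration: run locally using all available cores
conf = SparkConf().setMaster("local[*]").setAppName("sparkcontext-demo")
sc = SparkContext(conf=conf)

# Basic attributes of the running context
print(sc.master, sc.appName)

# Stop the context so it does not clash with the one created later in this section
sc.stop()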
class pyspark.sql.SparkSession(sparkContext, jsparkSession=None): The entry point to programming Spark with the Dataset and DataFrame API.
A SparkSession can be used to create DataFrames, register DataFrames as tables, execute SQL over tables, cache tables, and read Parquet files. To create a SparkSession, use the builder pattern shown below.
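A minimal sketch of the builder pattern (the local[*] master URL and the app name are placeholder values):
from pyspark.sql import SparkSession

# getOrCreate() returns the existing session if one is already active,
# otherwise it builds a new one with this configuration
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("pyspark-notes") \
    .getOrCreate()
Alternatively, a SparkSession can be wrapped around an existing SparkContext, as in the snippet below: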
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession

# Create entry points to Spark
try:
    # Stop the SparkContext if one is already running
    sc.stop()
except NameError:
    # No SparkContext has been created yet
    pass
finally:
    # Create a SparkContext and wrap it in a SparkSession
    sc = SparkContext()
    spark = SparkSession(sparkContext=sc)
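As a quick sanity check (assuming the session above was created successfully), you can print the Spark version and run a trivial DataFrame job:
# Version of the running Spark session
print(spark.version)

# A trivial job to confirm the session is usable
spark.range(3).show()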
RDD (Resilient Distributed Dataset): an RDD is a dataset divided into logical partitions, which may be computed on different nodes of the cluster.
parallelize(c, numSlices=None): Distribute a local Python collection to form an RDD.
collect(): retrieves all the elements of the RDD and returns them to the driver as a list.
x = [1, 2, 3, 4]

# parallelize() is a SparkContext method that creates an RDD from a local collection
rdd = sc.parallelize(x)
display(
    rdd,
    rdd.collect()
)
print(type(rdd))
import numpy as np
rdd1 = sc.parallelize(np.arange(0, 30, 2))
display(
rdd1.collect()
)
glom(): Return an RDD created by coalescing all elements within each partition into a list.
# Create an RDD with 5 partitions
# np.arange(0, 30, 2) gives the values 0, 2, ..., 28 (30 is excluded)
rdd2 = sc.parallelize(np.arange(0, 30, 2), 5)
display(
rdd2.collect(),
rdd2.glom().collect(),
)
We can see all the elements grouped into five partitions.
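To confirm the partition count without listing every element, getNumPartitions() can be used (a small addition to the example above):
# Number of partitions in rdd2 (5, as requested in parallelize)
print(rdd2.getNumPartitions())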
# Create an RDD with 2 partitions
rdd3 = sc.parallelize(np.arange(0, 30, 5), 2)
rdd3.glom().collect()
rdd4 = sc.parallelize(np.arange(0, 30), 2)
rdd4.glom().collect()
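When numSlices is omitted, parallelize falls back to the context's default parallelism. The sketch below is illustrative (rdd5 is a hypothetical name), and the actual value depends on your local or cluster configuration:
# Default number of partitions used when numSlices is not given
print(sc.defaultParallelism)

rdd5 = sc.parallelize(np.arange(0, 30))
print(rdd5.getNumPartitions())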