Feature Transformer VectorAssembler in PySpark ML Feature - Part 3

What is VectorAssembler?

class pyspark.ml.feature.VectorAssembler(inputCols=None, outputCol=None, handleInvalid='error')

VectorAssembler is a transformer that combines a given list of columns into a single vector column.

It is useful for combining raw features and features generated by different feature transformers into a single feature vector, in order to train ML models like logistic regression and decision trees.

VectorAssembler accepts the following input column types: all numeric types, boolean type, and vector type. In each row, the values of the input columns will be concatenated into a vector in the specified order.
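
The concatenation described above can be pictured with a small plain-Python sketch. This is not Spark's implementation, only an illustration of the ordering rule; the `assemble` helper and the column names `is_member` and `scores` are hypothetical and do not appear in the example that follows.

```python
from typing import Sequence, Union

Value = Union[int, float, bool, Sequence[float]]


def assemble(values: Sequence[Value]) -> list:
    """Concatenate scalars and vectors into one flat list of floats, in order."""
    out = []
    for v in values:
        if isinstance(v, (list, tuple)):
            # vector-typed input: every element gets its own slot
            out.extend(float(x) for x in v)
        else:
            # numeric or boolean input: a single slot (True -> 1.0, False -> 0.0)
            out.append(float(v))
    return out


# hypothetical row with a numeric, a boolean, and a vector column
row = {"age": 20, "is_member": True, "scores": [0.5, 1.5]}
input_cols = ["age", "is_member", "scores"]

features = assemble([row[c] for c in input_cols])
print(features)  # [20.0, 1.0, 0.5, 1.5]
```

Changing the order of `input_cols` changes the order of the slots in the result, which is why the order passed to inputCols matters.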

Note: VectorAssembler does not require StringIndexer or OneHotEncoder when all of your columns are already numeric. This example has string columns, so we use StringIndexer and OneHotEncoder to encode them first.

Let us look at an example.

Create SparkSession

In [1]:
#import SparkSession
from pyspark.sql import SparkSession

SparkSession is the entry point to Spark for working with RDDs, DataFrames, and Datasets. To create a SparkSession in Python, use the builder attribute and call its getOrCreate() method.

If a SparkSession already exists, getOrCreate() returns it; otherwise, it creates a new one.

In [2]:
spark = SparkSession.builder.appName('xvspark').getOrCreate()

Create dataframe by declaring the schema

In [3]:
from pyspark.sql.types import *

The StructType class defines the structure (schema) of the DataFrame.

In [4]:
#create the structure of schema
schema = StructType().add("id","integer").add("name","string").add("qualification","string").add("age", "integer").add("gender", "string").add("passed", "integer")
In [5]:
#create data
data = [
    (1,'John',"B.A.", 20, "Male", 1),
    (2,'Martha',"B.Com.", 20, "Female", 1),
    (3,'Mona',"B.Com.", 21, "Female", 1),
    (4,'Harish',"B.Sc.", 22, "Male", 1),
    (5,'Jonny',"B.A.", 22, "Male", 0),
    (6,'Maria',"B.A.", 23, "Female", 1),
    (7,'Monalisa',"B.A.", 21, "Female", 0)
]
In [6]:
#create dataframe
df = spark.createDataFrame(data, schema=schema)
In [7]:
#columns of dataframe
df.columns
['id', 'name', 'qualification', 'age', 'gender', 'passed']
In [8]:
df.show()
| id|    name|qualification|age|gender|passed|
|  1|    John|         B.A.| 20|  Male|     1|
|  2|  Martha|       B.Com.| 20|Female|     1|
|  3|    Mona|       B.Com.| 21|Female|     1|
|  4|  Harish|        B.Sc.| 22|  Male|     1|
|  5|   Jonny|         B.A.| 22|  Male|     0|
|  6|   Maria|         B.A.| 23|Female|     1|
|  7|Monalisa|         B.A.| 21|Female|     0|

Apply StringIndexer & OneHotEncoder to qualification and gender columns

In [9]:
#import required libraries
from pyspark.ml.feature import StringIndexer

Apply StringIndexer to qualification column

In [10]:
qualification_indexer = StringIndexer(inputCol="qualification", outputCol="qualificationIndex")

#Fit the indexer on the data, then add the qualificationIndex column
df = qualification_indexer.fit(df).transform(df)
df.show()
| id|    name|qualification|age|gender|passed|qualificationIndex|
|  1|    John|         B.A.| 20|  Male|     1|               0.0|
|  2|  Martha|       B.Com.| 20|Female|     1|               1.0|
|  3|    Mona|       B.Com.| 21|Female|     1|               1.0|
|  4|  Harish|        B.Sc.| 22|  Male|     1|               2.0|
|  5|   Jonny|         B.A.| 22|  Male|     0|               0.0|
|  6|   Maria|         B.A.| 23|Female|     1|               0.0|
|  7|Monalisa|         B.A.| 21|Female|     0|               0.0|
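
The values in qualificationIndex follow from how often each label occurs: by default, StringIndexer orders labels by descending frequency. The plain-Python sketch below reproduces that ordering for the qualification column. It is an illustration, not Spark's code, and the alphabetical tie-break used in the sort key is an assumption about recent Spark versions (this data has no ties, so it does not affect the result).

```python
from collections import Counter

# the qualification column from the example data, in row order
qualifications = ["B.A.", "B.Com.", "B.Com.", "B.Sc.", "B.A.", "B.A.", "B.A."]

counts = Counter(qualifications)

# most frequent label first; ties (none here) fall back to alphabetical order
ordered = sorted(counts, key=lambda label: (-counts[label], label))

# indices are doubles in Spark, so floats are used here as well
label_to_index = {label: float(i) for i, label in enumerate(ordered)}
print(label_to_index)  # {'B.A.': 0.0, 'B.Com.': 1.0, 'B.Sc.': 2.0}
```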

"B.A." gets index 0 because it is the most frequent label; "B.Com." gets index 1 and "B.Sc." gets index 2, in descending order of frequency.

Apply StringIndexer to gender column

In [11]:
gender_indexer = StringIndexer(inputCol="gender", outputCol="genderIndex")

#Fit the indexer on the data, then add the genderIndex column
df = gender_indexer.fit(df).transform(df)
df.show()
| id|    name|qualification|age|gender|passed|qualificationIndex|genderIndex|
|  1|    John|         B.A.| 20|  Male|     1|               0.0|        1.0|
|  2|  Martha|       B.Com.| 20|Female|     1|               1.0|        0.0|
|  3|    Mona|       B.Com.| 21|Female|     1|               1.0|        0.0|
|  4|  Harish|        B.Sc.| 22|  Male|     1|               2.0|        1.0|
|  5|   Jonny|         B.A.| 22|  Male|     0|               0.0|        1.0|
|  6|   Maria|         B.A.| 23|Female|     1|               0.0|        0.0|
|  7|Monalisa|         B.A.| 21|Female|     0|               0.0|        0.0|

Apply OneHotEncoder to qualificationIndex column

In [12]:
from pyspark.ml.feature import OneHotEncoder
In [13]:
#onehotencoder to qualificationIndex
onehotencoder_qualification_vector = OneHotEncoder(inputCol="qualificationIndex", outputCol="qualification_vec")
df = onehotencoder_qualification_vector.fit(df).transform(df)
In [14]:
df.show()
| id|    name|qualification|age|gender|passed|qualificationIndex|genderIndex|qualification_vec|
|  1|    John|         B.A.| 20|  Male|     1|               0.0|        1.0|    (2,[0],[1.0])|
|  2|  Martha|       B.Com.| 20|Female|     1|               1.0|        0.0|    (2,[1],[1.0])|
|  3|    Mona|       B.Com.| 21|Female|     1|               1.0|        0.0|    (2,[1],[1.0])|
|  4|  Harish|        B.Sc.| 22|  Male|     1|               2.0|        1.0|        (2,[],[])|
|  5|   Jonny|         B.A.| 22|  Male|     0|               0.0|        1.0|    (2,[0],[1.0])|
|  6|   Maria|         B.A.| 23|Female|     1|               0.0|        0.0|    (2,[0],[1.0])|
|  7|Monalisa|         B.A.| 21|Female|     0|               0.0|        0.0|    (2,[0],[1.0])|
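
The entries in qualification_vec are sparse vectors written as (size, [indices], [values]). With three categories the vector has only two slots because OneHotEncoder drops the last category by default (dropLast=True), so index 2.0 ("B.Sc.") becomes the all-zeros vector (2,[],[]). A minimal plain-Python sketch of that rule, using a hypothetical helper named `one_hot_drop_last`:

```python
def one_hot_drop_last(index: float, num_categories: int) -> tuple:
    """Return a (size, indices, values) triple like the ones Spark prints.

    The last category is dropped, so it is encoded as an all-zeros vector.
    """
    size = num_categories - 1
    i = int(index)
    if i < size:
        return (size, [i], [1.0])
    return (size, [], [])  # last category: no active slot


for idx in [0.0, 1.0, 2.0]:
    print(idx, one_hot_drop_last(idx, num_categories=3))
# 0.0 (2, [0], [1.0])
# 1.0 (2, [1], [1.0])
# 2.0 (2, [], [])
```

Dropping one category keeps the encoded columns from summing to one in every row, which matters for models such as logistic regression.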

Apply OneHotEncoder to genderIndex column

In [15]:
#onehotencoder to genderIndex
onehotencoder_gender_vector = OneHotEncoder(inputCol="genderIndex", outputCol="gender_vec")
df = onehotencoder_gender_vector.fit(df).transform(df)
In [16]:
df.show()
| id|    name|qualification|age|gender|passed|qualificationIndex|genderIndex|qualification_vec|   gender_vec|
|  1|    John|         B.A.| 20|  Male|     1|               0.0|        1.0|    (2,[0],[1.0])|    (1,[],[])|
|  2|  Martha|       B.Com.| 20|Female|     1|               1.0|        0.0|    (2,[1],[1.0])|(1,[0],[1.0])|
|  3|    Mona|       B.Com.| 21|Female|     1|               1.0|        0.0|    (2,[1],[1.0])|(1,[0],[1.0])|
|  4|  Harish|        B.Sc.| 22|  Male|     1|               2.0|        1.0|        (2,[],[])|    (1,[],[])|
|  5|   Jonny|         B.A.| 22|  Male|     0|               0.0|        1.0|    (2,[0],[1.0])|    (1,[],[])|
|  6|   Maria|         B.A.| 23|Female|     1|               0.0|        0.0|    (2,[0],[1.0])|(1,[0],[1.0])|
|  7|Monalisa|         B.A.| 21|Female|     0|               0.0|        0.0|    (2,[0],[1.0])|(1,[0],[1.0])|

Feature transformer - VectorAssembler

We want to combine age, qualification_vec, and gender_vec into a single feature vector called features and use it to predict whether a student passed.

To do this, we set VectorAssembler's input columns to age, qualification_vec, and gender_vec, and its output column to features.

In [17]:
from pyspark.ml.feature import VectorAssembler
In [18]:
#dataframe columns
df.columns
In [19]:
inputCols = ["age", "qualification_vec", "gender_vec"]
In [20]:
outputCol = "features"
In [21]:
df_va = VectorAssembler(inputCols = inputCols, outputCol = outputCol)
In [22]:
df = df_va.transform(df)
In [23]:
0 [20.0, 1.0, 0.0, 0.0]
1 [20.0, 0.0, 1.0, 1.0]
2 [21.0, 0.0, 1.0, 1.0]
3 (22.0, 0.0, 0.0, 0.0)
4 [22.0, 1.0, 0.0, 0.0]
In [24]:
new_df = df.select(['features','passed'])
new_df.show()
|          features|passed|
|[20.0,1.0,0.0,0.0]|     1|
|[20.0,0.0,1.0,1.0]|     1|
|[21.0,0.0,1.0,1.0]|     1|
|    (4,[0],[22.0])|     1|
|[22.0,1.0,0.0,0.0]|     0|
|[23.0,1.0,0.0,1.0]|     1|
|[21.0,1.0,0.0,1.0]|     0|
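
One row in the features column is printed as (4,[0],[22.0]) rather than as a bracketed list. That is the same kind of vector in sparse form: size 4, with the value 22.0 at index 0 and zeros elsewhere. The assembled vector is stored in whichever form is more compact. The plain-Python sketch below expands the sparse form and applies a size rule that, as far as I know, mirrors the heuristic Spark uses; treat the exact threshold as an assumption. Both helper names are hypothetical.

```python
def to_dense(size: int, indices: list, values: list) -> list:
    """Expand a (size, indices, values) sparse triple into a dense list."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense


def prefers_sparse(dense: list) -> bool:
    """Assumed rule: go sparse when 1.5 * (nonzeros + 1) < size."""
    nnz = sum(1 for x in dense if x != 0.0)
    return 1.5 * (nnz + 1.0) < len(dense)


print(to_dense(4, [0], [22.0]))                # [22.0, 0.0, 0.0, 0.0]
print(prefers_sparse([22.0, 0.0, 0.0, 0.0]))   # True  -> shown as (4,[0],[22.0])
print(prefers_sparse([20.0, 1.0, 0.0, 0.0]))   # False -> shown as [20.0,1.0,0.0,0.0]
```

Both forms carry the same numbers, so downstream estimators treat them identically.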

Using Pipeline

In [25]:
#import module
from pyspark.ml import Pipeline

Reload Data

In [26]:
#create the structure of schema
schema = StructType().add("id","integer").add("name","string").add("qualification","string").add("age", "integer").add("gender", "string").add("passed", "integer")
In [27]:
#create data
data = [
    (1,'John',"B.A.", 20, "Male", 1),
    (2,'Martha',"B.Com.", 20, "Female", 1),
    (3,'Mona',"B.Com.", 21, "Female", 1),
    (4,'Harish',"B.Sc.", 22, "Male", 1),
    (5,'Jonny',"B.A.", 22, "Male", 0),
    (6,'Maria',"B.A.", 23, "Female", 1),
    (7,'Monalisa',"B.A.", 21, "Female", 0)
]
In [28]:
df = spark.createDataFrame(data, schema=schema)
df.show()
| id|    name|qualification|age|gender|passed|
|  1|    John|         B.A.| 20|  Male|     1|
|  2|  Martha|       B.Com.| 20|Female|     1|
|  3|    Mona|       B.Com.| 21|Female|     1|
|  4|  Harish|        B.Sc.| 22|  Male|     1|
|  5|   Jonny|         B.A.| 22|  Male|     0|
|  6|   Maria|         B.A.| 23|Female|     1|
|  7|Monalisa|         B.A.| 21|Female|     0|

Create Pipeline and pass all stages

In [29]:
#Convert qualification and gender columns to numeric
qualification_indexer = StringIndexer(inputCol="qualification", outputCol="qualificationIndex")
gender_indexer = StringIndexer(inputCol="gender", outputCol="genderIndex")

#Convert qualificationIndex and genderIndex to one-hot encoded vectors
onehot_encoder = OneHotEncoder(inputCols=["qualificationIndex", "genderIndex"],
                        outputCols=["qualification_vec", "gender_vec"])

#Merge multiple columns into a vector column
vector_assembler = VectorAssembler(inputCols=['age', 'qualification_vec', 'gender_vec'],
                        outputCol='features')

#Create the pipeline and pass all stages to it, in execution order
pipeline = Pipeline(stages=[qualification_indexer,
                            gender_indexer,
                            onehot_encoder,
                            vector_assembler])

#fit and transform
df_transformed = pipeline.fit(df).transform(df)
df_transformed.show()
| id|    name|qualification|age|gender|passed|qualificationIndex|genderIndex|qualification_vec|   gender_vec|          features|
|  1|    John|         B.A.| 20|  Male|     1|               0.0|        1.0|    (2,[0],[1.0])|    (1,[],[])|[20.0,1.0,0.0,0.0]|
|  2|  Martha|       B.Com.| 20|Female|     1|               1.0|        0.0|    (2,[1],[1.0])|(1,[0],[1.0])|[20.0,0.0,1.0,1.0]|
|  3|    Mona|       B.Com.| 21|Female|     1|               1.0|        0.0|    (2,[1],[1.0])|(1,[0],[1.0])|[21.0,0.0,1.0,1.0]|
|  4|  Harish|        B.Sc.| 22|  Male|     1|               2.0|        1.0|        (2,[],[])|    (1,[],[])|    (4,[0],[22.0])|
|  5|   Jonny|         B.A.| 22|  Male|     0|               0.0|        1.0|    (2,[0],[1.0])|    (1,[],[])|[22.0,1.0,0.0,0.0]|
|  6|   Maria|         B.A.| 23|Female|     1|               0.0|        0.0|    (2,[0],[1.0])|(1,[0],[1.0])|[23.0,1.0,0.0,1.0]|
|  7|Monalisa|         B.A.| 21|Female|     0|               0.0|        0.0|    (2,[0],[1.0])|(1,[0],[1.0])|[21.0,1.0,0.0,1.0]|

In [30]:
df_transformed = df_transformed.select(['features','passed'])
df_transformed.show()
|          features|passed|
|[20.0,1.0,0.0,0.0]|     1|
|[20.0,0.0,1.0,1.0]|     1|
|[21.0,0.0,1.0,1.0]|     1|
|    (4,[0],[22.0])|     1|
|[22.0,1.0,0.0,0.0]|     0|
|[23.0,1.0,0.0,1.0]|     1|
|[21.0,1.0,0.0,1.0]|     0|

You can also convert the result to a pandas DataFrame.

In [31]:
df_transformed.toPandas()
features passed
0 [20.0, 1.0, 0.0, 0.0] 1
1 [20.0, 0.0, 1.0, 1.0] 1
2 [21.0, 0.0, 1.0, 1.0] 1
3 (22.0, 0.0, 0.0, 0.0) 1
4 [22.0, 1.0, 0.0, 0.0] 0
5 [23.0, 1.0, 0.0, 1.0] 1
6 [21.0, 1.0, 0.0, 1.0] 0