Logistic regression is a popular machine learning classification algorithm for predicting a categorical response.
In statistics, the logistic model is used to model the probability of a certain class or event occurring, such as yes/no, pass/fail, win/lose, alive/dead, or healthy/sick.
The resulting binary variable contains data coded as 1 (yes, pass, win, alive, etc.) or 0 (no, fail, lose, dead, etc.).
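Under the hood, logistic regression maps a linear combination of the input features through the logistic (sigmoid) function, which squeezes any real number into the range (0, 1) so it can be read as a probability:
p(y = 1 | x) = 1 / (1 + e^-(b0 + b1*x1 + ... + bn*xn))
An observation is then assigned class 1 when this probability exceeds a threshold (0.5 by default).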
We will work with the Breast Cancer Data Set, one of three domains provided by the Oncology Institute that have repeatedly appeared in the machine learning literature. If you want to follow along, you can download the data set from the UCI Machine Learning Repository.
This data set includes 201 instances of one class and 85 instances of another class. The instances are described by 9 attributes, some of which are linear and some are nominal.
Attribute Information:
- Class: no-recurrence-events, recurrence-events
- age: 10-19, 20-29, 30-39, 40-49, 50-59, 60-69, 70-79, 80-89, 90-99.
- menopause: lt40, ge40, premeno.
- tumor-size: 0-4, 5-9, 10-14, 15-19, 20-24, 25-29, 30-34, 35-39, 40-44, 45-49, 50-54, 55-59.
- inv-nodes: 0-2, 3-5, 6-8, 9-11, 12-14, 15-17, 18-20, 21-23, 24-26, 27-29, 30-32, 33-35, 36-39.
- node-caps: yes, no.
- deg-malig: 1, 2, 3.
- breast: left, right.
- breast-quad: left-up, left-low, right-up, right-low, central.
- irradiat: yes, no.
#import SparkSession
from pyspark.sql import SparkSession
SparkSession is the entry point to Spark for working with RDDs, DataFrames, and Datasets. To create a SparkSession in Python, we use the builder attribute and then call the getOrCreate() method.
If a SparkSession already exists, getOrCreate() returns it; otherwise it creates a new one.
spark = SparkSession.builder.appName('regression').getOrCreate()
#read the dataset
df = spark.read.csv('input/breast-cancer.csv', inferSchema=True, header=True)
#view five records
df.show(5)
#print dataframe columns and count
print(df.columns)
print(df.count())
df.printSchema()
from pyspark.sql.functions import isnan, when, count, col
df.filter(df['age'].isNull()).show()
We can see there are no null values in the 'age' column.
df.select([count(when(isnan(c), c)).alias(c) for c in df.columns]).show()
We can see there are no null values in any column.
df.na.drop().show(5)
As there were no missing values, the number of records remains the same.
print(df.count())
We are going to use StringIndexer and OneHotEncoder from PySpark ML features to convert the string columns to numeric.
If you want to know more about them, you can go through my previous articles:
Role of StringIndexer and Pipelines in PySpark ML Feature - Part 1
Role of OneHotEncoder and Pipelines in PySpark ML Feature — Part 2
#import libraries
from pyspark.ml.feature import StringIndexer, OneHotEncoder
Let us check the 'class' column:
df.groupBy('class').count().show()
'class' is the target column; it has two distinct values, 'no-recurrence-events' and 'recurrence-events', which we are going to convert to numeric.
As there will be only two values, 0 and 1, after the conversion, we will not apply one-hot encoding to the label.
class_indexer = StringIndexer(inputCol="class", outputCol="label")
#Fit and transform the dataframe
df = class_indexer.fit(df).transform(df)
df.show(5)
Now, you can see a new column named 'label'. You can print just the 'class' and 'label' columns to see the transformation.
df.select(['class', 'label']).show(5)
Create a function to transform string columns to numeric
We have a lot of string columns, so we will create a function that converts a string column to numeric.
We will use a pattern for StringIndexer output columns: input column name + '-index'. For example:
'age' -> 'age-index'.
This output column is then passed to OneHotEncoder. As above, we will use a pattern for the OneHotEncoder output columns: input column name + '-vector'. Thus we finally get, for example,
'age' -> 'age-index' -> 'age-vector'.
def transformColumnsToNumeric(df, inputCol):
    #apply StringIndexer to inputCol and add an '-index' column
    inputCol_indexer = StringIndexer(inputCol=inputCol, outputCol=inputCol + "-index").fit(df)
    df = inputCol_indexer.transform(df)
    #one-hot encode the indexed column into a '-vector' column
    onehotencoder_vector = OneHotEncoder(inputCol=inputCol + "-index", outputCol=inputCol + "-vector")
    df = onehotencoder_vector.fit(df).transform(df)
    return df
df = transformColumnsToNumeric(df, "age")
df = transformColumnsToNumeric(df, "menopause")
df = transformColumnsToNumeric(df, "tumor-size")
df = transformColumnsToNumeric(df, "inv-nodes")
df = transformColumnsToNumeric(df, "node-caps")
df = transformColumnsToNumeric(df, "breast")
df = transformColumnsToNumeric(df, "breast-quad")
df = transformColumnsToNumeric(df, "irradiat")
df.show(5)
You can put all these in a Pipeline, but I wanted to keep it as simple as possible.
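For reference, here is a minimal sketch of how the same stages could be chained in a single Pipeline (the stages list and the name 'stringCols' are mine, not part of the walkthrough; the column naming follows the same patterns as above):
from pyspark.ml import Pipeline
#string columns to index and one-hot encode
stringCols = ['age', 'menopause', 'tumor-size', 'inv-nodes',
              'node-caps', 'breast', 'breast-quad', 'irradiat']
indexers = [StringIndexer(inputCol=c, outputCol=c + '-index') for c in stringCols]
encoders = [OneHotEncoder(inputCol=c + '-index', outputCol=c + '-vector') for c in stringCols]
#fit all stages in order and transform the dataframe in one call
df = Pipeline(stages=indexers + encoders).fit(df).transform(df)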
Feature transformer - VectorAssembler
If you are new to VectorAssembler, you may read my article
Feature Transformer VectorAssembler in PySpark ML Feature — Part 3
from pyspark.ml.feature import VectorAssembler
Let us list the columns
df.columns
We select as inputCols only the columns that we need to feed to our Spark ML model. Let us define a VectorAssembler:
inputCols=[
'deg-malig',
'age-vector',
'menopause-vector',
'tumor-size-vector',
'inv-nodes-vector',
'node-caps-vector',
'breast-vector',
'breast-quad-vector',
'irradiat-vector']
df_va = VectorAssembler(inputCols = inputCols, outputCol="features")
Now we can transform the dataset with our VectorAssembler. It will add the output column 'features', as we specified in the VectorAssembler.
df = df_va.transform(df)
Let us check the input and output columns:
df.select(inputCols + ["features"] ).show(5)
We need only the 'features' and 'label' columns for our model: the data from the other columns has been merged into the 'features' column, and 'label' is our target. Let us list a few records, passing False for the truncate argument so the feature vectors are not cut off.
df.select(['features','label']).show(10,False)
So, finally, we create a new dataset with just these two columns and view the label-wise counts.
df_transformed = df.select(['features','label'])
df_transformed.show(5)
df_transformed.groupBy('label').count().show()
#split the data
train_df, test_df = df_transformed.randomSplit([0.75,0.25])
We have split the data in a 75:25 ratio.
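Note that randomSplit is random, so the exact counts will vary from run to run; if you need a reproducible split, you can pass a seed (the value 42 below is an arbitrary choice):
#reproducible 75/25 split
train_df, test_df = df_transformed.randomSplit([0.75, 0.25], seed=42)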
train_df.count()
test_df.count()
train_df.groupBy('label').count().show()
test_df.groupBy('label').count().show()
We are going to use the LogisticRegression model, as we have a binary classification problem with only two possible label values: 0 and 1.
from pyspark.ml.classification import LogisticRegression
model = LogisticRegression(labelCol='label')
model
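We stick to the defaults here, but LogisticRegression exposes hyperparameters you may want to tune; the sketch below just spells a few of them out explicitly with their default values:
#explicit defaults: up to 100 iterations, no regularization
model = LogisticRegression(labelCol='label', featuresCol='features',
                           maxIter=100, regParam=0.0, elasticNetParam=0.0)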
Now, it is time to train our model:
trained_model = model.fit(train_df)
Let us get some predictions with our trained model. We will use the training data first.
train_predictions = trained_model.evaluate(train_df).predictions
train_predictions.show(5)
Using multiple filters, we can derive various counts. Let us first count the 0s and 1s in the labels:
train_df_count_1 = train_df.filter(train_df['label'] == 1).count()
train_df_count_0 = train_df.filter(train_df['label'] == 0).count()
train_df_count_1, train_df_count_0
tp = train_predictions.filter(
    train_predictions['label'] == 1).filter(
    train_predictions['prediction'] == 1).select(
    ['label','prediction','probability'])
print("True positives: ", tp.count())
#true positives over all actual positives is recall, not overall accuracy
recall = tp.count() / train_df_count_1
print(f"Recall: {recall}\n")
tp.show(5,False)
fp = train_predictions.filter(
train_predictions['label'] == 0).filter(
train_predictions['prediction'] == 1).select(
['label','prediction','probability'])
print("False positive: ", fp.count())
fp.show(5,False)
fn = train_predictions.filter(
train_predictions['label'] == 1).filter(
train_predictions['prediction'] == 0).select(
['label','prediction','probability'])
print("False negative: ", fn.count())
fn.show(5,False)
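Instead of filtering combination by combination, you can also read off the whole confusion matrix with a single groupBy; this is just an alternative view of the same counts:
#count every label/prediction pair at once
train_predictions.groupBy('label', 'prediction').count().show()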
test_predictions = trained_model.evaluate(test_df).predictions
test_predictions.show(5, False)
Using multiple filters, we can derive the same counts for the test set. Let us first count the 0s and 1s in the labels:
test_df_count_1 = test_df.filter(test_df['label'] == 1).count()
test_df_count_0 = test_df.filter(test_df['label'] == 0).count()
test_df_count_1, test_df_count_0
tp = test_predictions.filter(
    test_predictions['label'] == 1).filter(
    test_predictions['prediction'] == 1).select(
    ['label','prediction','probability'])
print("True positives: ", tp.count())
#recall on the test set: true positives over all actual positives
recall = tp.count() / test_df_count_1
print(f"Recall: {recall}\n")
tp.show(5,False)
fp = test_predictions.filter(
test_predictions['label'] == 0).filter(
test_predictions['prediction'] == 1).select(
['label','prediction','probability'])
print("False positive: ", fp.count())
fp.show(5,False)
fn = test_predictions.filter(
test_predictions['label'] == 1).filter(
test_predictions['prediction'] == 0).select(
['label','prediction','probability'])
print("False negative: ", fn.count())
fn.show(5,False)
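As a cross-check on these hand-computed counts, PySpark's built-in evaluators can report standard metrics directly; here is a minimal sketch on the test predictions:
from pyspark.ml.evaluation import BinaryClassificationEvaluator, MulticlassClassificationEvaluator
#area under the ROC curve, computed from the default rawPrediction column
auc = BinaryClassificationEvaluator(labelCol='label').evaluate(test_predictions)
#overall accuracy: correct predictions of both classes over all records
acc = MulticlassClassificationEvaluator(labelCol='label', predictionCol='prediction',
                                        metricName='accuracy').evaluate(test_predictions)
print(f"Test AUC: {auc}, test accuracy: {acc}")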