spark.python.worker.reuse not working as expected

df
Given this code block:

import os
from pyspark.sql import SparkSession

def return_pid(_):
    # Yield the PID of the Python worker process handling this partition.
    yield os.getpid()

spark = SparkSession.builder.getOrCreate()
pids = set(spark.sparkContext.range(32).mapPartitions(return_pid).collect())
print(pids)
pids = set(spark.sparkContext.range(32).mapPartitions(return_pid).collect())
print(pids)

I was expecting the same set of Python process IDs to be printed twice. Instead, two completely different sets of Python process IDs are printed.

spark.python.worker.reuse defaults to true, but this unexpected behavior still occurs even when spark.python.worker.reuse=true is set explicitly.
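
For reference, a minimal sketch of how the flag can be set explicitly when building the session (spark.python.worker.reuse is a documented Spark configuration key; this assumes no conflicting value in spark-defaults.conf or on the submit command line):

from pyspark.sql import SparkSession

# Explicitly request Python worker reuse; this is already the default.
# Note: this must be set before the SparkContext is created to take effect.
spark = (
    SparkSession.builder
    .config("spark.python.worker.reuse", "true")
    .getOrCreate()
)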