Pyspark application hangs (no error messages) on Python RDD .map

Daniel Stojanov
Hi,

The code below hangs indefinitely at the last line (the .map()). Interestingly, if I run the same code at the beginning of my application (dropping the .write step), it executes as expected; it is only when the code appears further along in my application that it hangs. The debugging message "I saw a row" never appears in the executor's standard output.
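
For reference, the working variant looks roughly like this (map_to_keys is the function defined below; the count() is only there to force evaluation):

# At the start of the application, with no write/read round-trip,
# the same map runs fine and "I saw a row" is printed for every row.
rdd = df.rdd.map(map_to_keys)
rdd.count()  # any action will do; count() just materializes the RDD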

Note that the hang occurs when running on a YARN cluster, but not on a standalone cluster or in local mode. I have tried running with a single executor and one core per executor.
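
In case the configuration matters, the YARN runs use roughly the following session setup (the app name is a placeholder; spark.executor.instances and spark.executor.cores are the settings behind the "one executor, one core" above):

from pyspark.sql import SparkSession

# One executor with a single core, to rule out concurrency effects.
spark = (
    SparkSession.builder
    .appName("map-hang-repro")  # placeholder name
    .master("yarn")
    .config("spark.executor.instances", "1")
    .config("spark.executor.cores", "1")
    .getOrCreate()
)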

I have been working on this for a long time; any clues would be appreciated.

Regards,


def map_to_keys(row):
    # Debugging print; on YARN this never shows up in executor stdout.
    print("I saw a row", row["id"])
    return (hash(row["id"]), row)

# Round-trip the DataFrame through ORC on disk...
df.write.mode("overwrite").format("orc").save("/tmp/df_full")
df = spark.read.format("orc").load("/tmp/df_full")
# ...then map over the reloaded rows; this is where the application hangs.
rdd = df.rdd.map(map_to_keys)
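
Strictly speaking, .map() itself is lazy, so the hang surfaces once an action touches the mapped RDD; something as simple as the following reproduces it:

rdd.count()  # hangs on YARN; "I saw a row" never reaches executor stdout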