Hi,
This code will hang indefinitely at the last line (the .map()). Interestingly, if I run the same code at the beginning of my application (removing the .write step) it executes as expected. Otherwise, the code appears further along in my application which is where it hangs. The debugging message "I saw a row" never appears in the executor's standard output.
Note, this error occurs when running on a yarn cluster, but not on a standalone cluster or in local mode. I have tried running with num-cores=1 and 1 executor.
I have been working on this for a long time, any clues would be appreciated.
Regards,
def map_to_keys(row):
print("I saw a row", row["id"])
return (hash(row["id"]), row)
df.write.mode("overwrite").format("orc").save("/tmp/df_full")
df = spark.read.format("orc").load("/tmp/df_full")
rdd = df.rdd.map(map_to_keys)