I'm currently evaluating Spark as a data pipeline solution. The PySpark API sounds especially exciting and I've been tinkering with it. But some of my data sets are 100 billion+ rows, so I need to make sure Spark can handle that scale.
I'm currently testing a join of two data sets, Base and Skewed. They're both 100 million rows and they look like the following.
I have two tests:
1. Join Base to itself, sum the "nums", and write the result to HDFS
2. Same as 1, except join Base to Skewed
(I realize the outputs are contrived and meaningless, but again, I'm testing limits here)
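For concreteness, here is the logic the two tests exercise, sketched in plain Python on toy data (no Spark involved; the (key, num) pair layout is an assumption about the row format):

```python
from collections import defaultdict

# Toy stand-ins for the two RDDs: (key, num) pairs.
base = [(1, 10), (2, 20), (3, 30)]
skewed = [(1, 1), (1, 2), (1, 3), (2, 4)]  # key 1 is "hot"

# Inner join on key, like rdd1.join(rdd2) in Spark.
right_index = defaultdict(list)
for k, v in skewed:
    right_index[k].append(v)

joined = [(k, (v, w)) for k, v in base for w in right_index.get(k, [])]

# Sum the nums per key, as a reduceByKey after the join would.
sums = defaultdict(int)
for k, (v, w) in joined:
    sums[k] += v + w

print(dict(sums))  # → {1: 36, 2: 24}
```

The relevant detail is that every row sharing a key ends up in the same reduce task, which is why a single hot key can concentrate work on one node.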
Test 1 works amazingly fast.
Test 2, however, runs well on all but one of the nodes in the cluster. That node quickly runs out of memory and dies. Each node has 10 GB of memory available to the Spark executor and the remaining ~60 GB available to Python — AFAIK more than enough to hold both data sets many times over.
See code below.
I'm assuming the ~50 million rows sharing the skewed key all pile up on that one particular node, which then runs out of memory while trying to merge them.
So is this normal? A known problem? If it is, what can I do to remedy the issue? Any further experiments I can run?
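In case it helps frame the question: the workaround I keep reading about for a hot key is "salting" — tag the hot key with a random suffix on the big side and replicate the matching rows on the small side once per suffix, so the key's rows spread over several shuffle partitions. A pure-Python sketch of just the key transformation (NUM_SALTS and the tuple-key layout are illustrative assumptions, not Spark API):

```python
import random

NUM_SALTS = 8  # how many partitions to spread the hot key over
random.seed(0)

# Big side: tag each hot-key row with a random salt 0..NUM_SALTS-1.
big = [(("hot", random.randrange(NUM_SALTS)), i) for i in range(10_000)]

# Small side: replicate its hot-key row once per salt so the join still matches.
small = [(("hot", s), "payload") for s in range(NUM_SALTS)]

buckets = {k for k, _ in big}
print(len(buckets))  # 8 distinct shuffle keys instead of 1
```

Is that the kind of thing people do here, or does Spark have a built-in way to handle this?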
Thanks for any time you can spare
import json  # assuming one JSON record per line

base = (sc.textFile("100million/part-*")
        .map(json.loads)           # parse so keyBy can index by "id"
        .keyBy(lambda x: x["id"]))