import org.apache.spark.sql.{Dataset, Row}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

// This text file's size is 11 GB on disk.
val filePath = "hdfs://10.10.23.105:9000/testData"
val fields = Array.range(0, 2).map(i => StructField(s"col$i", IntegerType))
val schema: StructType = new StructType(fields)
val df: Dataset[Row] = spark.read.format("csv").schema(schema).load(filePath)
// The first DataFrame, which turns out to be 5.5 GB in memory.
df.cache()
df.count()
// The second DataFrame, cached as an RDD, which turns out to be 95 GB in memory.
df.rdd.cache()
df.rdd.count()
// The third, a plain RDD, which turns out to be 88 GB in memory.
val pureRDD = spark.sparkContext.textFile(filePath)
pureRDD.cache()
pureRDD.count()
// The line below goes wrong when I use collect(), even though the driver has
// 200 GB and the executor has 300 GB of memory allocated.
df.collect()
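For reference, the cached sizes can also be read programmatically instead of from the Storage tab of the web UI; a minimal sketch (my own addition, using SparkContext.getRDDStorageInfo, assuming the cache()/count() calls above have already run):

// Print the in-memory size of every cached RDD/DataFrame in this application.
spark.sparkContext.getRDDStorageInfo.foreach { info =>
  val gb = info.memSize / (1024.0 * 1024 * 1024)
  println(f"${info.name}%s -> $gb%.1f GB in memory, ${info.numCachedPartitions}%d partitions cached")
}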
So here I encountered 2 problems:
Q1: I loaded and cached the very same raw file in three formats, as shown above: DataFrame, DataFrame.rdd, and RDD. I then found that the DataFrame used just 5.5 GB in the JVM, whereas df.rdd used nearly 95 GB and the plain RDD used about 69 GB. So I'm wondering: why do the RDD and DataFrame.rdd take so much memory, even though the original file is comparatively small on disk?
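One variation that might narrow Q1 down (a sketch only, I have not run it) is caching the same text file with a serialized storage level and comparing its reported size against the 88 GB deserialized cache:

import org.apache.spark.storage.StorageLevel

// Untested variation: cache the plain RDD in serialized form to see how much
// of the difference depends on the chosen storage level.
val serRDD = spark.sparkContext.textFile(filePath)
serRDD.persist(StorageLevel.MEMORY_ONLY_SER)
serRDD.count()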
Q2: I also noticed that when I call df.collect(), it blocks indefinitely without any exception or further information, while RDD.collect() doesn't have this problem and returns the result successfully.
(P.S. My driver is allocated 200 GB, along with a 300 GB executor in the JVM heap, which should be more than sufficient for such a collect action.)
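For Q2, a small diagnostic sketch (my own, not verified on the 11 GB file) that might show where the blocking happens: first collect a bounded subset, then stream the rows with toLocalIterator() so the driver never holds the whole result at once.

// Does a bounded collect return quickly?
val sample = df.limit(1000).collect()
println(s"sample rows collected: ${sample.length}")

// toLocalIterator() pulls one partition at a time to the driver, so peak
// driver memory is roughly one partition of rows rather than the full result.
val it = df.toLocalIterator()
var n = 0L
while (it.hasNext) { it.next(); n += 1 }
println(s"rows streamed via toLocalIterator: $n")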