[Spark SQL] issue about difference in memory size between DataFrame and RDD

Lyx

Hello,

   I have been using Spark for my project recently, and I noticed that when I load data stored in Hadoop HDFS, the JVM memory footprint differs hugely between the DataFrame and RDD representations. Below is the script I ran in spark-shell. My original files (testData) are ordinary text files totaling about 11GB on disk; each line has the format "Id1,Id2", where Id1 and Id2 are random int32 numbers.

/* code segment

import java.io.DataOutputStream
import java.util
import org.apache.spark.sql.types._
import org.apache.spark.sql.{Dataset, Row, SparkSession}
import scala.collection.mutable.ArrayBuffer

// the text files total about 11GB on disk
val filePath = "hdfs://10.10.23.105:9000/testData"


val fields = Array.range(0, 2).map(i => StructField(s"col$i", IntegerType))
val schema: StructType = new StructType(fields)

val df: Dataset[Row] = spark.read.format("csv").schema(schema).load(filePath)

// first case: the cached DataFrame, which turns out to be 5.5GB in memory
df.cache()
df.count()

// second case: the RDD obtained from the DataFrame, which turns out to be 95GB in memory
df.rdd.cache()
df.rdd.count()

// third case: the plain text RDD, which turns out to be 88GB in memory
val pureRDD = spark.sparkContext.textFile(filePath)
pureRDD.cache()
pureRDD.count()

// the line below goes wrong when I call collect(), even though the driver has 200GB and the executor has 300GB of memory allocated
df.collect()

*/
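The in-memory sizes quoted in the comments above can also be read from within spark-shell rather than from the web UI. The following is only a sketch, assuming it is run in the same session after the three count() calls. The post does not say how the sizes were measured, and the helper name printCachedSizes is made up for illustration.

```scala
import org.apache.spark.SparkContext

// Hypothetical helper (not part of the original script): prints one line per
// RDD that currently has cached partitions, with its in-memory size in GB.
def printCachedSizes(sc: SparkContext): Unit = {
  // getRDDStorageInfo is a developer API; it returns only RDDs that are cached.
  sc.getRDDStorageInfo.foreach { info =>
    val memGB = info.memSize.toDouble / (1L << 30)
    println(f"id=${info.id}%d name=${info.name}%s cachedPartitions=${info.numCachedPartitions}%d mem=$memGB%.2f GB")
  }
}

printCachedSizes(spark.sparkContext)
```

Calling it once after each count() would show which of the three cached datasets accounts for which figure.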


  I encountered two problems here:

Q1: I loaded and cached the same raw files in three forms, as shown above: DataFrame, DataFrame.rdd, and RDD. I found that the DataFrame used just 5.5GB in the JVM, whereas df.rdd used nearly 95GB and the RDD used about 69GB. Why do the RDD and DataFrame.rdd take so much memory when the original files are comparatively small?
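One rough way to look at the per-record overhead behind Q1 is Spark's SizeEstimator, which estimates the JVM heap footprint of an object. The sketch below compares one line held as a String with the same two values held as primitive ints. The sample line is invented, the Int array is only a stand-in for a compact primitive encoding (it is not how the DataFrame cache actually lays out its columns), and the numbers depend on the JVM version and settings, so no specific output is claimed.

```scala
import org.apache.spark.util.SizeEstimator

// Made-up sample line in the "Id1,Id2" format described above.
val line = "1234567890,1987654321"

// Estimated heap size of the line as a java.lang.String (object header,
// backing array, and padding included).
val stringBytes = SizeEstimator.estimate(line)

// Estimated heap size of the same two values as a primitive Int array.
val ids = line.split(",").map(_.toInt)
val intArrayBytes = SizeEstimator.estimate(ids)

println(s"String: $stringBytes bytes, Array[Int]: $intArrayBytes bytes, raw text: ${line.length + 1} bytes")
```

Multiplying the per-record estimates by the number of lines in testData would give an order-of-magnitude comparison with the cached sizes reported above.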


Q2: I also noticed that df.collect() blocks indefinitely without any exception or further information, whereas RDD.collect() does not have this problem and returns the result successfully.

(P.S. My driver is allocated 200GB of JVM heap and the executor 300GB, which should be sufficient for such a collect action.)

   

   I would appreciate your attention and help.

   Best regards and thanks!


 



Department of Engineering Mechanics

Zhejiang University

Hangzhou 310027,  P.R. China

Mobile: (+86)15158859317