Question regarding kryo and java encoders in datasets

classic Classic list List threaded Threaded
1 message Options
Reply | Threaded
Open this post in threaded view
|

Question regarding kryo and java encoders in datasets

Devender Yadav
Hi All,



Good day!


I am using spark 2.4 and referring https://spark.apache.org/docs/latest/rdd-programming-guide.html#rdd-persistence

Bean class:

public class EmployeeBean implements Serializable {

    private Long id;
    private String name;
    private Long salary;
    private Integer age;

    // getters and setters

}


Spark Example:

    SparkSession spark = SparkSession.builder().master("local[4]").appName("play-with-spark").getOrCreate();

    List<EmployeeBean> employees1 = populateEmployees(1, 1_000_000);

    Dataset<EmployeeBean> ds1 = spark.createDataset(employees1, Encoders.kryo(EmployeeBean.class));
    Dataset<EmployeeBean> ds2 = spark.createDataset(employees1, Encoders.bean(EmployeeBean.class));

    ds1.persist(StorageLevel.MEMORY_ONLY());
    long ds1Count = ds1.count();

    ds2.persist(StorageLevel.MEMORY_ONLY());
    long ds2Count = ds2.count();


I looked for storage in spark Web UI. Useful part -

ID  RDD Name                                           Size in Memory
2   LocalTableScan [value#0]                           56.5 MB
13  LocalTableScan [age#6, id#7L, name#8, salary#9L]   23.3 MB


Few questions:

  *   Shouldn't size of kryo serialized RDD be less than java serialized RDD instead of more than double size?

  *   I also tried MEMORY_ONLY_SER() mode and RDD size is the same. RDD as serialized Java objects should be stored as one byte array per partition. Shouldn't the size of persisted RDDs be less than deserialized ones?

  *   What exactly is adding kyro and bean encoders are doing while creating Dataset?

  *   Can I rename persisted RDDs for better readability?



Regards,
Devender

________________________________






NOTE: This message may contain information that is confidential, proprietary, privileged or otherwise protected by law. The message is intended solely for the named addressee. If received in error, please destroy and notify the sender. Any use of this email is prohibited when received in error. Impetus does not represent, warrant and/or guarantee, that the integrity of this communication has been maintained nor that the communication is free of errors, virus, interception or interference.


---------------------------------------------------------------------
To unsubscribe e-mail: [hidden email]

winmail.dat (21K) Download Attachment