Add column value in the dataset on the basis of a condition


Add column value in the dataset on the basis of a condition

Devender Yadav
Hi All,


Relevant bean class:

public class EmployeeBean implements Serializable {

    private Long id;

    private String name;

    private Long salary;

    private Integer age;

    // getters and setters

}


Relevant Spark code:

import static org.apache.spark.sql.functions.lit;

SparkSession spark = SparkSession.builder().master("local[2]").appName("play-with-spark").getOrCreate();
List<EmployeeBean> employees1 = populateEmployees(1, 10);

Dataset<EmployeeBean> ds1 = spark.createDataset(employees1, Encoders.bean(EmployeeBean.class));
ds1.show();
ds1.printSchema();

// Split on the nullability of age, tag each subset, then recombine.
Dataset<Row> ds2 = ds1.where("age is null").withColumn("is_age_null", lit(true));
Dataset<Row> ds3 = ds1.where("age is not null").withColumn("is_age_null", lit(false));

Dataset<Row> ds4 = ds2.union(ds3);
ds4.show();


Relevant Output:


ds1

+----+---+----+------+
| age| id|name|salary|
+----+---+----+------+
|null|  1|dev1| 11000|
|   2|  2|dev2| 12000|
|null|  3|dev3| 13000|
|   4|  4|dev4| 14000|
|null|  5|dev5| 15000|
+----+---+----+------+


ds4

+----+---+----+------+-----------+
| age| id|name|salary|is_age_null|
+----+---+----+------+-----------+
|null|  1|dev1| 11000|       true|
|null|  3|dev3| 13000|       true|
|null|  5|dev5| 15000|       true|
|   2|  2|dev2| 12000|      false|
|   4|  4|dev4| 14000|      false|
+----+---+----+------+-----------+


Is there a better way to add this column to the dataset than creating two datasets and performing a union?

<https://stackoverflow.com/questions/53834286/add-column-value-in-spark-dataset-on-the-basis-of-the-condition>



Regards,
Devender


Re: Add column value in the dataset on the basis of a condition

Shahab Yunus
Have you tried using withColumn? You can add a boolean column based on whether the age exists or not, and then drop the old age column. You wouldn't need a union of DataFrames then.
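
A minimal sketch of the boolean-column part of that suggestion, reusing the ds1 from the original mail (the variable name withFlag is only illustrative):

import static org.apache.spark.sql.functions.col;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Derive the flag directly from the age column in a single pass;
// isNull() returns a boolean Column, so no filter or union is needed.
Dataset<Row> withFlag = ds1.withColumn("is_age_null", col("age").isNull());
withFlag.show();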


Re: Add column value in the dataset on the basis of a condition

Shahab Yunus
Sorry Devender, I hit the send button too soon by mistake. I meant to add more info.

So what I was trying to say is that you can use withColumn with when/otherwise clauses to add a column conditionally. See an example here:
https://stackoverflow.com/questions/34908448/spark-add-column-to-dataframe-conditionally
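
A minimal sketch of the when/otherwise version, again against the ds1 from the first mail (the variable name tagged is only illustrative):

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.lit;
import static org.apache.spark.sql.functions.when;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// when(condition, value).otherwise(value) is evaluated per row,
// so the flag is computed in one projection instead of a filter plus union.
Dataset<Row> tagged = ds1.withColumn("is_age_null",
        when(col("age").isNull(), lit(true)).otherwise(lit(false)));
tagged.show();

This produces the same is_age_null column as the filter-plus-union version, but as a single pass over ds1.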


Re: Add column value in the dataset on the basis of a condition

Devender Yadav
Thanks, Yunus. It solved my problem.


Regards,
Devender