Spark Inner Join on pivoted datasets results empty dataset

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
2 messages Options
Reply | Threaded
Open this post in threaded view
|

Spark Inner Join on pivoted datasets results empty dataset

Anil Langote
Hi All, 

I have a requirement to pivot multiple columns using single columns, the pivot API doesn't support doing that hence I have been doing pivot for two columns and then trying to merge the dataset the result is producing empty dataset. Below is the sudo code

Main dataset => 33 columns (30 columns are string and 2 columns are of type double array lets say vector1 and vector2, 1 column Decider which has 0 & 1 values)

String grouByColumns =  "col1,col2,col3,col4,col5,col6.......col30";
Vector columns : Vector1 and Vector2

i do pivot like below

List< Object > values = new ArrayList<Object>();
values.add("0");
values.add("1")

Dataset<Row> pivot1 = mainDataset.grouBy(grouByColumns).pivot("Decider",values).agg(functions.callUDF(CUSTOM_UDAF,mainDataset.col("Vector1"));
pivot1 = pivot1.withColumRenamed("0","Vector1_0");
pivot1 = pivot1.withColumRenamed("1","Vector1_1");

Count on pivot1 = 12856

Dataset<Row> pivot2 = mainDataset.grouBy(grouByColumns).pivot("Decider",values).agg(functions.callUDF(CUSTOM_UDAF,mainDataset.col("Vector2"));
pivot2 = pivot2.withColumRenamed("0","Vector2_0");
pivot2 = pivot2.withColumRenamed("1","Vector2_1");

Count on pivot2 = 12856
 
Dataset<Row> finalDataset = pivot1.join(pivot2,Seq<grouByColumns >);

Count on pivot1 = 0 ? Why this sould be 12856  right?

The same code works on local with less columns and 100 records. 

Is there anything i am missing here is there any better way to pivot the multiple columns i can not do combine because my aggregation columns are array of doubles.

The pivot1 & pivot2 dataset derived by same parent dataset the group by columns are same all i am doing is inner join on these two dataset with same group by columns why it doesn't work?

Thank you
Anil Langote
Reply | Threaded
Open this post in threaded view
|

Re: Spark Inner Join on pivoted datasets results empty dataset

Anil Langote
Is there any limit on number of columns used in inner join ?

Thank you
Anil Langote

Sent from my iPhone
_____________________________
From: Anil Langote <[hidden email]>
Sent: Thursday, October 19, 2017 5:01 PM
Subject: Spark Inner Join on pivoted datasets results empty dataset
To: user <[hidden email]>


Hi All, 

I have a requirement to pivot multiple columns using single columns, the pivot API doesn't support doing that hence I have been doing pivot for two columns and then trying to merge the dataset the result is producing empty dataset. Below is the sudo code

Main dataset => 33 columns (30 columns are string and 2 columns are of type double array lets say vector1 and vector2, 1 column Decider which has 0 & 1 values)

String grouByColumns =  "col1,col2,col3,col4,col5,col6.......col30";
Vector columns : Vector1 and Vector2

i do pivot like below

List< Object > values = new ArrayList<Object>();
values.add("0");
values.add("1")

Dataset<Row> pivot1 = mainDataset.grouBy(grouByColumns).pivot("Decider",values).agg(functions.callUDF(CUSTOM_UDAF,mainDataset.col("Vector1"));
pivot1 = pivot1.withColumRenamed("0","Vector1_0");
pivot1 = pivot1.withColumRenamed("1","Vector1_1");

Count on pivot1 = 12856

Dataset<Row> pivot2 = mainDataset.grouBy(grouByColumns).pivot("Decider",values).agg(functions.callUDF(CUSTOM_UDAF,mainDataset.col("Vector2"));
pivot2 = pivot2.withColumRenamed("0","Vector2_0");
pivot2 = pivot2.withColumRenamed("1","Vector2_1");

Count on pivot2 = 12856
 
Dataset<Row> finalDataset = pivot1.join(pivot2,Seq<grouByColumns >);

Count on pivot1 = 0 ? Why this sould be 12856  right?

The same code works on local with less columns and 100 records. 

Is there anything i am missing here is there any better way to pivot the multiple columns i can not do combine because my aggregation columns are array of doubles.

The pivot1 & pivot2 dataset derived by same parent dataset the group by columns are same all i am doing is inner join on these two dataset with same group by columns why it doesn't work?

Thank you
Anil Langote