How to do efficient self join with Spark-SQL and Scala

classic Classic list List threaded Threaded
2 messages Options
Reply | Threaded
Open this post in threaded view
|

How to do efficient self join with Spark-SQL and Scala

Chetan Khatri
Dear Spark Users,

I came across little weird MSSQL Query to replace with Spark and I am like no clue how to do it in an efficient way with Scala + SparkSQL. Can someone please throw light. I can create view of DataFrame and do it as spark.sql (query) but I would like to do it with Scala + Spark way. 

Sample:

select a.student_id,a.candidate_id, a.student_name, a.student_standard, a.student_city, b.teacher_name, a.student_status ,
a.approval_id, case when a.approval_id = 2 and (a.academic_start_date is null
and not exists (
select student_id from tbl_student where candidate_id = c.candidate_id and approval_id = 2
and academic_start_date is null
)
) then 'Yes'
else 'No'
end as is_current
from tbl_student a inner join tbl_teacher b on a.candidate_id = b.candidate_id inner join tbl_network c
on c.candidate_id = a.candidate_id

Thank you.

Reply | Threaded
Open this post in threaded view
|

Re: How to do efficient self join with Spark-SQL and Scala

hemant singh
You can use spark dataframe 'when' 'otherwise' clause to replace SQL case statement.

This piece will be required to calculate before -
 
'select student_id from tbl_student where candidate_id = c.candidate_id and approval_id = 2
and academic_start_date is null'

Take the count of above DF after joining tbl_student and tbl_network DF's based on condition above.

Overall you can join all three tables first and rest of the query on the same dataframe.


On Sat, Sep 22, 2018 at 1:08 AM Chetan Khatri <[hidden email]> wrote:
Dear Spark Users,

I came across little weird MSSQL Query to replace with Spark and I am like no clue how to do it in an efficient way with Scala + SparkSQL. Can someone please throw light. I can create view of DataFrame and do it as spark.sql (query) but I would like to do it with Scala + Spark way. 

Sample:

select a.student_id,a.candidate_id, a.student_name, a.student_standard, a.student_city, b.teacher_name, a.student_status ,
a.approval_id, case when a.approval_id = 2 and (a.academic_start_date is null
and not exists (
select student_id from tbl_student where candidate_id = c.candidate_id and approval_id = 2
and academic_start_date is null
)
) then 'Yes'
else 'No'
end as is_current
from tbl_student a inner join tbl_teacher b on a.candidate_id = b.candidate_id inner join tbl_network c
on c.candidate_id = a.candidate_id

Thank you.