Filter one dataset based on values from another


Filter one dataset based on values from another

lsn24
Hi,
  I have one dataset of parameters and another dataset of data that needs to
be filtered based on the first (the parameter dataset).

*Scenario is as follows:*

    For each row in the parameter dataset, I need to apply that row as a
filter to the second dataset, so I will end up with multiple filtered
datasets. On each of those filtered datasets I need to run a bunch of
calculations.

How can I achieve this in Spark?

*Pseudo code for better understanding:*

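import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;

// sql() returns Dataset<Row>; .as(...) assumes Parameter and myData are Java beans.
Dataset<Parameter> paramsDataset = sparkSession.sql("select * from paramsview")
    .as(Encoders.bean(Parameter.class));
Dataset<myData> myDataset = sparkSession.sql("select * from tempview")
    .as(Encoders.bean(myData.class));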


Question: For each row in paramsDataset, I need to filter myDataset and run
some calculations on the result. Is it possible to do that? If not, what's the
best way to solve it?

Thanks




--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: [hidden email]

Re: Filter one dataset based on values from another

JayeshLalwani
Which columns do you want to filter myDataset on? What are the corresponding columns in paramsDataset?

You can easily do what you want using an inner join. For example, if tempview and paramsview both have a column, say employeeId, you can do this with SQL:

sparkSession.sql("select * from tempview inner join paramsview on tempview.employeeId = paramsview.employeeId")

Re: Filter one dataset based on values from another

lsn24
I don't think an inner join will solve my problem.

*For each row in* paramsDataset, I need to filter myDataset, and then I need
to run a bunch of calculations on the filtered myDataset.

Say, for example, paramsDataset has three employee age ranges (20-30, 30-50,
50-60) and two regions (USA, Canada).

myDataset has all employees' information for three years, such as the days a
person came to work, the days they took off, etc.

I need to calculate the average number of days an employee worked per age
range for each region, the average number of days off per age range, etc.
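
To make the intended semantics concrete, here is a minimal plain-Java sketch (no Spark involved) of "for each parameter row, filter the data and average a column". Everything in it is hypothetical and not from the thread: the class names RangeAverages, Param and Employee, the field names, and the choice to treat each age range as half-open (minimum inclusive, maximum exclusive) so that an age of 30 falls only in 30-50. The nested loop has the same shape as a single join on the range-and-region condition followed by a grouped average, which is one way this kind of problem might be expressed in Spark without looping over parameter rows.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration only; all names here are hypothetical.
final class RangeAverages {

    // One row of the parameter dataset: an age range plus a region.
    static final class Param {
        final int minAge;
        final int maxAge;
        final String region;

        Param(int minAge, int maxAge, String region) {
            this.minAge = minAge;
            this.maxAge = maxAge;
            this.region = region;
        }

        // Label used as the grouping key, e.g. "20-30/USA".
        String key() {
            return minAge + "-" + maxAge + "/" + region;
        }
    }

    // One row of the data: an employee's age, region, and days worked.
    static final class Employee {
        final int age;
        final String region;
        final int daysWorked;

        Employee(int age, String region, int daysWorked) {
            this.age = age;
            this.region = region;
            this.daysWorked = daysWorked;
        }
    }

    // For each parameter row, keep the matching employees and average daysWorked.
    // Assumption: ranges are [minAge, maxAge), so boundary ages match one range only.
    static Map<String, Double> averageDaysWorked(List<Param> params, List<Employee> data) {
        Map<String, Double> result = new LinkedHashMap<>();
        for (Param p : params) {
            double sum = 0;
            int count = 0;
            for (Employee e : data) {
                boolean matches = e.region.equals(p.region)
                        && e.age >= p.minAge
                        && e.age < p.maxAge;
                if (matches) {
                    sum += e.daysWorked;
                    count++;
                }
            }
            // Like an inner join: parameter rows with no matching data produce no output.
            if (count > 0) {
                result.put(p.key(), sum / count);
            }
        }
        return result;
    }
}
```

Calling RangeAverages.averageDaysWorked(params, data) returns one entry per parameter row that matched at least one employee, keyed by a label such as "20-30/USA". A second metric such as days off would follow the same pattern with another field.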
