RDD order preservation through transformations

RDD order preservation through transformations

johan.grande.ext
Hi,

I'm a beginner using Spark with Scala and I'm having trouble understanding ordering in RDDs. I understand that RDDs are ordered (as they can be sorted) but that some transformations don't preserve order.

How can I know which transformations preserve order and which don't? Regarding map, for instance, this StackOverflow answer says map preserves order, but this answer on this mailing list implies it doesn't. The scaladoc doesn't say explicitly. Which is it?

https://stackoverflow.com/a/31525843
http://apache-spark-user-list.1001560.n3.nabble.com/rdd-ordering-gets-scrambled-tp5062p6482.html
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD

--
Johan Grande
Sopra Steria for Orange


_________________________________________________________________________________________________________________________

Ce message et ses pieces jointes peuvent contenir des informations confidentielles ou privilegiees et ne doivent donc
pas etre diffuses, exploites ou copies sans autorisation. Si vous avez recu ce message par erreur, veuillez le signaler
a l'expediteur et le detruire ainsi que les pieces jointes. Les messages electroniques etant susceptibles d'alteration,
Orange decline toute responsabilite si ce message a ete altere, deforme ou falsifie. Merci.

This message and its attachments may contain confidential or privileged information that may be protected by law;
they should not be distributed, used or copied without authorisation.
If you have received this email in error, please notify the sender and delete this message and its attachments.
As emails may be altered, Orange is not liable for messages that have been modified, changed or falsified.
Thank you.


---------------------------------------------------------------------
To unsubscribe e-mail: [hidden email]

Re: RDD order preservation through transformations

Suzen, Mehmet
I think order has no meaning in RDDs. See this post, especially regarding the zip method:
https://stackoverflow.com/questions/29268210/mind-blown-rdd-zip-method

Re: RDD order preservation through transformations

Ankit Maloo
AFAIK, the order of an RDD is maintained within a partition for map operations. A map operation cannot change the sequence within a partition, since a partition is local and the computation happens one record at a time.
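
To check this on a small example, a minimal sketch (assuming a local SparkContext; the object and method names are illustrative and not from the thread) can compare the contents of each partition before and after a map. glom() gathers every partition into an array, which makes the layout visible:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object MapOrderSketch {
  // Returns the per-partition contents of an RDD and of a map over it.
  def partitionsBeforeAndAfter(sc: SparkContext): (Array[Array[Int]], Array[Array[Int]]) = {
    val rdd = sc.parallelize(1 to 8, numSlices = 3)
    val mapped = rdd.map(_ * 10)
    // glom() turns each partition into one array; collect() returns them by partition index.
    (rdd.glom().collect(), mapped.glom().collect())
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("map-order-sketch")
    val sc = new SparkContext(conf)
    try {
      val (before, after) = partitionsBeforeAndAfter(sc)
      before.zip(after).foreach { case (b, a) =>
        println(s"${b.mkString(",")} -> ${a.mkString(",")}")
      }
    } finally sc.stop()
  }
}
```

Each partition of the mapped RDD should hold the transformed elements in the same positions as its parent. This only shows the behaviour of map on this input; it says nothing about transformations that shuffle data.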

Re: RDD order preservation through transformations

Suzen, Mehmet
But what happens if one of the partitions fails? How does fault tolerance recover the elements in other partitions?

Re: RDD order preservation through transformations

lucas.gary@gmail.com
I'm wondering why you need order preserved. We've had situations where keeping the source as an artificial field in the dataset was important, and I had to go through contortions to inject it (in this case the data source had no unique key).

Is this similar?  

Re: RDD order preservation through transformations

Suzen, Mehmet
I think it is one of the conceptual differences in Spark compared to
other languages: there is no indexing in plain RDDs. This was the
thread with Ankit:

Yes. So order preservation cannot be guaranteed in the case of
failure. I am also not sure whether partitions are ordered. Can you get the same
sequence of partitions in mapPartitions?

On 13 Sep 2017 19:54, "Ankit Maloo" <[hidden email]> wrote:
>
> RDDs are fault tolerant, as they can be recomputed using the DAG without storing the intermediate RDDs.
>
> On 13-Sep-2017 11:16 PM, "Suzen, Mehmet" <[hidden email]> wrote:
>>
>> But what happens if one of the partitions fails? How does fault tolerance recover the elements in other partitions?
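
On the question of whether partitions are ordered: each partition has an integer index, and mapPartitionsWithIndex passes that index to the function. A small sketch (assuming a local SparkContext; the object and method names are illustrative) that tags every element with its partition index and its offset inside the partition:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PartitionIndexSketch {
  // Returns (partition index, offset within partition, value) for every element.
  def tagWithPosition(sc: SparkContext): Array[(Int, Int, String)] = {
    val rdd = sc.parallelize(Seq("a", "b", "c", "d", "e"), numSlices = 2)
    rdd.mapPartitionsWithIndex { (partitionId, records) =>
      // records is a plain Scala Iterator over this partition's elements.
      records.zipWithIndex.map { case (value, offset) => (partitionId, offset, value) }
    }.collect()
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("partition-index-sketch")
    val sc = new SparkContext(conf)
    try tagWithPosition(sc).foreach(println)
    finally sc.stop()
  }
}
```

collect() returns the partitions in index order, and the scaladoc of zipWithIndex describes its ordering the same way: first by partition index, then by position within each partition.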

RE: RDD order preservation through transformations

johan.grande.ext
Well if the order cannot be guaranteed in case of a failure (or at all since failure can happen transparently), what does it mean to sort an RDD (method sortBy)?
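
As a hedged illustration of what sortBy provides (a sketch assuming a local SparkContext and made-up input; the names are illustrative): after sortBy, collect() returns the elements in sorted order, and zipWithIndex() numbers them by partition index and then by position within each partition, so the indices follow the sort.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SortBySketch {
  // Sorts a small RDD and pairs every element with its position in the sorted order.
  def sortedWithIndex(sc: SparkContext): Array[(Int, Long)] = {
    val rdd = sc.parallelize(Seq(5, 3, 8, 1, 9, 2), numSlices = 3)
    // sortBy range-partitions the data, so partition i holds smaller keys than partition i + 1.
    val sorted = rdd.sortBy(x => x)
    sorted.zipWithIndex().collect()
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("sort-by-sketch")
    val sc = new SparkContext(conf)
    try sortedWithIndex(sc).foreach(println)
    finally sc.stop()
  }
}
```

In this reading, the order of an RDD is the order of its partitions plus the order of the elements inside each partition, which is what collect(), take() and zipWithIndex() expose.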


RE: RDD order preservation through transformations

johan.grande.ext
In reply to this post by Suzen, Mehmet
(Sorry Mehmet, I have only just seen your first reply with the link to SO; it had initially gone to my spam folder :-/ )


RE: RDD order preservation through transformations

johan.grande.ext
In reply to this post by lucas.gary@gmail.com

In several situations I would like to zip RDDs knowing that their order matches. In particular I’m using an MLlib KMeansModel on an RDD of Vectors so I would like to do:

    myData.zip(myModel.predict(myData))

Also the first column in my RDD is a timestamp which I don’t want to be a part of the model, so in fact I would like to split the first column out of my RDD, then do:

    myData.zip(myModel.predict(myData.map(dropTimestamp)))

Moreover I’d like my data to be scaled and go through a principal component analysis first, so the main steps would be like:

    val noTs = myData.map(dropTimestamp)
    val scaled = scaler.transform(noTs)
    val projected = (new RowMatrix(scaled)).multiply(principalComponents).rows
    val clusters = myModel.predict(projected)
    val result = myData.zip(clusters)

Do you think there’s a chance that the 4 transformations above would preserve order so the zip at the end would be correct?
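
For the simplest of these cases, the scaladoc of RDD.zip states its assumption: both RDDs must have the same number of partitions and the same number of elements in each partition, for example when one was made through a map on the other. A minimal sketch of that case (assuming a local SparkContext; the data and the threshold rule standing in for predict are made up):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ZipAfterMapSketch {
  // Zips an RDD with a second RDD derived from it by a plain map.
  def zipWithDerived(sc: SparkContext): Array[((Long, Double), Int)] = {
    // Hypothetical rows: (timestamp, single feature value).
    val myData = sc.parallelize(Seq((1L, 0.1), (2L, 0.9), (3L, 0.4), (4L, 0.7)), numSlices = 2)
    // Stand-in for dropTimestamp followed by predict: one output per input record.
    val labels = myData.map { case (_, x) => if (x > 0.5) 1 else 0 }
    // Both RDDs share the same partitioning and per-partition element counts.
    myData.zip(labels).collect()
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("zip-after-map-sketch")
    val sc = new SparkContext(conf)
    try zipWithDerived(sc).foreach(println)
    finally sc.stop()
  }
}
```

Whether scaler.transform, RowMatrix.multiply and predict keep the same layout is not something this sketch shows; that depends on how each of them is implemented.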

 

 

Re: RDD order preservation through transformations

geoHeil
Usually Spark ML models specify the columns they use for training, i.e. you would select only your feature columns (X) for model training, while metadata such as the target labels (y) or your date column would still be present for each row.

Re: RDD order preservation through transformations

Suzen, Mehmet
In reply to this post by johan.grande.ext
On 14 September 2017 at 10:42,  <[hidden email]> wrote:

> val noTs = myData.map(dropTimestamp)
> val scaled = scaler.transform(noTs)
> val projected = (new RowMatrix(scaled)).multiply(principalComponents).rows
> val clusters = myModel.predict(projected)
> val result = myData.zip(clusters)
>
> Do you think there’s a chance that the 4 transformations above would
> preserve order so the zip at the end would be correct?

AFAIK, no. The sequence of transformations you have is not guaranteed
to preserve order.
First apply zip, then keep track of the indices through the
subsequent transformations
with `_2`, as zip returns tuples.
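
One way to read this suggestion is to use zipWithIndex to give each record an explicit key, carry the key through the later steps, and join on it instead of zipping by position. A sketch under those assumptions (Record, fakeCluster and attachClusters are hypothetical names; fakeCluster stands in for the scaling, PCA and KMeans steps):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object IndexedJoinSketch {
  // Hypothetical record: a timestamp plus a single numeric feature.
  final case class Record(ts: Long, x: Double)

  // Stand-in for the real per-record model steps.
  def fakeCluster(x: Double): Int = if (x > 0.5) 1 else 0

  def attachClusters(data: RDD[Record]): RDD[(Record, Int)] = {
    // Give every record an explicit key instead of relying on positional order.
    val indexed: RDD[(Long, Record)] =
      data.zipWithIndex().map { case (rec, idx) => (idx, rec) }
    // mapValues leaves the key untouched, so the index travels with each result.
    val clusters: RDD[(Long, Int)] = indexed.mapValues(rec => fakeCluster(rec.x))
    // Join on the key rather than zipping by position.
    indexed.join(clusters).values
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("indexed-join-sketch")
    val sc = new SparkContext(conf)
    try {
      val data = sc.parallelize(Seq(Record(1L, 0.1), Record(2L, 0.9), Record(3L, 0.7)), numSlices = 2)
      attachClusters(data).collect().foreach(println)
    } finally sc.stop()
  }
}
```

zipWithIndex launches a Spark job when the RDD has more than one partition, and the join generally involves a shuffle, so this trades some cost for not depending on positional order.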

RE: RDD order preservation through transformations

johan.grande.ext
Thanks all for your answers. After reading the provided links I am still uncertain of the details of what I'd need to do to get my calculations right with RDDs. However I discovered DataFrames and Pipelines on the "ML" side of the libs and I think they'll be better suited to my needs.

Best,
Johan Grande


Re: RDD order preservation through transformations

Suzen, Mehmet
Hi Johan,
DataFrames are built on top of RDDs, so I am not sure whether the ordering
issues are different there. Maybe you could create a simulated data set
that is just large enough, together with an example series of
transformations, to experiment on.
Best,
-m

Mehmet Süzen, MSc, PhD
<[hidden email]>

| PRIVILEGED AND CONFIDENTIAL COMMUNICATION This e-mail transmission,
and any documents, files or previous e-mail messages attached to it,
may contain confidential information that is legally privileged. If
you are not the intended recipient or a person responsible for
delivering it to the intended recipient, you are hereby notified that
any disclosure, copying, distribution or use of any of the information
contained in or attached to this transmission is STRICTLY PROHIBITED
within the applicable law. If you have received this transmission in
error, please: (1) immediately notify me by reply e-mail to
[hidden email],  and (2) destroy the original transmission and its
attachments without reading or saving in any manner. |


RE: RDD order preservation through transformations

johan.grande.ext
Well, DataFrames make it easier to work on only some columns of the data and to store results in new columns, removing the need to zip it all back together and thus to preserve order.
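
A sketch of that approach with the spark.ml Pipeline API, assuming a DataFrame with a ts column and three numeric columns x, y and z (the column names, both k values and the seed are made up for illustration):

```scala
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.{PCA, StandardScaler, VectorAssembler}
import org.apache.spark.sql.{DataFrame, SparkSession}

object DataFramePipelineSketch {
  // Scales, projects and clusters the feature columns; ts is never fed to the model.
  def clusterKeepingTimestamp(df: DataFrame): DataFrame = {
    val assembler = new VectorAssembler()
      .setInputCols(Array("x", "y", "z"))
      .setOutputCol("raw")
    val scaler = new StandardScaler().setInputCol("raw").setOutputCol("scaled")
    val pca = new PCA().setInputCol("scaled").setOutputCol("projected").setK(2)
    val kmeans = new KMeans()
      .setFeaturesCol("projected")
      .setPredictionCol("cluster")
      .setK(2)
      .setSeed(1L)
    val pipeline = new Pipeline()
      .setStages(Array[PipelineStage](assembler, scaler, pca, kmeans))
    // Every stage appends a column, so each row keeps its ts next to its cluster.
    pipeline.fit(df).transform(df).select("ts", "cluster")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("df-pipeline-sketch").getOrCreate()
    try {
      val df = spark.createDataFrame(Seq(
        (1L, 0.1, 0.2, 0.1), (2L, 0.2, 0.1, 0.3), (3L, 0.3, 0.3, 0.2),
        (4L, 9.1, 9.2, 9.3), (5L, 9.3, 9.1, 9.2), (6L, 9.2, 9.3, 9.1)
      )).toDF("ts", "x", "y", "z")
      clusterKeepingTimestamp(df).show()
    } finally spark.stop()
  }
}
```

Because the prediction is written into a new column of the same row, there is no second dataset to line up afterwards.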

