Can I collect Dataset[Row] to driver without converting it to Array [Row]?


Can I collect Dataset[Row] to driver without converting it to Array [Row]?

maqy1995@outlook.com

When the data is stored as a Dataset[Row], the memory usage is very small.

When I use collect() to bring the data to the driver, each row of the dataset is converted to a Row object and stored in an array, which brings a large memory overhead.

So, can I collect a Dataset[Row] to the driver while keeping its internal data format?

 

Best regards,

maqy

 


Re: Can I collect Dataset[Row] to driver without converting it to Array [Row]?

MidwestMike
What would you do with it once you get it onto the driver as a Dataset[Row]?

Sent from my iPhone



Re: Can I collect Dataset[Row] to driver without converting it to Array [Row]?

maqy1995@outlook.com

I will traverse this Dataset, convert it to Arrow, and send it to TensorFlow through a socket.

I tried using toLocalIterator() to traverse the dataset instead of collect()ing it to the driver, but toLocalIterator() creates many jobs and adds a lot of time overhead.

 

Best regards,

maqy
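The trade-off described above can be sketched as follows. This is a sketch only, assuming a running SparkSession named `spark` and a Dataset[Row] named `df`; both names are placeholders, not from the thread:

```scala
import org.apache.spark.sql.{Row, SparkSession}

// Sketch only: placeholder session and dataset.
val spark = SparkSession.builder().appName("collect-sketch").getOrCreate()
val df = spark.range(1000).toDF("id")

// collect() materializes every row on the driver at once (driver memory
// proportional to the whole dataset):
val all: Array[Row] = df.collect()

// toLocalIterator() streams partitions to the driver one at a time, using far
// less memory, but Spark schedules a separate job per partition, which is
// where the extra wall-clock time comes from:
val it: java.util.Iterator[Row] = df.toLocalIterator()
while (it.hasNext) {
  val row = it.next() // process one row without holding the whole dataset
}
```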

 

From: [hidden email]
Sent: April 22, 2020, 16:09
To: [hidden email]
Cc: [hidden email]
Subject: Re: Can I collect Dataset[Row] to driver without converting it to Array [Row]?

 



Re: Can I collect Dataset[Row] to driver without converting it to Array [Row]?

Andrew Melo
In reply to this post by MidwestMike
Hi Maqy

On Wed, Apr 22, 2020 at 3:24 AM maqy <[hidden email]> wrote:
>
> I will traverse this Dataset to convert it to Arrow and send it to Tensorflow through Socket.

(I presume you're using the python tensorflow API, if you're not, please ignore)

There is a JIRA/PR ([1] [2]) which proposes to add native support for
Arrow serialization to Python.

Under the hood, Spark is already serializing into Arrow format to
transmit to python, it's just additionally doing an unconditional
conversion to pandas once it reaches the python runner -- which is
good if you're using pandas, not so great if pandas isn't what you
operate on. The JIRA above would let you receive the arrow buffers
(that already exist) directly.

Cheers,
Andrew
[1] https://issues.apache.org/jira/browse/SPARK-30153
[2] https://github.com/apache/spark/pull/26783


---------------------------------------------------------------------
To unsubscribe e-mail: [hidden email]


Re: Can I collect Dataset[Row] to driver without converting it to Array [Row]?

Tang Jinxin
Maybe you could try something like foreachPartition (similar to how it's used inside foreachRDD), which avoids gathering everything on the driver and the extra cost that brings.

xiaoxingstack
Email: xiaoxingstack@...
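The foreachPartition idea can be sketched like this. This is a sketch only: `df` stands for the Dataset[Row] to export, and `tf-host`/`9999` are placeholder coordinates for the machine feeding TensorFlow, not details from the thread:

```scala
import java.io.{BufferedOutputStream, DataOutputStream}
import java.net.Socket
import org.apache.spark.sql.Row

df.foreachPartition { (rows: Iterator[Row]) =>
  // Runs on the executors: each partition opens its own connection, so the
  // data is never gathered on the driver at all.
  val socket = new Socket("tf-host", 9999)
  val out = new DataOutputStream(new BufferedOutputStream(socket.getOutputStream))
  try {
    // writeUTF is only for illustration (it caps each record at 64 KB); an
    // Arrow stream writer could be used here instead.
    rows.foreach(row => out.writeUTF(row.mkString(",")))
    out.flush()
  } finally {
    socket.close()
  }
}
```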




Re: Can I collect Dataset[Row] to driver without converting it to Array [Row]?

maqy1995@outlook.com
In reply to this post by Andrew Melo

Hi Andrew,

Thank you for your reply. I am using the Scala API of Spark, and the TensorFlow machine is not in the Spark cluster. Is this JIRA/PR still useful in this situation?

In addition, the current bottleneck of the application is that the amount of data transferred over the network (using collect()) is too large, and the deserialization also seems to take some time.

 

Best wishes,

maqy

 


 


Re: Can I collect Dataset[Row] to driver without converting it to Array [Row]?

Tang Jinxin
In reply to this post by Andrew Melo
Hi maqy,
Thanks for your question. After some thought, I have a few ideas. First, try not to collect to the driver if it is not necessary; instead, send the data from the executors (for example with foreachPartition). Second, if you are not already using a high-performance serializer such as Kryo, it is worth a try. In summary, I recommend the first point (design a more efficient path when the data is too large).
Best wishes,
littlestar

xiaoxingstack
Email: xiaoxingstack@...
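The Kryo suggestion is a configuration change rather than a code change. A minimal sketch (the registered class is just an example, not from the thread):

```scala
import org.apache.spark.SparkConf

// Sketch only: switches Spark's RDD/shuffle serializer to Kryo. Note that
// Dataset[Row] contents travel in Spark's own encoder format, so Kryo mainly
// helps closures and RDD-based paths.
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Optionally register classes to shrink the serialized form further:
  .registerKryoClasses(Array(classOf[Array[Byte]]))
```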


 


Re: Can I collect Dataset[Row] to driver without converting it to Array [Row]?

maqy1995@outlook.com

Hi Jinxin,

Thanks for your suggestions, I will try foreachPartition later.

 

Best regards,

maqy

 


 

 
