Collecting large dataset

Rishikesh Gawade
Hi. 
I have been trying to collect a large dataset (about 2 GB in size, 30 columns, more than a million rows) onto the driver side. I am aware that collecting such a huge dataset isn't recommended; however, the application within which the Spark driver is running requires that data.
While collecting the dataframe, the Spark job fails with a TaskResultLost error (result lost from the block manager).
I searched for solutions and set the following properties: spark.blockManager.port, spark.driver.blockManager.port, and spark.driver.maxResultSize set to 0 (unlimited). The application within which the Spark driver is running has a max heap size of 28 GB.
And yet the error still occurs.
There are 22 executors running in my cluster.
Is there any config or necessary step that I am missing before collecting such a large dataset?
Or is there any other effective approach that would reliably collect such a large dataset without failure?

Thanks,
Rishikesh
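
For reference, a minimal sketch (not from the original message) of how the properties mentioned above might be set when building the SparkSession in Scala. The port numbers are placeholders, and the 28 GB driver heap is assumed to be configured on the host application's JVM (e.g. -Xmx28g) rather than through SparkConf:

import org.apache.spark.sql.SparkSession

// Sketch only: the port values below are placeholders, not values taken from the thread.
val spark = SparkSession.builder()
  .appName("collect-large-dataset")
  .config("spark.driver.maxResultSize", "0")          // 0 means no limit on serialized results sent to the driver
  .config("spark.blockManager.port", "45000")         // executor-side block manager port (placeholder)
  .config("spark.driver.blockManager.port", "45001")  // driver-side block manager port (placeholder)
  .getOrCreate()                                      // driver heap is set on the host JVM, not here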
Re: Collecting large dataset

Marcin Tustin
Stop using collect for this purpose. Either continue your further processing in Spark (maybe you need to use streaming), or sink the data to something that can accept it (GCS/S3/Azure Storage/Redshift/Elasticsearch/whatever), and have the further processing read from that sink.
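
As a minimal sketch of that suggestion in Scala (the s3a:// path and function names are illustrative assumptions, and df stands in for the DataFrame from the question):

import org.apache.spark.sql.{DataFrame, SparkSession}

// Write the large result to an external sink instead of collecting it onto the driver.
// The s3a:// path is a placeholder; any of the stores listed above would work similarly.
def writeToSink(df: DataFrame): Unit =
  df.write
    .mode("overwrite")
    .parquet("s3a://some-bucket/large-result/")

// The downstream application then reads from the sink rather than calling collect():
def readFromSink(spark: SparkSession): DataFrame =
  spark.read.parquet("s3a://some-bucket/large-result/")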
