How to parallelize zip file processing?

mytramesh
I know Spark doesn't support zip files directly, since the format cannot be
split and distributed. Are there any techniques to process such a file quickly?

I am trying to process a zip file of around 4 GB. All the data goes to one
executor, and only one task is assigned to process it.

Even when I call the repartition method, the data gets repartitioned, but on
the same executor.

How can I distribute the data to other executors?
How can I get more tasks/threads assigned when the data is partitioned on the
same executor?




--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: [hidden email]

Re: How to parallelize zip file processing?

Jörn Franke
Does the zip file contain only one file? If so, I fear you can only use one core.

By the way, do you mean gzip? In that case you cannot decompress it in parallel.

How is the zip file created? Can't you create several smaller ones instead?
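
To illustrate the first question: a zip archive holding several entries can, in principle, be handled one entry per task, whereas a single entry is one sequential stream. The sketch below is plain Python rather than Spark, with threads standing in for tasks; the function names and the line-counting workload are hypothetical and only show the shape of the idea.

```python
import io
import zipfile
from concurrent.futures import ThreadPoolExecutor


def count_lines_in_entry(zip_bytes, entry_name):
    """Decompress a single archive entry and count its lines."""
    # Each worker opens its own ZipFile handle so readers share no state.
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        with archive.open(entry_name) as entry:
            return entry_name, sum(1 for _ in entry)


def process_entries_in_parallel(zip_bytes, max_workers=4):
    """Run one unit of work per archive entry (hypothetical workload)."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        # Skip directory entries; only real files are worth a task.
        names = [i.filename for i in archive.infolist() if not i.is_dir()]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(lambda n: count_lines_in_entry(zip_bytes, n), names)
        return dict(results)
```

If the list of entry names has length one, there is only one unit of work to hand out, which matches the single-task behaviour described in the question.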

> On 10. Aug 2018, at 22:54, mytramesh <[hidden email]> wrote:
>
> I know Spark doesn't support zip files directly, since the format cannot be
> split and distributed. Are there any techniques to process such a file quickly?
>
> I am trying to process a zip file of around 4 GB. All the data goes to one
> executor, and only one task is assigned to process it.
>
> Even when I call the repartition method, the data gets repartitioned, but on
> the same executor.
>
> How can I distribute the data to other executors?
> How can I get more tasks/threads assigned when the data is partitioned on the
> same executor?

Re: How to parallelize zip file processing?

mytramesh

Thanks for your reply. I receive the dataset from a mainframe system that I
don't control.

    I tried the following to move data to other executors, without success:

      1. Called the repartition method. The data was repartitioned, but on the
same executor. Only one core is processing all of these partitions.

      2. After reading the zip files into an RDD, saved it to the S3 file system
and re-read it as a distributable file. In this scenario too, the data is
loaded onto one executor and one core processes it.

      Any suggestions for moving this data to other executors?
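
One option along the lines of "several smaller files", when the archive itself cannot be changed at the source, is a pre-processing step that streams the single entry once and writes it back out as many uncompressed part files. The sketch below is plain Python, not Spark; `split_zip_entry`, the `part-NNNNN.txt` naming, and the `lines_per_part` default are assumptions for illustration, and it presumes line-oriented records.

```python
import zipfile
from pathlib import Path


def split_zip_entry(zip_path, entry_name, out_dir, lines_per_part=1_000_000):
    """Stream one zip entry and write it out as several part files."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    part_paths = []
    part_file = None
    try:
        with zipfile.ZipFile(zip_path) as archive:
            with archive.open(entry_name) as entry:
                # Iterating the entry yields one line (as bytes) at a time,
                # so the whole file never has to fit in memory.
                for line_number, line in enumerate(entry):
                    if line_number % lines_per_part == 0:
                        # Start a new part file every lines_per_part lines.
                        if part_file is not None:
                            part_file.close()
                        path = out_dir / f"part-{len(part_paths):05d}.txt"
                        part_file = open(path, "wb")
                        part_paths.append(path)
                    part_file.write(line)
    finally:
        if part_file is not None:
            part_file.close()
    return part_paths
```

The decompression itself is still sequential, but the resulting part files are independent inputs, so a later job reading the output directory has more than one file to hand out as separate tasks.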



 


