Repartition or Coalesce not working

khajaasmath786
Hi,

I have a use case involving large files in HDFS. The file in question is 3 GB.

This is existing production code, and I am trying to improve the job's performance.

Sample code:
textDF = dataframe  # DataFrame created from an HDFS path
logging.info("Number of partitions: " + str(textDF.rdd.getNumPartitions()))  # prints 1
textDF.repartition(100)
logging.info("Number of partitions: " + str(textDF.rdd.getNumPartitions()))  # prints 1

Any suggestions on why this is happening?

The next block of code, which takes a long time:
rdd.filter(lambda line: len(line) != collistlenth)

Is there any way to parallelize and speed up this step?

Thanks,
Asmath

Re: Repartition or Coalesce not working

srowen
You need to do something with the result of repartition. You haven't changed textDF; repartition returns a new DataFrame rather than modifying the existing one.
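To illustrate the point, here is a minimal sketch of keeping the result of repartition. The helper name repartition_and_log, the demo function, and the use of spark.read.text are assumptions made for illustration and are not taken from the original job; the sketch only relies on the DataFrame methods already shown in the question (repartition and rdd.getNumPartitions).

```python
import logging


def repartition_and_log(df, num_partitions):
    """Return a repartitioned DataFrame and log the partition counts.

    Hypothetical helper: `df` is expected to behave like a Spark DataFrame,
    i.e. expose .repartition(n) and .rdd.getNumPartitions().
    """
    before = df.rdd.getNumPartitions()
    # repartition() does not modify df; it returns a new DataFrame.
    repartitioned = df.repartition(num_partitions)
    after = repartitioned.rdd.getNumPartitions()
    logging.info("Number of partitions: %d -> %d", before, after)
    return repartitioned


def demo(path):
    """Usage sketch. Assumes pyspark is installed and `path` is a readable text file."""
    from pyspark.sql import SparkSession  # imported lazily so the helper has no hard dependency

    spark = SparkSession.builder.getOrCreate()
    textDF = spark.read.text(path)  # assumption: the DataFrame is built from a text file
    # Reassign the variable so later steps see the repartitioned DataFrame.
    textDF = repartition_and_log(textDF, 100)
    return textDF
```

Any later step, such as the rdd.filter call from the question, would then need to run on the returned DataFrame (or its rdd) to use the new partitioning.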

(Replying to KhajaAsmath Mohammed's message of Mon, Mar 22, 2021, 12:15 PM.)

Re: Repartition or Coalesce not working

khajaasmath786
Thanks, Sean. I just realized it. Let me try that.

(Replying to Sean Owen's message of Mon, Mar 22, 2021, 12:31 PM.)