Spark - Partitions


Spark - Partitions

khajaasmath786
Hi,

I am reading the result of a Hive query and writing the data back into Hive after doing some transformations.

I changed spark.sql.shuffle.partitions to 2000, and since then the job completes quickly, but the main problem is that I now get 2000 files for each partition, each about 10 MB in size.

Is there a way to keep the same performance but write fewer files?

I am trying repartition now, but I would like to know whether there are any other options.

Thanks,
Asmath
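For context, here is a minimal sketch of the pattern being described; the SparkSession, DataFrame, column, and table names are placeholders rather than anything from the actual job. It illustrates why the small files appear: with dynamic partition inserts, each of the 2000 shuffle tasks can write one file per Hive partition it touches.

    // Sketch only; spark is a SparkSession, table and column names are made up.
    spark.conf.set("spark.sql.shuffle.partitions", "2000")

    val transformed = spark.table("source_db.datapoint_raw")
      .groupBy("year", "month", "day", "vin")   // any wide transformation now runs with 2000 tasks
      .count()

    // The write inherits those 2000 partitions, so each Hive partition can receive
    // up to 2000 small files.
    transformed.write.mode("overwrite").insertInto("target_db.datapoint_agg")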

Re: Spark - Partitions

Chetan Khatri

Use repartition
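Spelled out as a rough sketch: here transformed stands for the DataFrame produced by the existing transformations, and 200 is an arbitrary target file count; both are placeholders. Calling repartition on the result just before the write controls how many files are produced, independently of the spark.sql.shuffle.partitions value used by the earlier stages.

    // Placeholder names; 200 is an arbitrary target file count.
    val compacted = transformed.repartition(200)
    compacted.write.mode("overwrite").insertInto("target_db.datapoint_agg")

If the table is partitioned, repartitioning by the partition columns instead, e.g. repartition(col("year"), col("month"), col("day")), tends to reduce the file count further, because each task then writes to only a few Hive partitions.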


Re: Spark - Partitions

Tushar Adeshara
In reply to this post by khajaasmath786

You can also try coalesce, as it avoids a full shuffle.
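A matching sketch, with the same placeholder names and transformed standing for the already-computed DataFrame: coalesce merges the existing partitions down to the requested number without a full shuffle.

    // coalesce(200) merges 2000 partitions into 200 without re-shuffling every row.
    val compacted = transformed.coalesce(200)
    compacted.write.mode("overwrite").insertInto("target_db.datapoint_agg")

The caveat is that coalesce is a narrow transformation, so it can also lower the parallelism of the stage that computes the data; the caching idea suggested later in this thread is one way around that.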


Regards,

Tushar Adeshara

Technical Specialist – Analytics Practice

Cell: +91-81490 04192

Persistent Systems Ltd. | Partners in Innovation | www.persistentsys.com





Re: Spark - Partitions

khajaasmath786
I tried repartition, but spark.sql.shuffle.partitions is taking precedence over repartition or coalesce. How can I get fewer files with the same performance?
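A quick way to check which value is actually winning, where df stands for whatever DataFrame is handed to the write:

    // If the repartitioned or coalesced result is the one being written, this prints that
    // value; if it prints 2000, the write is still seeing spark.sql.shuffle.partitions.
    println(df.rdd.getNumPartitions)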


Re: Spark - Partitions

MidwestMike
Have you tried caching it and then using coalesce?
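One way to read that suggestion, as a sketch with placeholder names and an assumed target of 200 files: materialise the shuffled result in the cache first, so the expensive work still runs at 2000 partitions, and only the final write uses the smaller partition count.

    // Placeholder names throughout; the wide transformation still runs with 2000 partitions.
    val shuffled = spark.table("source_db.datapoint_raw").dropDuplicates()
    shuffled.cache()
    shuffled.count()   // an action so the cache is actually populated

    // The write then reads the cached blocks and merges them into fewer output files.
    shuffled.coalesce(200).write.mode("overwrite").insertInto("target_db.datapoint_agg")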




Re: Spark - Partitions

khajaasmath786
Yes, I still see a large number of part files, exactly the number I set in spark.sql.shuffle.partitions.

Sent from my iPhone


Re: Spark - Partitions

sebastian.piu
You have to repartition/coalesce after the operation that causes the shuffle, as it is that shuffle which takes the value you've set.
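In code the ordering looks roughly like this (placeholder names, not the original job): let the wide transformation run at whatever spark.sql.shuffle.partitions says, then repartition or coalesce its result, and make sure it is that result that gets written.

    // The dropDuplicates shuffle uses spark.sql.shuffle.partitions (2000 here).
    val deduped = spark.table("source_db.datapoint_raw").dropDuplicates()

    // Applied to the shuffled result, not before it; 200 is an arbitrary target.
    val forWrite = deduped.coalesce(200)
    forWrite.write.mode("overwrite").insertInto("target_db.datapoint")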


Re: Spark - Partitions

khajaasmath786
In my case I am just writing the DataFrame back to Hive, so when is the best point to repartition it? I did the repartition before calling insert overwrite on the table.


Re: Spark - Partitions

sebastian.piu

Can you share some code?



Re: Spark - Partitions

khajaasmath786
    val unionDS = rawDS.union(processedDS)
      //unionDS.persist(StorageLevel.MEMORY_AND_DISK)
      val unionedDS = unionDS.dropDuplicates()
      //val unionedPartitionedDS=unionedDS.repartition(unionedDS("year"),unionedDS("month"),unionedDS("day")).persist(StorageLevel.MEMORY_AND_DISK)
      //unionDS.persist(StorageLevel.MEMORY_AND_DISK)
      unionDS.repartition(numPartitions);
      unionDS.createOrReplaceTempView("datapoint_prq_union_ds_view")
      sparkSession.sql(s"set hive.exec.dynamic.partition.mode=nonstrict")
      val deltaDSQry = "insert overwrite table  datapoint PARTITION(year,month,day) select VIN, utctime, description, descriptionuom, providerdesc, dt_map, islocation, latitude, longitude, speed, value,current_date,YEAR, MONTH, DAY from datapoint_prq_union_ds_view"
      println(deltaDSQry)
      sparkSession.sql(deltaDSQry)


Here is the code, and the properties file used in my project is attached.



Attachment: application-datapoint-hdfs-dyn.properties (5K)

Re: Spark - Partitions

sebastian.piu

Change this:

    unionDS.repartition(numPartitions);
    unionDS.createOrReplaceTempView(...

to this:

    unionDS.repartition(numPartitions).createOrReplaceTempView(...
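The underlying point, for anyone following along: repartition returns a new Dataset rather than modifying unionDS in place, so on its own line its result is discarded and the registered view still carries the 2000 shuffle partitions. Reusing the names from the snippet quoted above, the corrected tail would look roughly like this:

    // Chain the call so the repartitioned result is what gets registered and written.
    unionDS.repartition(numPartitions).createOrReplaceTempView("datapoint_prq_union_ds_view")

    sparkSession.sql("set hive.exec.dynamic.partition.mode=nonstrict")
    sparkSession.sql(deltaDSQry)   // now at most numPartitions files per Hive partition

It also looks as if unionedDS (the dropDuplicates result) is built but never used, since the view is registered from unionDS; if deduplication is intended, chain the repartition onto unionedDS instead.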

