Getting PySpark Partition Locations

Getting PySpark Partition Locations

Tzahi File
Hi,

I'm using PySpark to write a DataFrame to S3 with the following command:

    df.write.partitionBy("day", "hour", "country").mode("overwrite").parquet(s3_output)

Is there a way to get the list of partitions that were created? For example:
day=2020-06-20/hour=1/country=US
day=2020-06-20/hour=2/country=US
...

--
Tzahi File
Data Engineer
ironSource
mobile +972-546864835
ironSource HQ - 121 Derech Menachem Begin st. Tel Aviv
ironsrc.com
Re: Getting PySpark Partition Locations

Jörn Franke
By doing a select on the df?
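A minimal sketch of that approach, assuming the written DataFrame is still in scope as df (column names taken from the partitionBy call above). Note that distinct() shuffles the data, so the cost grows with the size of df:

    # Collect the distinct combinations of the partition columns;
    # each row corresponds to one Hive-style partition directory.
    partitions = df.select("day", "hour", "country").distinct().collect()

    # Format each row as a partition path, e.g. day=2020-06-20/hour=1/country=US
    for row in partitions:
        print("day={}/hour={}/country={}".format(row["day"], row["hour"], row["country"]))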

Re: Getting PySpark Partition Locations

Tzahi File
I don't want to run a distinct query on the partition columns; the df contains over 1 billion records.
I just want to know which partitions were created.

On Thu, Jun 25, 2020 at 4:04 PM Jörn Franke <[hidden email]> wrote:
By doing a select on the df?

Re: Getting PySpark Partition Locations

srowen
In reply to this post by Tzahi File
You can always list the S3 output path, of course.
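One rough sketch of that from Python, using boto3 (not mentioned in the thread; any S3 listing tool would do) to recover the partition directories from the object keys under the output prefix. bucket and prefix below are hypothetical stand-ins for wherever s3_output points:

    import boto3

    bucket = "my-bucket"        # hypothetical: the bucket in s3_output
    prefix = "path/to/output/"  # hypothetical: the key prefix in s3_output

    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")

    # Strip the file name from each data-file key to recover the distinct
    # partition directories, e.g. day=2020-06-20/hour=1/country=US
    partitions = set()
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            if key.endswith(".parquet"):
                partitions.add(key[len(prefix):].rsplit("/", 1)[0])

    for p in sorted(partitions):
        print(p)

Unlike the distinct() approach, this touches no data at all; it only pages through the key listing, so it stays cheap even for a very large table.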
