Write Partitioned Parquet Using UDF On Partition Column


Richard Primera
Greetings,


In version 1.6.0, is it possible to write a partitioned dataframe to
parquet format using a UDF on the partition column? I'm using pyspark.

Let's say I have a dataframe with a column `date`, of type string or int,
which contains values such as `20170825`. Is it possible to define a UDF
called `by_month` or `by_year` that could then be used when writing the
table as parquet, ideally in this way:

*dataframe.write.format("parquet").partitionBy(by_month(dataframe["date"])).save("/some/parquet")*

I haven't tried this yet, so I don't know if it's possible. If so, what are
the ways it can be done? Ideally without having to resort to adding an
extra column like `part_id` to the dataframe holding the result of
`by_month(date)` and partitioning by that column instead.
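For reference, the extra-column workaround I'm hoping to avoid would look
something like the sketch below (the helper name `by_month` and the column
name `part_month` are just placeholders I made up); the month-key extraction
itself is plain Python:

```python
# Hypothetical helper: extract a "yyyyMM" month key from a "yyyyMMdd" string.
def by_month(date_str):
    return date_str[:6]

# In PySpark this would be wrapped as a UDF and materialized as a new
# column, since partitionBy() only accepts column names, e.g.:
#
#   from pyspark.sql.functions import udf
#   from pyspark.sql.types import StringType
#
#   by_month_udf = udf(by_month, StringType())
#   (dataframe
#       .withColumn("part_month", by_month_udf(dataframe["date"]))
#       .write.format("parquet")
#       .partitionBy("part_month")
#       .save("/some/parquet"))

print(by_month("20170825"))  # -> 201708
```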


Thanks in advance.



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org