Overwrite Mode not Working Correctly in Spark 3.0.0


Overwrite Mode not Working Correctly in spark 3.0.0

anbutech
Hi Team,

I'm seeing odd behavior with a PySpark DataFrame (Databricks, Delta, Spark
3.0.0).

I have tried the two options below to write the processed DataFrame data into
a Delta table against the table's partition columns, but overwrite mode
replaces the entire table, and I can't figure out why the write overwrites
everything.

I'm also getting the following error when testing option 2:

Predicate references non-partition column 'json_feeds_flatten_data'. Only
the partition columns may be referenced: [table_name, y, m, d, h];

Could you please explain why PySpark behaves this way? It would be very
helpful to understand the mistake here.

Sample partition column values:
-------------------------------

table_name='json_feeds_flatten_data'
y=2020
m=7
d=19
h=0

Option 1:

from pyspark.sql.functions import lit

partition_keys = ['table_name', 'y', 'm', 'd', 'h']

(final_df
 .withColumn('y', lit(y).cast('int'))
 .withColumn('m', lit(m).cast('int'))
 .withColumn('d', lit(d).cast('int'))
 .withColumn('h', lit(h).cast('int'))
 .write
 .partitionBy(partition_keys)
 .format("delta")
 .mode('overwrite')
 .saveAsTable(target_table))

Option 2:

rep_wh = 'table_name={} AND y={} AND m={} AND d={} AND h={}'.format(table_name, y, m, d, h)

(final_df
 .withColumn('y', lit(y).cast('int'))
 .withColumn('m', lit(m).cast('int'))
 .withColumn('d', lit(d).cast('int'))
 .withColumn('h', lit(h).cast('int'))
 .write
 .format("delta")
 .mode('overwrite')
 .option('replaceWhere', rep_wh)
 .saveAsTable(target_table))
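A note on the predicate string: the table_name value is interpolated without quotes, so Spark's SQL parser reads json_feeds_flatten_data as a column name rather than a string literal, which is most likely what triggers the error above. A minimal sketch of the difference, using the sample values from this post (plain Python, no Spark needed):

```python
# Sample values from the post.
table_name = 'json_feeds_flatten_data'
y, m, d, h = 2020, 7, 19, 0

# As written in option 2: the string value is interpolated without quotes,
# so the resulting predicate compares two identifiers and Spark resolves
# json_feeds_flatten_data as a (non-partition) column name.
rep_wh = 'table_name={} AND y={} AND m={} AND d={} AND h={}'.format(
    table_name, y, m, d, h)
print(rep_wh)
# table_name=json_feeds_flatten_data AND y=2020 AND m=7 AND d=19 AND h=0

# With quotes around {}, the value becomes a SQL string literal that is
# compared against the partition column table_name.
rep_wh_quoted = "table_name='{}' AND y={} AND m={} AND d={} AND h={}".format(
    table_name, y, m, d, h)
print(rep_wh_quoted)
# table_name='json_feeds_flatten_data' AND y=2020 AND m=7 AND d=19 AND h=0
```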

Thanks



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: [hidden email]


Re: Overwrite Mode not Working Correctly in Spark 3.0.0

Piyush Acharya
Can you please send the full error message? It would be very helpful to get to the root cause.

On Sun, Jul 19, 2020 at 10:57 PM anbutech <[hidden email]> wrote:


Re: Overwrite Mode not Working Correctly in Spark 3.0.0

anbutech
Hi,

When I use option 1, it completely overwrites the whole table, which is not
what I expect; I'm running this for multiple tables across different hours.

When I use option 2, I get the following error:

Predicate references non-partition column 'json_feeds_flatten_data'. Only
the partition columns may be referenced: [table_name, y, m, d, h];
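Both symptoms are consistent with two separate issues: mode('overwrite') together with saveAsTable replaces the entire Delta table unless the write is restricted with replaceWhere, and the replaceWhere predicate in option 2 interpolates the table_name value without quotes, so it parses as a column reference. A sketch of a corrected write along those lines (the helper names are hypothetical; it assumes the same DataFrame and partition columns as in the original post):

```python
def partition_predicate(table_name, y, m, d, h):
    """Build a replaceWhere predicate. The quotes around the first {} make
    the table_name value a SQL string literal instead of a column reference,
    avoiding the 'Predicate references non-partition column' error."""
    return ("table_name='{}' AND y={} AND m={} AND d={} AND h={}"
            .format(table_name, y, m, d, h))

def overwrite_partition(final_df, target_table, table_name, y, m, d, h):
    """Overwrite only the matching partition of a Delta table.

    Without replaceWhere, mode('overwrite') replaces the whole table,
    which is the behavior seen with option 1."""
    from pyspark.sql.functions import lit  # imported here to keep the sketch self-contained
    rep_wh = partition_predicate(table_name, y, m, d, h)
    (final_df
     .withColumn('y', lit(y).cast('int'))
     .withColumn('m', lit(m).cast('int'))
     .withColumn('d', lit(d).cast('int'))
     .withColumn('h', lit(h).cast('int'))
     .write
     .format('delta')
     .mode('overwrite')
     .option('replaceWhere', rep_wh)
     .saveAsTable(target_table))
```

Note that Delta validates a replaceWhere write: rows that do not match the predicate cause the write to fail rather than silently landing in other partitions.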

Thanks


