I've been trying to migrate some piece of code from Scala - Spark 2.X to PySpark 3.0.1. Part of the software includes a User-Defined Aggregate Function (UDAF), which represented a two-fold problem:
The UserDefinedAggregateFunction abstract class is deprecated in Spark >= 3.0 in favor of the Aggregator abstract class.
Pyspark doesn't implement UDAFs. It only has, as far as I can tell, a registerJavaUDAF function.
I thus reimplemented the UserDefinedAggregateFunction as an Aggregator, and attempted to use registerJavaUDAF() to register the Scala code in PySpark. However, I'm met with the following exception:
AnalysisException: class <class-name> doesn't implement interface UserDefinedAggregateFunction;
I'm able to use the old class implementing the UserDefinedAggregateFunction abstract class but keeping legacy code isn't desirable. My understanding of the documentation led me to believe pyspark would be able to register Scala UDAF implementing the Aggregator abstract class. I feel this is a bug, or at least means the documentation is outdated. It also means there's, as far as I can tell, no way to use UDAF natively with PySpark >= 3.0.
Am I missing a solution?
DISCLAIMER : The content of this e-mail
message does not constitute a commitment of S.A. ALOALTO N.V. or its
subsidiaries/affiliates. This e-mail and any attachments thereto may contain
information which is confidential and/or protected by intellectual property
rights and are intended for the intended recipient only. Any use of the
information contained herein (including, but not limited to, total or partial
reproduction, communication or distribution in any form) by persons other than
the designated recipient(s) is prohibited. If an addressing or transmission
error has misdirected this e-mail, please notify the author, either by
telephone or by e-mail and delete the material from any computer.