Referencing a scala/java PipelineStage from pyspark - constructor issues with HasInputCol

Referencing a scala/java PipelineStage from pyspark - constructor issues with HasInputCol

Aviad Klein
Hi, I've described the same problem on Stack Overflow and can't seem to find answers.

I have custom Spark PipelineStages written in Scala that are specific to my organization. They work well in Scala Spark.

However, when I try to wrap them as shown here so I can use them in PySpark, strange things happen, mostly around the constructors of the Java objects.
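The wrapper pattern I'm following looks roughly like this (a sketch only: the fully-qualified class name `com.example.spark.FooTransformer` is a placeholder for my real Scala class, and it assumes that class has a `this(uid: String)` constructor that `_new_java_obj` can invoke):

```python
from pyspark import keyword_only
from pyspark.ml.param.shared import HasInputCol, HasOutputCol
from pyspark.ml.wrapper import JavaTransformer


class FooTransformer(JavaTransformer, HasInputCol, HasOutputCol):
    """Python-side handle for a custom Scala Transformer."""

    @keyword_only
    def __init__(self, inputCol=None, outputCol=None):
        super(FooTransformer, self).__init__()
        # Instantiate the JVM-side object; the class name below is a
        # placeholder for the real fully-qualified Scala class name.
        self._java_obj = self._new_java_obj(
            "com.example.spark.FooTransformer", self.uid)
        kwargs = self._input_kwargs
        self.setParams(**kwargs)

    @keyword_only
    def setParams(self, inputCol=None, outputCol=None):
        kwargs = self._input_kwargs
        return self._set(**kwargs)
```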

Please refer to the Stack Overflow question; it's the most thoroughly documented.

Thanks, any help is appreciated.

--
Aviad Klein
Director of Data Science



Re: Referencing a scala/java PipelineStage from pyspark - constructor issues with HasInputCol

srowen
Looks like you are building against Spark 3 and running on Spark 2, or something along those lines.

On Mon, Aug 17, 2020 at 4:02 AM Aviad Klein <[hidden email]> wrote:

Re: Referencing a scala/java PipelineStage from pyspark - constructor issues with HasInputCol

Aviad Klein
Hi Owen, it's omitted from what I pasted, but I'm using Spark 2.4.4 on both.

On Mon, Aug 17, 2020 at 4:37 PM Sean Owen <[hidden email]> wrote:

Re: Referencing a scala/java PipelineStage from pyspark - constructor issues with HasInputCol

srowen
Hm, next guess: do you have a no-arg constructor this() on FooTransformer? Also consider extending UnaryTransformer.
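Something along these lines (a sketch only; the string-to-string transform and the class name are placeholders for whatever FooTransformer actually does):

```scala
import org.apache.spark.ml.UnaryTransformer
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.types.{DataType, StringType}

class FooTransformer(override val uid: String)
    extends UnaryTransformer[String, String, FooTransformer] {

  // No-arg constructor: needed so the class can be instantiated
  // reflectively, e.g. from Py4J or by ML persistence.
  def this() = this(Identifiable.randomUID("fooTransformer"))

  // Placeholder transform logic.
  override protected def createTransformFunc: String => String = _.toUpperCase

  override protected def outputDataType: DataType = StringType

  override def copy(extra: ParamMap): FooTransformer = defaultCopy(extra)
}
```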

On Mon, Aug 17, 2020 at 9:08 AM Aviad Klein <[hidden email]> wrote:

Re: Referencing a scala/java PipelineStage from pyspark - constructor issues with HasInputCol

chris-2
Hi,

I took your code and ran it on Spark 2.4.5 and it works OK for me. My first thought, like Sean's, is that you have a Spark ML version mismatch somewhere.

Chris 

On 17 Aug 2020, at 16:18, Sean Owen <[hidden email]> wrote:



Re: Referencing a scala/java PipelineStage from pyspark - constructor issues with HasInputCol

Aviad Klein
Hey Chris and Sean, thanks for taking the time to answer.

Perhaps my installation of PySpark is off, although I did use version 2.4.4.
When developing in Scala and PySpark, how do you set up your environment?

I used sbt for Scala Spark:

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.4.4",
  "org.apache.spark" %% "spark-sql" % "2.4.4",
  "org.scalactic" %% "scalactic" % "3.1.2",
  "org.scalatest" %% "scalatest" % "3.1.2" % "test",
  "org.apache.spark" %% "spark-mllib" % "2.4.4",
  "org.plotly-scala" %% "plotly-render" % "0.7.2",
  "com.github.fommil.netlib" % "all" % "1.1.2" pomOnly()
)

and pip for PySpark (Python 3.6.5):
pip3 install pyspark==2.4.4



Re: Referencing a scala/java PipelineStage from pyspark - constructor issues with HasInputCol

srowen
That looks roughly right, though you will want to mark the Spark dependencies as provided. Do you need netlib directly? PySpark won't matter here if you're in Scala; what's installed with pip would not matter in any event.
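For example, in build.sbt (assuming the jar is deployed to a cluster that already ships these Spark modules):

```scala
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"  % "2.4.4" % "provided",
  "org.apache.spark" %% "spark-sql"   % "2.4.4" % "provided",
  "org.apache.spark" %% "spark-mllib" % "2.4.4" % "provided"
)
```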

On Tue, Aug 25, 2020 at 3:30 AM Aviad Klein <[hidden email]> wrote:

