Spark DataFrame Creation
Spark DataFrame Creation

Mark Bidewell
Sorry if this is the wrong place for this.  I am trying to debug an issue with this library:

When I attempt to create a dataframe:

spark.read.
            format("com.springml.spark.sftp").
            option("host", "...").
            option("username", "...").
            option("password", "...").
            option("fileType", "csv").
            option("inferSchema", "true").
            option("tempLocation", "/srv/spark/tmp").
            option("hdfsTempLocation", "/srv/spark/tmp").
            load("...")

What I am seeing is that the download occurs on the Spark driver, not on a Spark worker. This leads to a failure when Spark tries to create the DataFrame on the worker.

I'm confused by this behavior. My understanding was that load() was executed lazily on the Spark workers. Why would some steps execute on the driver?

Thanks for your help
--
Mark Bidewell
http://www.linkedin.com/in/markbidewell

Re: Spark DataFrame Creation

srowen
You'd probably do best to ask that project, but scanning the source
code, that looks like how it's meant to work. It downloads to a
temp file on the driver, then copies it to distributed storage, and then
returns a DataFrame over that copy. I can't see how it would be implemented
directly over sftp, as there would be so many pieces missing:
locality, blocking, etc.
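
The flow described above can be sketched roughly like this (a minimal illustration of the pattern, not the library's actual code; all paths and values below are hypothetical):

```scala
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("sftp-sketch").getOrCreate()

// Step 1: the connector's SFTP client downloads the remote file to a
// driver-local temp path (stand-in value; the real download is elided).
val localTemp = "/srv/spark/tmp/input.csv"

// Step 2: copy the driver-local file into distributed storage (HDFS),
// so that executors can actually reach it.
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
fs.copyFromLocalFile(new Path(localTemp), new Path("/srv/spark/tmp/input.csv"))

// Step 3: a normal distributed read over the HDFS copy.
val df = spark.read.option("inferSchema", "true").csv("/srv/spark/tmp/input.csv")
```

Only step 3 is lazy and distributed; steps 1 and 2 necessarily run on the driver, which matches the behavior you're seeing.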

On Wed, Jul 22, 2020 at 4:48 PM Mark Bidewell <[hidden email]> wrote:


---------------------------------------------------------------------
To unsubscribe e-mail: [hidden email]


Re: Spark DataFrame Creation

Andrew Melo
Hi Mark,

On Wed, Jul 22, 2020 at 4:49 PM Mark Bidewell <[hidden email]> wrote:

>
> I'm confused by this behavior. My understanding was that load() was executed lazily on the Spark workers. Why would some steps execute on the driver?

Looking at the code, it appears that your sftp plugin downloads the
file to a driver-local location and opens it from there:

https://github.com/springml/spark-sftp/blob/090917547001574afa93cddaf2a022151a3f4260/src/main/scala/com/springml/spark/sftp/DefaultSource.scala#L38

You may have more luck with an SFTP Hadoop filesystem plugin that can
read sftp:// URLs directly.
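
For example, recent Hadoop versions ship a built-in SFTP filesystem (org.apache.hadoop.fs.sftp.SFTPFileSystem). A rough sketch of wiring it up follows; the host, user, and credential values are placeholders, and the exact configuration key names should be verified against your Hadoop version's documentation:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("sftp-direct").getOrCreate()

// Register Hadoop's built-in SFTP filesystem so executors can open
// sftp:// URLs themselves, rather than the driver downloading the file
// first. Host and credentials here are hypothetical stand-ins.
val hadoopConf = spark.sparkContext.hadoopConfiguration
hadoopConf.set("fs.sftp.impl", "org.apache.hadoop.fs.sftp.SFTPFileSystem")
hadoopConf.set("fs.sftp.user.example.com", "myuser")
hadoopConf.set("fs.sftp.password.example.com.myuser", "...")

// A plain distributed read over the sftp:// URL.
val df = spark.read
  .option("inferSchema", "true")
  .csv("sftp://example.com/path/to/file.csv")
```

Note that reading directly over SFTP still has the limitations mentioned earlier in the thread (no locality, a single remote endpoint), but it avoids staging the file on the driver.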

Cheers
Andrew