Questions on Python support with Spark


Questions on Python support with Spark

Arijit Tarafdar

Hello All,

 

We have a requirement to run PySpark in standalone cluster mode and to reference Python libraries (egg/wheel files) that are not local but are stored in distributed storage such as HDFS. From the code, it looks like neither case is supported.

 

Questions are:

 

  1. Why is PySpark supported only in client mode on a standalone cluster?
  2. Why does --py-files support only local files and not files stored in remote stores?

 

We would like to update the Spark code to support these scenarios, but we first want to be aware of any technical difficulties the community has faced while trying to support them.

 

Thanks, Arijit


Re: Questions on Python support with Spark

Patrick McCarthy-2
I've never tried to run a standalone cluster alongside Hadoop, but why not run Spark as a YARN application? That way it can absolutely (in fact, preferably) use the distributed file system.
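As a sketch of that suggestion (the application name, HDFS paths, and file names below are illustrative, not from this thread), a YARN cluster-mode submission can reference Python dependencies directly from HDFS:

```shell
# Illustrative sketch: submit a PySpark job in YARN cluster mode,
# pulling the driver script and its Python dependencies from HDFS.
# All paths and names here are hypothetical.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --name example-pyspark-app \
  --py-files hdfs:///libs/deps.egg,hdfs:///libs/extra.whl \
  hdfs:///apps/main.py
```

On YARN, remote URIs passed to --py-files are localized to the executors by the resource manager, which is why this works there but not against a standalone master, where the flag expects locally accessible paths.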

On Fri, Nov 9, 2018 at 5:04 PM, Arijit Tarafdar <[hidden email]> wrote:
