Specifying different version of pyspark.zip and py4j files on worker nodes with Spark pre-installed

Jianshi Huang
Hi,

I have a problem using multiple versions of PySpark on YARN. The driver and worker nodes all have Spark 2.2.1 preinstalled for production tasks, and I want to use 2.3.2 for my personal EDA.

I've tried both the 'pyFiles=' option and sparkContext.addPyFiles(); however, on the worker nodes the PYTHONPATH still points at the system SPARK_HOME.

Does anyone know how to override the PYTHONPATH on the worker nodes?

Here's the error message:

Py4JJavaError: An error occurred while calling o75.collectToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, emr-worker-8.cluster-68492, executor 2): org.apache.spark.SparkException:
Error from python worker:
  Traceback (most recent call last):
    File "/usr/local/Python-3.6.4/lib/python3.6/runpy.py", line 183, in _run_module_as_main
      mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
    File "/usr/local/Python-3.6.4/lib/python3.6/runpy.py", line 109, in _get_module_details
      __import__(pkg_name)
    File "/usr/lib/spark-current/python/lib/pyspark.zip/pyspark/__init__.py", line 46, in <module>
    File "/usr/lib/spark-current/python/lib/pyspark.zip/pyspark/context.py", line 29, in <module>
  ModuleNotFoundError: No module named 'py4j'
PYTHONPATH was:
  /usr/lib/spark-current/python/lib/pyspark.zip:/usr/lib/spark-current/python/lib/py4j-0.10.7-src.zip:/mnt/disk1/yarn/usercache/jianshi.huang/filecache/130/__spark_libs__5227988272944669714.zip/spark-core_2.11-2.3.2.jar

And here's how I started the PySpark session in Jupyter:

%env SPARK_HOME=/opt/apps/ecm/service/spark/2.3.2-bin-hadoop2.7
%env PYSPARK_PYTHON=/usr/bin/python3
import findspark
findspark.init()
import pyspark
sparkConf = pyspark.SparkConf()
sparkConf.setAll([
    ('spark.cores.max', '96')
    ,('spark.driver.memory', '2g')
    ,('spark.executor.cores', '4')
    ,('spark.executor.instances', '2')
    ,('spark.executor.memory', '4g')
    ,('spark.network.timeout', '800')
    ,('spark.scheduler.mode', 'FAIR')
    ,('spark.shuffle.service.enabled', 'true')
    ,('spark.dynamicAllocation.enabled', 'true')
])
py_files = ['hdfs://emr-header-1.cluster-68492:9000/lib/py4j-0.10.7-src.zip']
sc = pyspark.SparkContext(appName="Jianshi", master="yarn-client", conf=sparkConf, pyFiles=py_files)
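
For reference, one variation sometimes tried is to ship the matching pyspark.zip alongside the py4j archive, so that both Python packages come from files distributed through YARN rather than from the node-local install. A minimal sketch under that assumption (the pyspark.zip upload path below is hypothetical, and this thread does not confirm whether it alone fixes the worker PYTHONPATH):

# Sketch only: the pyspark.zip upload path is an assumption, not taken from this thread.
py_files = [
    'hdfs://emr-header-1.cluster-68492:9000/lib/pyspark.zip',          # hypothetical 2.3.2 pyspark.zip
    'hdfs://emr-header-1.cluster-68492:9000/lib/py4j-0.10.7-src.zip',
]
sc = pyspark.SparkContext(appName="Jianshi", master="yarn-client",
                          conf=sparkConf, pyFiles=py_files)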



Thanks,
--
Jianshi Huang


Re: Specifying different version of pyspark.zip and py4j files on worker nodes with Spark pre-installed

Apostolos N. Papadopoulos

Maybe this can help.

https://stackoverflow.com/questions/32959723/set-python-path-for-spark-worker
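
The linked question discusses setting the Python path for the workers directly through Spark configuration. A minimal sketch of that idea, reusing the sparkConf object from the original post and run before the SparkContext is created (whether spark.executorEnv.PYTHONPATH actually reaches the Python worker launch on this cluster is not verified in this thread):

# Sketch of the idea from the linked question; the values are assumptions.
sparkConf.set('spark.executorEnv.PYTHONPATH', 'pyspark.zip:py4j-0.10.7-src.zip')
sparkConf.set('spark.yarn.appMasterEnv.PYTHONPATH', 'pyspark.zip:py4j-0.10.7-src.zip')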



-- 
Apostolos N. Papadopoulos, Associate Professor
Department of Informatics
Aristotle University of Thessaloniki
Thessaloniki, GREECE
tel: ++0030312310991918
email: [hidden email]
twitter: @papadopoulos_ap
web: http://datalab.csd.auth.gr/~apostol

Re: Specifying different version of pyspark.zip and py4j files on worker nodes with Spark pre-installed

Marcelo Vanzin-2
In reply to this post by Jianshi Huang
Normally the version of Spark installed on the cluster does not
matter, since Spark is uploaded from your gateway machine to YARN by
default.

You probably have some configuration (in spark-defaults.conf) that
tells YARN to use a cached copy. Get rid of that configuration, and
you can use whatever version you like.
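
A quick way to check for that kind of setting from the notebook is to read the resolved configuration. A small sketch, assuming the sc from the original post (spark.yarn.archive and spark.yarn.jars are the usual properties that point YARN at a pre-staged copy of the Spark jars):

# Sketch: look for a cached/pre-staged Spark copy in the resolved configuration.
for key in ('spark.yarn.archive', 'spark.yarn.jars'):
    print(key, '=', sc.getConf().get(key, '<not set>'))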

--
Marcelo



Re: Specifying different version of pyspark.zip and py4j files on worker nodes with Spark pre-installed

Jianshi Huang

https://github.com/apache/spark/blob/88e7e87bd5c052e10f52d4bb97a9d78f5b524128/core/src/main/scala/org/apache/spark/api/python/PythonUtils.scala#L31

The code shows that Spark builds the Python path from SPARK_HOME if it is set. And on my worker nodes, SPARK_HOME is set in .bashrc to the pre-installed 2.2.1 path.

I don't want to make any changes to the worker node configuration, so is there any way to override the order?

Jianshi


--
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/

Re: Specifying different version of pyspark.zip and py4j files on worker nodes with Spark pre-installed

Marcelo Vanzin-2
Try "spark.executorEnv.SPARK_HOME=$PWD" (in quotes so it does not get
expanded by the shell).

But it's really weird to be setting SPARK_HOME in the environment of
your node managers. YARN shouldn't need to know about that.
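
Applied to the notebook session from the original post, that suggestion would look roughly like the sketch below, added before pyspark.SparkContext(...) is called. The literal $PWD is meant to be resolved inside the YARN container, not by the local shell:

# Sketch of the suggestion above; keep '$PWD' as a literal string so YARN resolves it.
sparkConf.set('spark.executorEnv.SPARK_HOME', '$PWD')
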
--
Marcelo



Re: Specifying different version of pyspark.zip and py4j files on worker nodes with Spark pre-installed

Gourav Sengupta
Hi Marcelo,
It would be great if you could illustrate what you mean; I would be interested to know.

Hi Jianshi,
so just to be sure: you want to work with Spark 2.3.2 while having Spark 2.2.1 installed in your cluster?

Regards,
Gourav Sengupta



Re: Specifying different version of pyspark.zip and py4j files on worker nodes with Spark pre-installed

Jianshi Huang
Yes, that's right. 


--
Jianshi Huang


Re: Specifying different version of pyspark.zip and py4j files on worker nodes with Spark pre-installed

Jianshi Huang
In reply to this post by Marcelo Vanzin-2
Thanks Marcelo,

But I don't want to install 2.3.2 on the worker nodes. I just want Spark to use the paths of the files uploaded to YARN instead of the node-local SPARK_HOME.
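
One way that idea is sometimes expressed in configuration is to distribute an archive through YARN and point the executors' PYTHONPATH at the extracted, container-local copy. A sketch under that assumption only; the archive name, the #alias fragment, and the internal layout are all hypothetical, and this is not confirmed to work anywhere in this thread:

# Sketch only: every name and path below is an assumption, not taken from this thread.
sparkConf.set('spark.yarn.dist.archives',
              'hdfs://emr-header-1.cluster-68492:9000/lib/spark-2.3.2-python.zip#pyspark232')
sparkConf.set('spark.executorEnv.PYTHONPATH',
              'pyspark232/pyspark.zip:pyspark232/py4j-0.10.7-src.zip')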

--
Jianshi Huang


Re: Specifying different version of pyspark.zip and py4j files on worker nodes with Spark pre-installed

Jianshi Huang
In reply to this post by Marcelo Vanzin-2
Hi Marcelo,

I see what you mean. I tried it, but I still got the same error message:

Error from python worker:
  Traceback (most recent call last):
    File "/usr/local/Python-3.6.4/lib/python3.6/runpy.py", line 183, in _run_module_as_main
      mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
    File "/usr/local/Python-3.6.4/lib/python3.6/runpy.py", line 109, in _get_module_details
      __import__(pkg_name)
    File "/usr/lib/spark-current/python/lib/pyspark.zip/pyspark/__init__.py", line 46, in <module>
    File "/usr/lib/spark-current/python/lib/pyspark.zip/pyspark/context.py", line 29, in <module>
  ModuleNotFoundError: No module named 'py4j'
PYTHONPATH was:
  /usr/lib/spark-current/python/lib/pyspark.zip:/usr/lib/spark-current/python/lib/py4j-0.10.7-src.zip:/mnt/disk3/yarn/usercache/jianshi.huang/filecache/134/__spark_libs__8468485589501316413.zip/spark-core_2.11-2.3.2.jar

--
Jianshi Huang


Re: Specifying different version of pyspark.zip and py4j files on worker nodes with Spark pre-installed

Marcelo Vanzin-2
Sorry, I can't help you if that doesn't work. Your YARN RM really
should not have SPARK_HOME set if you want to use more than one Spark
version.

--
Marcelo
