|
|
Hello!
I have been tearing my hair out trying to solve this problem. Here is my setup:
1. I have Spark running on a server in standalone mode with data on the filesystem of the server itself (/opt/data/).
2. I have an instance of a Hive Metastore server running (backed by MariaDB) on the same server.
3. I have a laptop where I am developing my Spark jobs (Scala).
I have configured Spark to use the metastore and set the warehouse directory to be in /opt/data/warehouse/. What I am trying to accomplish are a couple of things:
1. I am trying to submit Spark jobs (via JARs) using spark-submit, but have the driver run on my local machine (my laptop). I want the jobs to use the data ON THE SERVER and not try to reference it from my local machine. If I do something like this:
val df = spark.sql("SELECT * FROM parquet.`/opt/data/transactions.parquet`")
I get an error that the path doesn't exist (because Spark is looking for it on my laptop). If I run the same thing in a spark-shell on the Spark server itself, there is no issue because the driver has access to the data. If I submit the job with --deploy-mode cluster it also works, because the driver then runs on the cluster. But I don't want that; I want the results on my laptop.
How can I force Spark to read the data from the cluster's filesystem and not the driver's?
2. I have set up a Hive Metastore and created a table (in the spark-shell on the Spark server itself). The data in the warehouse is on the server's local filesystem. When I build a Spark application JAR and try to run it from my laptop, I hit the same problem as #1: it tries to find the warehouse directory on my laptop itself.
Am I crazy? Perhaps this isn't a supported way to use Spark? Any help or insights are much appreciated!
-Ryan Victory
|
|
Hi Ryan,
Since the driver runs on your laptop, I guess you need to specify a full URL in order to access a remote file.
For example, when I use Spark over HDFS I reference files as hdfs://blablabla, i.e. a URL pointing to where the namenode can answer. I believe something similar must be done here.
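For instance, something like this (the host and port are just placeholders, I do not know your actual setup or which shared filesystem you will end up exposing):

  // Placeholder URL -- substitute whatever shared/remote filesystem the server exposes
  val df = spark.sql(
    "SELECT * FROM parquet.`hdfs://your-namenode:8020/opt/data/transactions.parquet`"
  )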
all the best,
Apostolos
On 25/11/20 16:51, Ryan Victory wrote:
> [snip]
--
Apostolos N. Papadopoulos, Associate Professor
Department of Informatics
Aristotle University of Thessaloniki
Thessaloniki, GREECE
tel: ++0030312310991918
email: [hidden email]
twitter: @papadopoulos_ap
web: http://datalab.csd.auth.gr/~apostol
|
|
Thanks Apostolos,
I'm trying to avoid standing up HDFS just for this use case (single node).
-Ryan

On Wed, Nov 25, 2020 at 8:56 AM Apostolos N. Papadopoulos <[hidden email]> wrote:
> [snip]
|
|
I'm also curious whether this is possible; while I can't offer a solution, maybe you could try the following.
The driver and executor nodes need to have access to the same (distributed) file system, so you could try to mount the file system to your laptop, locally, and then try to submit jobs and/or use the spark-shell while connected to the same system.
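For example, if /opt/data were mounted at the same path on your laptop, a fully qualified local path should then resolve on both the driver and the executors. Just a sketch; the mount itself (NFS or similar) is assumed to be in place:

  // Assumes /opt/data is already mounted (e.g. via NFS) at the same path on the
  // laptop (driver) and on the Spark server (executors)
  val df = spark.read.parquet("file:///opt/data/transactions.parquet")
  df.show(10)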
On Wed, 2020-11-25 at 08:51 -0600, Ryan Victory wrote:
> [snip]
|
|
A key part of what I'm trying to do involves NOT having to bring the data "through" the driver in order to get the cluster to work on it (which would involve a network hop from server to laptop and another from laptop back to server). I'd rather have the data stay on the server and the driver stay on my laptop if possible, but I'm guessing the Spark APIs/topology weren't designed that way.
What I was hoping for was some way to say val df = spark.sql("SELECT * FROM parquet.`local:///opt/data/transactions.parquet`") or similar, to convince Spark not to move the data. I'd imagine that if I used HDFS, data locality would kick in anyway and prevent the network transfers between the driver and the cluster, but based on what you're saying I wonder if even that is wrong.
Perhaps I'll just have to modify the workflow to move the JAR to the server and execute it from there. This isn't ideal but it's better than nothing.
-Ryan

On Wed, Nov 25, 2020 at 9:13 AM Chris Coutinho <[hidden email]> wrote:
> [snip]
|
|
In your situation, I'd try to do one of the following (in decreasing order of personal preference):
- Restructure things so that you can operate on a local data file, at least for the purpose of developing your driver logic. Don't rely on the Metastore or HDFS until you have to. Structure the application logic so it operates on a DataFrame (or Dataset) and doesn't care where it came from (see the sketch after this list). Build this data file from your real data (probably a small subset).
- Develop the logic using spark-shell running on a cluster node, since the environment will be all set up already (which, of course, you already mentioned).
- Set up remote debugging of the driver, open an SSH tunnel to the node, and connect from your local laptop to debug/iterate. Figure out the fastest way to rebuild the jar and scp it up to try again.
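A rough sketch of what I mean by the first point (object, table, and column names are made up; only the shape matters):

  import org.apache.spark.sql.{DataFrame, SparkSession}
  import org.apache.spark.sql.functions._

  object TransactionJob {
    // Core logic takes a DataFrame and doesn't care where it came from.
    // Column names here are purely illustrative.
    def summarize(transactions: DataFrame): DataFrame =
      transactions.groupBy(col("account_id")).agg(sum(col("amount")).as("total"))

    def main(args: Array[String]): Unit = {
      val spark = SparkSession.builder().appName("transactions").getOrCreate()
      // The input path comes from the command line, so the same jar can run
      // against a small local sample on the laptop or the real /opt/data on the server.
      val input = spark.read.parquet(args(0))
      summarize(input).show()
      spark.stop()
    }
  }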
|
|
This is a typical file-sharing problem in Spark. Just setting up HDFS won't solve it unless you make your local machine part of the cluster. The Spark server doesn't share files with your local machine unless drives are mounted between them. The best/easiest way to share data between your local machine and the Spark server is NFS (as the Spark manual suggests). You can use a common NFS server and mount the /opt/data drive on both the local and the server machine, or run NFS on either machine and mount /opt/data on the other. Either way, you have to ensure that /opt/data on both machines points to the same physical drive. Also don't forget to relax the read/write permissions on the drive for all users, or map the user IDs between the two machines.
Using FUSE may be an option on a Mac, but NFS is the standard solution for this type of problem (macOS supports NFS as well).
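Once /opt/data (including /opt/data/warehouse) is visible at the same path on both machines, the metastore-backed table from your question #2 should also resolve from the laptop. Roughly like this (table and column names are just examples, and an active SparkSession named "spark" is assumed):

  // Assumes the warehouse directory /opt/data/warehouse is shared (e.g. over NFS)
  // and mounted at the same path on the laptop and the server.
  // "transactions" and "account_id" are illustrative names only.
  val tx = spark.sql("SELECT * FROM transactions")
  tx.groupBy("account_id").count().show()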
-- ND
On 11/25/20 10:34 AM, Ryan Victory wrote:
> [snip]
|
|
Ah, I almost forgot that there is an even easier solution for your problem, namely the --files option of spark-submit. Usage as follows:

  --files FILES    Comma-separated list of files to be placed in the working
                   directory of each executor. File paths of these files in
                   executors can be accessed via SparkFiles.get(fileName).
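A rough sketch of how that is typically used (the file name and the logic are illustrative, and an active SparkSession named "spark" is assumed). Note that the file is resolved per node inside tasks via SparkFiles.get, rather than read through spark.read:

  import org.apache.spark.SparkFiles
  import scala.io.Source

  // Submitted with, e.g.:
  //   spark-submit --files /opt/data/lookup.csv --class ... app.jar
  // so each executor gets its own copy of lookup.csv in its working directory.
  val counts = spark.sparkContext
    .parallelize(1 to 10)
    .mapPartitions { iter =>
      // Resolves the local copy on whichever node runs this task
      val path = SparkFiles.get("lookup.csv")
      val lines = Source.fromFile(path).getLines().toList
      iter.map(i => (i, lines.size))
    }
  counts.collect().foreach(println)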
-- ND
On 11/25/20 9:51 PM, Artemis User wrote:
> [snip]
|
|
NFS is a simple option for this kind of usage, yes. But --files makes N copies of the data, which you may not want for large data, or for data that you need to mutate.
|
|