Hive From Spark: Jdbc VS sparkContext

Nicolas Paris
Hi

I am wondering about the differences between two ways of accessing Hive tables from Spark:
- with JDBC access
- with the SparkContext

I would say that JDBC is better, since it goes through Hive, which is based on
MapReduce / Tez and therefore works on disk.
Using Spark RDDs can lead to memory errors on very large datasets.


Does anybody know more, or can anyone point me to relevant documentation?

---------------------------------------------------------------------
To unsubscribe e-mail: [hidden email]

Re: Hive From Spark: Jdbc VS sparkContext

Gourav Sengupta
Hi,

I am genuinely curious to see whether anyone responds to this question.

It's very hard to shake off Java, OOP and JDBC :)



Regards,
Gourav Sengupta 

Re: Hive From Spark: Jdbc VS sparkContext

ayan guha
Well, the obvious point is security: Ranger and Sentry can secure JDBC endpoints only. As for the performance aspect, I am equally curious 🤓

--
Best Regards,
Ayan Guha
Re: Hive From Spark: Jdbc VS sparkContext

郭鹏飞
In reply to this post by Nicolas Paris

JDBC will load the data into the driver node; this may slow things down and may cause an OOM.


Re: Hive From Spark: Jdbc VS sparkContext

ayan guha
That is not correct, IMHO. If I am not wrong, Spark will still load the data in the executors, by running some stats on the data itself to identify partitions...

--
Best Regards,
Ayan Guha
Re: Hive From Spark: Jdbc VS sparkContext

weand
Is Hive from Spark via JDBC working for you? In case it does, I would be
interested in your setup :-)

We can't get this working. See bug here, especially my last comment:
https://issues.apache.org/jira/browse/SPARK-21063

Regards
Andreas



RE: Hive From Spark: Jdbc VS sparkContext

Walia, Reema
I am able to connect to Spark via JDBC (tested with Squirrel). I am referencing all the jars of the current Spark distribution under /usr/hdp/current/spark2-client/jars/*

Thanks,
Reema


Re: Hive From Spark: Jdbc VS sparkContext

Gourav Sengupta
In reply to this post by weand
Hi,

I do not think that Spark will automatically determine the partitions. In fact, it does not. If a table has a few million records, it all goes through the driver.


Of course, I have only tried JDBC connections with Aurora, Oracle and Postgres.

Regards,
Gourav Sengupta

Re: Hive From Spark: Jdbc VS sparkContext

Nicolas Paris
> In case a table has a few
> million records, it all goes through the driver.

This is clear in JDBC mode: the driver gets all the rows and then
spreads the RDD over the executors.

I'd say that most use cases rely on SQL to aggregate huge datasets
and retrieve a small number of rows, which are then transformed for ML tasks.
In that case JDBC offers the robustness of Hive to produce a small aggregated
dataset for Spark, whereas Spark SQL uses RDDs to produce the small
dataset from the huge one.

It is not very clear how Spark SQL deals with a huge Hive table. Does it load
everything into memory and crash, or does this never happen?


Re: Hive From Spark: Jdbc VS sparkContext

Kabeer Ahmed
My take on this might sound a bit different. Here are a few points to consider:

1. Going through Hive JDBC means that the application is restricted by the number of queries that can be compiled. HS2 can only compile one SQL statement at a time, and if users have bad SQL, it can take a long time just to compile (compilation alone, not the MapReduce execution). This reduces the query throughput, i.e. the number of queries you can fire through JDBC.

2. Going through Hive JDBC does have the advantage that the HMS service is protected. The JIRA https://issues.apache.org/jira/browse/HIVE-13884 protects HMS from crashing: at the end of the day, retrieving metadata about a Hive table that may have millions, or even just thousands, of partitions hits the JVM limit on the size of the array that holds the retrieved metadata. When the JVM array size limit is hit, HMS crashes. So in effect this is good to have, to protect HMS and the relational database at its back end.

Note: the Hive community does propose moving the database to HBase, which scales, but I don't think this will be implemented soon.

3. Going through the SparkContext, Spark interfaces directly with the Hive Metastore. I have tried to lay out the sequence of the code flow below. The bit I didn't have time to dive into is this: I believe that if the table is really large, i.e. the table has more than 32K partitions (the size of a short), then some sort of slicing does occur (I didn't have time to find this piece of code, but from experience this does seem to occur).

Code flow:
Spark uses Hive External catalog - goo.gl/7CZcDw
HiveClient version of getPartitions is -> goo.gl/ZAEsqQ
HiveClientImpl of getPartitions is: -> goo.gl/msPrr5
The Hive call is made at: -> goo.gl/TB4NFU
ThriftHiveMetastore.java ->  get_partitions_ps_with_auth

A value of -1 is passed through Spark all the way to the Hive Metastore Thrift call. So in effect, for large tables, 32K partitions are retrieved at a time. This has also led to a few HMS crashes, but I have yet to identify whether this is really the cause.


Based on the 3 points above, I would prefer to use the SparkContext. If the cause of the crashes is indeed the retrieval of a high number of partitions, then I may opt for the JDBC route.

Thanks
Kabeer.


Re: Hive From Spark: Jdbc VS sparkContext

Nicolas Paris
In reply to this post by Nicolas Paris
On 3 Oct 2017 at 20:08, Nicolas Paris wrote:
> I wonder the differences accessing HIVE tables in two different ways:
> - with jdbc access
> - with sparkContext

Well, there is also a third way to access Hive data from Spark:
- with direct file access (here in ORC format)


For example:

// spark-shell snippet: `sc` is the SparkContext provided by the shell
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
// push WHERE predicates down into the ORC reader
sqlContext.setConf("spark.sql.orc.filterPushdown", "true")
// read the ORC files by path, without looking the table up in the metastore
val people = sqlContext.read.format("orc").load("hdfs://cluster//orc_people")
people.createOrReplaceTempView("people")
sqlContext.sql("SELECT count(1) FROM people WHERE ...").show()


This method looks much faster than both:
- with JDBC access
- with the SparkContext

Any experience with that?


Re: Hive From Spark: Jdbc VS sparkContext

Gourav Sengupta
Hi Nicolas,

What if the table has partitions and sub-partitions? And what if you do not want to access the entire data?


Regards,
Gourav

Re: Hive From Spark: Jdbc VS sparkContext

Nicolas Paris
Hi Gourav

> what if the table has partitions and sub-partitions?

Well, this also works with multiple ORC files that share the same schema:
val people = sqlContext.read.format("orc").load("hdfs://cluster/people*")
Am I missing something?

> And you do not want to access the entire data?

This works for static datasets. When new data arrives through batch
processes, the Spark application has to be reloaded to pick up the new files
in the folder.
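
For the partition question, one option worth noting is Spark's partition discovery for file sources: when the files are laid out in `column=value` directories, a filter on that column lets Spark skip the other directories. The sketch below is a local demo under stated assumptions, not the setup from this thread. It writes a tiny ORC dataset partitioned by a hypothetical `year` column into a temporary directory and reads one partition back. It assumes a Spark 2.x build with ORC support on the classpath (the spark-hive module on older releases) and uses `local[1]` only so that it can be tried without a cluster.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

object PartitionedOrcRead {
  // Writes a small ORC dataset partitioned by `year`, then counts the rows of one year.
  def countForYear(spark: SparkSession, year: Int): Long = {
    import spark.implicits._

    // Hypothetical sample data and a throwaway local directory.
    val dir = Files.createTempDirectory("orc_people").resolve("data").toString
    Seq(("alice", 2016), ("bob", 2017), ("carol", 2017))
      .toDF("name", "year")
      .write
      .partitionBy("year") // produces year=2016/ and year=2017/ sub-directories
      .orc(dir)

    // The filter is on the partition column, so only the matching directory needs scanning.
    spark.read.orc(dir).where($"year" === year).count()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]") // assumption: local run, for illustration only
      .appName("partitioned-orc-read")
      .getOrCreate()
    try println(countForYear(spark, 2017))
    finally spark.stop()
  }
}
```

Whether an existing Hive table's directories follow this layout depends on how the table was written, so this should be checked before relying on it.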


Re: Hive From Spark: Jdbc VS sparkContext

Nicolas Paris
In reply to this post by Gourav Sengupta
> I do not think that SPARK will automatically determine the partitions. Actually
> it does not automatically determine the partitions. In case a table has a few
> million records, it all goes through the driver.

Hi Gourav

Actually, Spark's JDBC reader is able to deal directly with partitions:
Spark creates a JDBC connection for each partition.

All the details are explained in this post:
http://www.gatorsmile.io/numpartitionsinjdbc/

There is also an example with the Greenplum database:
http://engineering.pivotal.io/post/getting-started-with-greenplum-spark/
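
To make the partitioning behaviour concrete, here is a minimal sketch in plain Scala of how a numeric column range can be cut into one WHERE clause per partition. It is a simplified illustration, not Spark's actual code: the object and method names are made up for this example, and only the parameter names mirror the real Spark JDBC reader options `partitionColumn`, `lowerBound`, `upperBound` and `numPartitions`.

```scala
object JdbcPartitionSketch {
  /** Builds one WHERE clause per partition by cutting the range into equal strides. */
  def predicates(partitionColumn: String,
                 lowerBound: Long,
                 upperBound: Long,
                 numPartitions: Int): Seq[String] = {
    // A single-partition read needs no predicate, so this sketch starts at two.
    require(numPartitions >= 2, "this sketch covers partitioned reads only")
    require(upperBound - lowerBound >= numPartitions, "range too small for the stride")

    val stride = (upperBound - lowerBound) / numPartitions
    (0 until numPartitions).map { i =>
      val lo = lowerBound + i * stride
      val hi = lo + stride
      if (i == 0) s"$partitionColumn < $hi OR $partitionColumn IS NULL" // open-ended below, also catches NULLs
      else if (i == numPartitions - 1) s"$partitionColumn >= $lo"       // open-ended above
      else s"$partitionColumn >= $lo AND $partitionColumn < $hi"
    }
  }

  def main(args: Array[String]): Unit =
    predicates("id", 0L, 100L, 4).foreach(println)
}
```

Each clause would back one partition, read over its own JDBC connection. In this sketch the first and last clauses are open-ended, so the bounds only set the stride and do not filter any rows out.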

Re: Hive From Spark: Jdbc VS sparkContext

Gourav Sengupta
Hi Nicolas,

Without the Hive Thrift server, if you try to run a SELECT * on a table which has around 10,000 partitions, Spark will give you some surprises. Presto works fine in these scenarios, and I am sure the Spark community will soon learn from their algorithms.


Regards,
Gourav

Re: Hive From Spark: Jdbc VS sparkContext

Nicolas Paris
Hi

After some testing, I have been quite disappointed with the hiveContext way of
accessing Hive tables.

The main problem is resource allocation: I have tons of users and each of them
gets a limited subset of the workers. This does not allow them to query huge
datasets, because too little memory is allocated (or maybe I am missing
something).

When using Hive JDBC, Hive resources are shared by all my users, and
queries are able to finish.

I have since been testing other JDBC-based approaches and, for now, Presto
looks like the most appropriate solution for accessing Hive tables.

To load huge datasets into Spark, the proposed approach is to
use a Presto distributed CTAS to build an ORC dataset, and to access that
dataset through Spark's DataFrame loader (instead of direct JDBC
access, which would break the driver memory).



Re: Hive From Spark: Jdbc VS sparkContext

Gourav Sengupta
Hi Nicolas,


Thanks a ton for your kind response. Have you used SparkSession? I think that hiveContext is a very old way of solving things in Spark, and since then new algorithms have been introduced in Spark.

It would be a great help, given how kind you have been in sharing your experience, if you could kindly share your code as well and provide details such as the Spark, Hadoop and Hive versions and other details of your environment.

After all, no one wants to use Spark 1.x to solve problems anymore, though I have seen a couple of companies that are stuck with these versions because they use in-house deployments which they cannot upgrade due to incompatibility issues.


Regards,
Gourav Sengupta


Re: Hive From Spark: Jdbc VS sparkContext

Nicolas Paris
On 5 Nov 2017 at 14:11, Gourav Sengupta wrote:
> thanks a ton for your kind response. Have you used SPARK Session ? I think that
> hiveContext is a very old way of solving things in SPARK, and since then new
> algorithms have been introduced in SPARK. 

I will give SparkSession a try.

> It will be a lot of help, given how kind you have been by sharing your
> experience, if you could kindly share your code as well and provide details
> like SPARK , HADOOP, HIVE, and other environment version and details.

I am testing an HDP 2.6 distribution, with:
SPARK: 2.1.1
HADOOP: 2.7.3
HIVE: 1.2.1000
PRESTO: 1.87

> After all, no one wants to use SPARK 1.x version to solve problems anymore,
> though I have seen couple of companies who are stuck with these versions as
> they are using in house deployments which they cannot upgrade because of
> incompatibility issues.

I didn't know hiveContext was the legacy Spark way. I will give
SparkSession a try and then conclude. After all, I would prefer to provide our
users with a single, uniform framework such as Spark, instead of multiple
complicated layers such as Spark + some JDBC access.
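
For reference, a minimal sketch of what the SparkSession route could look like. Since Spark 2.0, `SparkSession` with `enableHiveSupport()` is the entry point that replaces `HiveContext`. The table name `default.people` is hypothetical, and the sketch assumes that the spark-hive module and a `hive-site.xml` pointing at the metastore are available to the application.

```scala
import org.apache.spark.sql.SparkSession

object HiveViaSparkSession {
  // Counts the rows of a table (or view) known to the session's catalog.
  def rowCount(spark: SparkSession, table: String): Long =
    spark.table(table).count()

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hive-via-sparksession")
      .enableHiveSupport() // use the Hive metastore as the session catalog
      .getOrCreate()
    try println(rowCount(spark, "default.people")) // hypothetical table name
    finally spark.stop()
  }
}
```

The same session also accepts plain SQL through `spark.sql(...)`, so existing HiveContext queries should carry over with little change.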

RE: Hive From Spark: Jdbc VS sparkContext

David Hodeffi
Testing Spark group e-mail

Re: Hive From Spark: Jdbc VS sparkContext

ayan guha
Hi

Can you confirm whether the JDBC DataFrame reader actually loads all the data from the source into driver memory and then distributes it to the executors? And is this true even when a partition column is provided?

Best
Ayan
