[Arrow][Dremio]

[Arrow][Dremio]

xmehaut
Hello,
I have some questions about Spark and Apache Arrow. Up to now, Arrow is only
used for sharing data between Python and the Spark executors, instead of
transmitting it through sockets. I am currently studying Dremio as an
interesting way to access multiple sources of data, and as a potential
replacement for ETL tools, including Spark SQL.
It seems, if the promises actually hold, that Arrow and Dremio may be game
changers for these two purposes (data source abstraction, ETL tasks),
leaving Spark with the two remaining goals, i.e. ML/DL and graph
processing, which could be a danger for Spark in the medium term given the
rise of multiple frameworks in these areas.
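
For context, this is roughly what the current Arrow integration in PySpark
covers: a minimal sketch, assuming Spark 2.3+ with pyarrow installed (the
data and column names are purely illustrative).

from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import DoubleType

spark = (SparkSession.builder
         .appName("arrow-sketch")
         # Enables Arrow-based conversion for toPandas()/createDataFrame(pandas_df)
         .config("spark.sql.execution.arrow.enabled", "true")
         .getOrCreate())

df = spark.range(0, 1000).selectExpr("id", "id * 0.5 AS value")

# With the flag above, toPandas() ships Arrow record batches instead of pickled rows.
pdf = df.toPandas()

# Pandas (vectorized) UDFs also exchange Arrow batches between the executors and Python workers.
@pandas_udf(DoubleType())
def plus_one(v):
    return v + 1.0

df.select(plus_one("value")).show()
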
My questions are then:
- Is there a way to use Arrow more broadly in Spark itself, and not only
for sharing data?
- What are the strengths and weaknesses of Spark with respect to Arrow, and
consequently Dremio?
- What, finally, is the difference between Databricks DBIO and Dremio/Arrow?
- How do you see the future of Spark regarding these assumptions?
regards



Re: [Arrow][Dremio]

smikesh
Hi Xavier, 

Dremio looks really interesting and has a nice UI. I think the idea of replacing SSIS or similar tools with Dremio is not bad, but what about complex scenarios with a lot of code and transformations?
Is it possible to use Dremio via an API and define your own transformations and transformation workflows in Java or Scala?
I am not sure whether that is supported at all.
I think the Dremio team intends to give users access to the Sabot API so that Dremio can be used in the same way you can use Spark, but I am not sure whether that is possible today.
Have you also tried comparing performance with Spark? Are there any benchmarks?

Best,
Michael


Re: [Arrow][Dremio]

xmehaut
Hi Michaël,

I'm not an expert on Dremio; I'm just trying to evaluate the potential of
this technology, what impact it could have on Spark, how the two could work
together, and how Spark could make broader use of Arrow internally alongside
its existing algorithms.

Dremio already has a fairly rich API set that gives access, for instance, to
metadata and SQL queries, and even lets you create virtual datasets
programmatically. It also ships many predefined functions, and I imagine
there will be more and more functions in the future, e.g. machine learning
functions like the ones found in Azure SQL Server, which let you mix SQL and
ML. Access to Dremio is made through JDBC, so we can imagine accessing
virtual datasets through Spark and dynamically creating new datasets from
the API, connected to Parquet files written dynamically by Spark to HDFS,
Azure Data Lake, or S3... Of course, a tighter integration between the two
would be even better, with a Spark read/write connector for Dremio :)
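
For illustration, a minimal sketch of that JDBC pattern; the driver class
name, coordinator host/port, and dataset names below are assumptions for the
example, not taken from any particular Dremio release.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dremio-jdbc-sketch").getOrCreate()

# Read a Dremio virtual dataset over JDBC (the Dremio JDBC driver jar is assumed to be on the classpath).
vds = (spark.read
       .format("jdbc")
       .option("url", "jdbc:dremio:direct=dremio-host:31010")   # assumed coordinator host/port
       .option("driver", "com.dremio.jdbc.Driver")              # assumed driver class name
       .option("dbtable", '"space"."my_virtual_dataset"')       # a virtual dataset defined in Dremio
       .option("user", "user")
       .option("password", "password")
       .load())

# Spark can then write derived results back as Parquet on HDFS/ADLS/S3,
# which Dremio could in turn expose as a new dataset.
vds.write.mode("overwrite").parquet("s3a://some-bucket/derived/")
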

regards
xavier




Re: [Arrow][Dremio]

Pierce Lamb
Hi Xavier,

Along the lines of connecting to multiple sources of data and replacing ETL tools, you may want to check out Confluent's blog on building a real-time streaming ETL pipeline on Kafka, as well as SnappyData's blog on real-time streaming ETL with SnappyData, where Spark is central to connecting to multiple data sources, executing SQL on streams, etc. These should provide nice comparisons to your ideas about Dremio + Spark as ETL tools.
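
For a concrete flavor of that pattern, here is a minimal Structured Streaming
sketch; the broker address, topic name, and schema are illustrative
assumptions, and it presumes the spark-sql-kafka package is on the classpath.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StringType, DoubleType

spark = SparkSession.builder.appName("streaming-etl-sketch").getOrCreate()

schema = StructType().add("id", StringType()).add("amount", DoubleType())

# Read a Kafka topic as a stream and parse the JSON payload (broker/topic are assumptions).
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "events")
          .load()
          .select(from_json(col("value").cast("string"), schema).alias("e"))
          .select("e.*"))

# Run SQL over the stream and keep a continuously updated aggregate.
events.createOrReplaceTempView("events")
totals = spark.sql("SELECT id, sum(amount) AS total FROM events GROUP BY id")

query = (totals.writeStream
         .outputMode("complete")
         .format("memory")        # in-memory sink, just for illustration
         .queryName("totals")
         .start())
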

Disclaimer: I am a SnappyData employee

Hope this helps,

Pierce


Re: [Arrow][Dremio]

Bryan Cutler
Hi Xavier,

Regarding Arrow usage in Spark: using the Arrow format to transfer data between Python and Java has been the focus so far, because this area stood to benefit the most. It's possible that the scope of Arrow could broaden in the future, but there still need to be discussions about this.

Bryan


Re: [Arrow][Dremio]

xmehaut
Thanks, Bryan, for the answer.

Sent from my iPhone
