how to serve data over JDBC using simplest setup


Scott Ribe
I need a little help figuring out how some pieces fit together. I have some tables in parquet files, and I want to access them using SQL over JDBC. I gather that I need to run the thrift server, but how do I configure it to load my files into datasets and expose views?

The context is this: trying to figure out if we want to use Spark for historical data, and so far, just using spark shell for some experiments:

- I have established that we can easily export to Parquet and it is very efficient at storing this data
- Spark SQL queries the data with reasonable performance

Now I am at the step of testing whether the client-side that we are considering can deal effectively with querying the volume of data.

Which is why I'm looking for the simplest setup. If the client integration works, then yes we move on to configuring a proper cluster. (And it is a real question, I've already had one potential client-side piece be totally incompetent at handling a decent volume of data...)

(The environment I am working in is just the straight download of spark-3.0.1-bin-hadoop3.2)

--
Scott Ribe
[hidden email]
https://www.linkedin.com/in/scottribe/




---------------------------------------------------------------------
To unsubscribe e-mail: [hidden email]

Re: how to serve data over JDBC using simplest setup

Jeff Evans
If the data is already in Parquet files, I don't see any reason to involve JDBC at all.  You can read Parquet files directly into a DataFrame.  https://spark.apache.org/docs/latest/sql-data-sources-parquet.html



Re: how to serve data over JDBC using simplest setup

Scott Ribe
I have a client side piece that needs access via JDBC.





Re: how to serve data over JDBC using simplest setup

Jeff Evans
It sounds like the tool you're after, then, is a distributed SQL engine like Presto.  But I could be totally misunderstanding what you're trying to do.



Re: how to serve data over JDBC using simplest setup

Lalwani, Jayesh
In reply to this post by Scott Ribe
There are several step-by-step guides that you can find online:

https://spark.apache.org/docs/latest/sql-distributed-sql-engine.html
https://jaceklaskowski.gitbooks.io/mastering-spark-sql/content/spark-sql-thrift-server.html
https://medium.com/@saipeddy/setting-up-a-thrift-server-4eb0c55c11f0
https://docs.cloudera.com/HDPDocuments/HDP2/HDP-2.6.3/bk_spark-component-guide/content/config-sts.html

Have you tried any of those? Where are you getting stuck?
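For the single-node case those guides cover, the outline is roughly this, assuming the stock download with default settings (Thrift server on port 10000):

```shell
# From the unpacked spark-3.0.1-bin-hadoop3.2 directory:
./sbin/start-thriftserver.sh

# Then connect over JDBC with the bundled beeline client:
./bin/beeline -u jdbc:hive2://localhost:10000
```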



Re: how to serve data over JDBC using simplest setup

Scott Ribe

Thanks! The third one in your list I had not found, and it seems to fill in what I was missing (CREATE EXTERNAL TABLE).

I'd found the first two, but they only got me creating and querying tables in spark-shell, or launching a Hive server that had no data. (Google had also turned up a wide variety of irrelevant material, mostly about using JDBC from within Spark to import data, which I had figured out pretty quickly.)
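For the archives, the missing piece looks roughly like this, run through the Thrift server (e.g. from beeline); the table name, columns, and path are placeholders for my actual data:

```sql
-- Hive-style DDL pointing Spark at existing Parquet files
CREATE EXTERNAL TABLE trades (ts TIMESTAMP, symbol STRING, price DOUBLE)
STORED AS PARQUET
LOCATION '/data/parquet/trades';

-- The table is then visible to any JDBC client
SELECT count(*) FROM trades;
```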




Re: how to serve data over JDBC using simplest setup

Scott Ribe
In reply to this post by Jeff Evans

Presto may well be a longer-term solution as our use grows. For now, a simple data set loaded into Spark and served via JDBC (to be accessed via a Postgres foreign data wrapper) will get us the next small step.


Re: how to serve data over JDBC using simplest setup

Lalwani, Jayesh
Presto has slightly lower latency than Spark, but I've found that it gets stuck on some edge cases.

If you are on AWS, then the simplest solution is to use Athena. Athena is built on Presto, has a JDBC driver, and is serverless, so you don't have to deal with any of the operational headaches.
