Join pushdown on two external tables from the same external source?
I'm trying to figure out how to join multiple tables from a single external source directly in Spark SQL. Say I do the following in Spark SQL:
CREATE OR REPLACE TEMPORARY VIEW t1 USING jdbc OPTIONS (dbtable 't1', ...)
CREATE OR REPLACE TEMPORARY VIEW t2 USING jdbc OPTIONS (dbtable 't2', ...)
SELECT * FROM t1 JOIN t2 ON t1.id = t2.id LIMIT 10;
This query results in a full table scan of both t1 and t2 on my JDBC source, which isn't great, but is understandable given how I've defined the tables. An optimized plan would push the join and the limit down to the database and fetch only 10 rows.
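For what it's worth, you can see this behavior in the physical plan: Spark produces two separate JDBC scans and performs the join and limit itself. A minimal sketch (the JDBC URL below is a placeholder, not from my actual setup):

```scala
// Sketch: inspect what Spark actually pushes to the JDBC source.
val t1 = spark.read.format("jdbc")
  .option("url", "jdbc:postgresql://host/db") // hypothetical URL
  .option("dbtable", "t1")
  .load()
val t2 = spark.read.format("jdbc")
  .option("url", "jdbc:postgresql://host/db")
  .option("dbtable", "t2")
  .load()

// The plan shows two separate JDBCRelation scans: the join and the
// limit are executed by Spark, not by the remote database.
t1.join(t2, "id").limit(10).explain()
```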
The pushdown itself would work using the Scala API (I'm not sure what the Spark SQL equivalent is, or whether one exists):
spark.read.jdbc("jdbc:...", "(SELECT * FROM t1 JOIN t2 ON t1.id = t2.id LIMIT 10) AS t", new java.util.Properties)
However, this method seems cumbersome to use for every query I might want to run against my remote JDBC database: it requires embedding the query in a string, and it bypasses Spark SQL entirely. Ideally, I'd like something like defining an entire database via the JDBC source, so that any query referencing only tables from that source could be pushed down in full to the underlying database. Does anyone know a better approach to this problem, or, more generally, a nicer way to integrate Spark SQL with a remote database using some other approach or tool?
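The best workaround I've come up with so far is to wrap the pushdown query in a small helper and register the result as a temp view, so at least downstream queries can stay in Spark SQL. A sketch, where the helper name, URL, and view name are all my own invention:

```scala
import java.util.Properties
import org.apache.spark.sql.{DataFrame, SparkSession}

// Workaround sketch: push a whole query down to the remote database by
// passing it as a parenthesized subquery "table", then expose the result
// to Spark SQL as a temp view. Helper name and URL are hypothetical.
def pushdown(spark: SparkSession, url: String, sql: String): DataFrame =
  spark.read.jdbc(url, s"($sql) AS pushed", new Properties)

val joined = pushdown(spark, "jdbc:postgresql://host/db",
  "SELECT * FROM t1 JOIN t2 ON t1.id = t2.id LIMIT 10")
joined.createOrReplaceTempView("t1_t2_top10")

// Subsequent Spark SQL queries reuse the already-pushed-down result.
spark.sql("SELECT * FROM t1_t2_top10").show()
```

This still means one string-embedded query per pushdown, though, which is exactly the per-query boilerplate I'd like to avoid.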