How to use disk instead of just InMemoryRelation when using a JDBC data source in Spark SQL?


Louis Hust
We want to extract data from MySQL and do the calculations in Spark SQL.
The explain plan of the SQL looks like this:


REGIONKEY#177,N_COMMENT#178] PushedFilters: [], ReadSchema: struct<N_NATIONKEY:int,N_NAME:string,N_REGIONKEY:int,N_COMMENT:string>
                  +- *(20) Sort [r_regionkey#203 ASC NULLS FIRST], false, 0
                     +- Exchange(coordinator id: 266374831) hashpartitioning(r_regionkey#203, 200), coordinator[target post-shuffle partition size: 67108864]
                        +- *(19) Project [R_REGIONKEY#203]
                           +- *(19) Filter ((isnotnull(r_name#204) && (r_name#204 = AFRICA)) && isnotnull(r_regionkey#203))
                              +- InMemoryTableScan [R_REGIONKEY#203, r_name#204], [isnotnull(r_name#204), (r_name#204 = AFRICA), isnotnull(r_regionkey#203)]
                                    +- InMemoryRelation [R_REGIONKEY#203, R_NAME#204, R_COMMENT#205], true, 10000, StorageLevel(disk, memory, 1 replicas)
                                          +- *(1) Scan JDBCRelation(region) [numPartitions=1] [R_REGIONKEY#203,R_NAME#204,R_COMMENT#205] PushedFilters: [], ReadSchema: struct<R_REGIONKEY:int,R_NAME:string,R_COMMENT:string>


As you can see, every JDBCRelation is converted to an InMemoryRelation. Because the JDBC table is so big that all of its data cannot fit in memory, an OOM occurs.
Is there an option to make Spark SQL use disk when memory is not enough?
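
For context, a minimal sketch of the kind of setup that produces a plan like the one above; the connection options are hypothetical placeholders, not taken from this thread:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("tpch-on-jdbc") // hypothetical app name
      .getOrCreate()

    val region = spark.read
      .format("jdbc")
      .option("url", "jdbc:mysql://dbhost:3306/tpch") // hypothetical URL
      .option("dbtable", "region")
      .option("user", "user")                         // hypothetical
      .option("password", "secret")                   // hypothetical
      .load()

    // Registering and caching the table is what turns the
    // JDBCRelation into an InMemoryRelation in the plan above.
    region.createOrReplaceTempView("region")
    spark.catalog.cacheTable("region")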
Re: How to use disk instead of just InMemoryRelation when using a JDBC data source in Spark SQL?

Takeshi Yamamuro
You want to use `Dataset.persist(StorageLevel.MEMORY_AND_DISK)`?
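
A minimal sketch of what that could look like, assuming a JDBC-backed DataFrame named `region` like the one in the plan above:

    import org.apache.spark.storage.StorageLevel

    // MEMORY_AND_DISK keeps as many cached partitions in memory as
    // fit and spills the rest to local disk instead of failing OOM.
    val cached = region.persist(StorageLevel.MEMORY_AND_DISK)
    cached.createOrReplaceTempView("region")

    // If memory pressure is still a problem, DISK_ONLY bypasses the
    // in-memory cache entirely:
    //   region.persist(StorageLevel.DISK_ONLY)

    // Release the cache once the queries are done:
    cached.unpersist()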

--
Takeshi Yamamuro