Some problems about Shark on Spark

How can I set MEMORY_ONLY_SER, spark.kryoserializer.buffer.mb, spark.default.parallelism and spark.worker.timeout
when I run a Shark query?
Can I set other params in or hive-site.xml instead?
Or set name=value in the Shark CLI?
I have a Shark test query:
table a is 38 bytes; table b is 23 bytes.
sql: select a.* , b.* from a join b on = ;
It builds three stages:
stage1 has two tasks:
task1: rdd.HadoopRDD : input split table a 0+19 ;
task2: rdd.HadoopRDD : input split table a 19+19;
stage2 has two tasks: 
task1: rdd.HadoopRDD : input split table b 0+11 ;
task2: rdd.HadoopRDD : input split table b 11+12;
stage3 has one task:
task1: just fetches the map outputs for the shuffle and writes to an HDFS path.
Why are two tasks built to read each table when the tables are so small?
How can I control the number of reduce tasks in Shark? It seems to be computed from the partition count of the largest parent RDD?
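For reference, the split offsets listed above (0+19 and 19+19 for table a, 0+11 and 11+12 for table b) are consistent with the split arithmetic in Hadoop's old-API FileInputFormat when it is asked for 2 splits. The sketch below is a simplified Python re-implementation of that arithmetic, not Shark or Hadoop code. The function name compute_splits, its parameters, the 64 MB block size and the hint of 2 are assumptions made for illustration; that Shark actually passes a hint of 2 here is a guess based on the observed offsets.

```python
# Simplified sketch of the split computation in Hadoop's old-API
# FileInputFormat. Names and defaults here are illustrative assumptions.

SPLIT_SLOP = 1.1  # the last split may be up to 10% larger than the others


def compute_splits(total_size, num_splits_hint,
                   block_size=64 * 1024 * 1024, min_size=1):
    """Return a list of (offset, length) pairs for a single file."""
    # Target size per split, derived from the requested number of splits.
    goal_size = total_size // max(num_splits_hint, 1)
    # The split size is capped by the block size and floored by min_size.
    split_size = max(min_size, min(goal_size, block_size))

    splits = []
    offset = 0
    remaining = total_size
    # Emit full-size splits while clearly more than one split remains.
    while remaining / split_size > SPLIT_SLOP:
        splits.append((offset, split_size))
        offset += split_size
        remaining -= split_size
    # Whatever is left becomes the final split.
    if remaining > 0:
        splits.append((offset, remaining))
    return splits


if __name__ == "__main__":
    print(compute_splits(38, 2))  # table a, 38 bytes
    print(compute_splits(23, 2))  # table b, 23 bytes
```

Under these assumptions, the file size does not matter: a hint of 2 divides even a 38-byte file into two ranges, and a hint of 1 (or a min_size at least as large as the file) would produce a single split.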