Spark SQL incorrect result on GROUP BY query

Spark SQL incorrect result on GROUP BY query

Pei-Lun Lee
Hi,

I am using Spark 1.0.0 and found that in Spark SQL some queries that use GROUP BY give weird results.
To reproduce, type the following commands in a spark-shell connected to a standalone server:

case class Foo(k: String, v: Int)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._
val rows = List.fill(100)(Foo("a", 1)) ++ List.fill(200)(Foo("b", 2)) ++ List.fill(300)(Foo("c", 3))
sc.makeRDD(rows).registerAsTable("foo")
sql("select k,count(*) from foo group by k").collect

the result will be something random like:
res1: Array[org.apache.spark.sql.Row] = Array([b,180], [3,18], [a,75], [c,270], [4,56], [1,1])

and if I run the same query again, the new result will be correct:
sql("select k,count(*) from foo group by k").collect
res2: Array[org.apache.spark.sql.Row] = Array([b,200], [a,100], [c,300])
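As a sanity check, the expected counts can be computed without Spark at all, using plain Scala collections (a minimal sketch, same data as above):

```scala
// Same data as the repro above, aggregated with plain Scala collections.
case class Foo(k: String, v: Int)

val rows = List.fill(100)(Foo("a", 1)) ++
           List.fill(200)(Foo("b", 2)) ++
           List.fill(300)(Foo("c", 3))

// Group on the key and take each group's size -- the counts the
// SQL query should return.
val counts = rows.groupBy(_.k).mapValues(_.size).toMap

println(counts.toList.sorted)  // List((a,100), (b,200), (c,300))
```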

Should I file a bug?

--
Pei-Lun Lee

Re: Spark SQL incorrect result on GROUP BY query

alise
I have met the same problem. As a workaround, you can try changing

case class Foo(k: String, v: Int)

to

case class Foo(v: Int, k: String)

then run the same commands as before:
...
sql("select k,count(*) from foo group by k").collect
...
and the result will be right.
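Spelled out, the workaround session would be the same commands as the original post, only with the two fields of Foo swapped (an untested sketch against a 1.0.0 spark-shell):

```scala
// Same repro as the original post, with the case class fields reordered.
case class Foo(v: Int, k: String)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._
val rows = List.fill(100)(Foo(1, "a")) ++ List.fill(200)(Foo(2, "b")) ++ List.fill(300)(Foo(3, "c"))
sc.makeRDD(rows).registerAsTable("foo")
sql("select k,count(*) from foo group by k").collect
```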

RE: Spark SQL incorrect result on GROUP BY query

haocheng
In reply to this post by Pei-Lun Lee

That's a good catch, but I think it's currently suggested to use HiveContext instead. (https://github.com/apache/spark/tree/master/sql)
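In a regular spark-shell (rather than sbt/sbt hive/console), the equivalent would be roughly the following sketch, assuming a Spark 1.0.x build with Hive support compiled in:

```scala
// Rough spark-shell equivalent of the repro using HiveContext
// (Spark 1.0.x API; assumes Spark was built with Hive support).
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
import hiveContext._
case class Foo(k: String, v: Int)
val rows = List.fill(100)(Foo("a", 1)) ++ List.fill(200)(Foo("b", 2)) ++ List.fill(300)(Foo("c", 3))
sc.makeRDD(rows).registerAsTable("foo")
hql("select k,count(*) from foo group by k").collect
```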

 

Catalyst$> sbt/sbt hive/console

case class Foo(k: String, v: Int)
val rows = List.fill(100)(Foo("a", 1)) ++ List.fill(200)(Foo("b", 2)) ++ List.fill(300)(Foo("c", 3))
sparkContext.makeRDD(rows).registerAsTable("foo")
sql("select k,count(*) from foo group by k").collect

res1: Array[org.apache.spark.sql.Row] = Array([b,200], [a,100], [c,300])

 

Cheng Hao


Re: Spark SQL incorrect result on GROUP BY query

Michael Armbrust
In reply to this post by Pei-Lun Lee
I'd try rerunning with master.  It is likely you are running into SPARK-1994.

Michael



Re: Spark SQL incorrect result on GROUP BY query

Pei-Lun Lee
I reran with master and it looks like it is fixed.




Re: Spark SQL incorrect result on GROUP BY query

Michael Armbrust
Thanks for verifying!

