Hi,
I am reading a MongoDB collection into Spark with Scala.
In general it works.
import com.mongodb.spark._
val rdd = MongoSpark.load(sc)
val inventory = rdd.toDF

scala> inventory.printSchema
root
 |-- Audience: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- _id: struct (nullable = true)
 |    |-- oid: string (nullable = true)
 |-- column1: string (nullable = true)
 |-- column2: string (nullable = true)
 |-- column3: string (nullable = true)
 |-- item: string (nullable = true)
 |-- qty: double (nullable = true)
 |-- size: struct (nullable = true)
 |    |-- h: double (nullable = true)
 |    |-- w: double (nullable = true)
 |    |-- uom: string (nullable = true)
 |-- status: string (nullable = true)
 |-- tags: array (nullable = true)
 |    |-- element: string (containsNull = true)
A typical document in the collection looks like this:
{ "_id" : ObjectId("5b8b9f77b5e7ecfa90c3825a"), "item" : "second new item", "qty" : 2.0, "status" : "A", "size" : { "h" : 20.0, "w" : 51.0, "uom" : "cm" }, "tags" : [ "green", "red" ], "Audience" : [ "Rich", "Powerful" ], "column1" : "final", "column2" : "new", "column3" : "something" }
Regardless of whether a DataFrame is practical for unstructured data, I want to do simple filtering on the data.
I am trying to filter on the column tags, which is an array:
scala> inventory.filter(col("tags").contains("red")).show
org.apache.spark.sql.AnalysisException: cannot resolve 'contains(`tags`, 'red')' due to data type mismatch:
argument 1 requires string type, however, '`tags`' is of array<string> type.;;
'Filter Contains(tags#637, red)
I can try to explode the array like below
val tags = inventory.select(explode(inventory("tags"))).toDF
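For example, a minimal sketch of that workaround (assuming the same inventory DataFrame; keeping the item column alongside the exploded element is just for illustration):

import org.apache.spark.sql.functions.{col, explode}

// explode turns each element of the tags array into its own row,
// so a plain string comparison can be applied afterwards
val exploded = inventory.select(col("item"), explode(col("tags")).as("tag"))
exploded.filter(col("tag") === "red").show()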
However, is there a way of filtering on an element of an array in this case?
Thanks
Hi,
This is a way to solve it using the array_contains(...) function:
scala> inventory.where (array_contains(inventory("tags"),"blue")).show
+--------+--------------------+-------+-------+-------+--------+----+-----------------+------+------+
|Audience|                 _id|column1|column2|column3|    item| qty|             size|status|  tags|
+--------+--------------------+-------+-------+-------+--------+----+-----------------+------+------+
|    null|[5b86585f1ef86d8b...|   null|   null|   null|postcard|45.0|[10.0, 15.25, cm]|     A|[blue]|
+--------+--------------------+-------+-------+-------+--------+----+-----------------+------+------+
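For the original "red" example, the same function should work directly in a filter; a minimal sketch against the same inventory DataFrame, with the equivalent Spark SQL form as well:

import org.apache.spark.sql.functions.{array_contains, col}

// DataFrame API: keep rows whose tags array contains the literal "red"
inventory.filter(array_contains(col("tags"), "red")).show()

// Equivalent SQL form, after registering the DataFrame as a temporary view
inventory.createOrReplaceTempView("inventory")
spark.sql("SELECT * FROM inventory WHERE array_contains(tags, 'red')").show()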