[SPARK-SQL] Reading JSON column as a DataFrame and keeping partitioning information

Daniel Mateus Pires
I've been trying to figure this one out for some time now. I have JSON documents representing Products that arrive (physically) partitioned by Brand, and I would like to create a DataFrame from the JSON while also keeping the partitioning information (Brand).

```
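// Already in scope in spark-shell; needed elsewhere for .as[String] below
import spark.implicits._

case class Product(brand: String, value: String)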
val df = spark.createDataFrame(Seq(Product("something", """{"a": "b", "c": "d"}""")))
df.write.partitionBy("brand").mode("overwrite").json("/tmp/products5/")
val df2 = spark.read.json("/tmp/products5/")

df2.show
/*
+--------------------+---------+
|               value|    brand|
+--------------------+---------+
|{"a": "b", "c": "d"}|something|
+--------------------+---------+
*/


// This is simple and effective, but it drops the brand column!
spark.read.json(df2.select("value").as[String]).show
/*
+---+---+
|  a|  c|
+---+---+
|  b|  d|
+---+---+
*/
```

Ideally I'd like something similar to spark.read.json that would keep the partitioning values and merge them with the rest of the DataFrame.

End result I would like:
```
/*
+---+---+---------+
|  a|  c|    brand|
+---+---+---------+
|  b|  d|something|
+---+---+---------+
*/
```
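One possible way to get this shape, offered as a minimal sketch rather than a definitive answer: infer the schema from the JSON strings with `spark.read.json` as above, then parse the column in place with `from_json` so that `brand` stays on the same row. The local session setup and the in-memory stand-in for `df2` are assumptions made to keep the sketch self-contained; in spark-shell, where `spark` already exists, the session lines can be skipped and `df2` can be read from `/tmp/products5/` as before. Schema inference costs an extra pass over the data.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, from_json}

// Assumption: a local session for illustration only.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("json-column-keep-partition")
  .getOrCreate()
import spark.implicits._

// Stand-in for spark.read.json("/tmp/products5/"): a JSON string column
// plus the partition column.
val df2 = Seq(("""{"a": "b", "c": "d"}""", "something")).toDF("value", "brand")

// Infer the schema of the JSON strings (this scans the value column once).
val jsonSchema = spark.read.json(df2.select("value").as[String]).schema

// Parse the strings into a struct, then flatten the struct's fields
// next to the partition column.
val result = df2
  .withColumn("parsed", from_json(col("value"), jsonSchema))
  .select(col("parsed.*"), col("brand"))

result.show()
```

Because the parsing happens per row instead of through a separate read, no join is needed to bring `brand` back. If the schema is known up front, it could be passed to `from_json` directly, which would avoid the inference pass.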

Best regards,
Daniel Mateus Pires