How to convert Array of Json rows into Dataset of specific columns in Spark 2.2.0?

kant kodali

I have a Dataset<String> ds that consists of JSON rows.

Sample JSON row (this is just an example of one row in the dataset):

[
    {"name": "foo", "address": {"state": "CA", "country": "USA"}, "docs":[{"subject": "english", "year": 2016}]},
    {"name": "bar", "address": {"state": "OH", "country": "USA"}, "docs":[{"subject": "math", "year": 2017}]}
]

ds.printSchema()

root
 |-- value: string (nullable = true)

Now I want to convert it into the following Dataset using Spark 2.2.0:

name  |             address               |  docs 
----------------------------------------------------------------------------------
"foo" | {"state": "CA", "country": "USA"} | [{"subject": "english", "year": 2016}]
"bar" | {"state": "OH", "country": "USA"} | [{"subject": "math", "year": 2017}]

Preferably in Java, but Scala is also fine as long as the functions are available in the Java API.


Re: How to convert Array of Json rows into Dataset of specific columns in Spark 2.2.0?

matteuan
Hello,
I think you should use from_json from spark.sql.functions to parse the JSON string and convert it to a StructType. Afterwards, you can create a new Dataset by selecting the columns you want.
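
A minimal Java sketch of that approach (untested; the schema below is my guess at one matching the sample rows, and it assumes the Spark 2.2 from_json overload that accepts a DataType such as an ArrayType):

import static org.apache.spark.sql.functions.*;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

// Schema for one element of the JSON array (guessed from the sample data).
StructType docSchema = new StructType()
        .add("subject", DataTypes.StringType)
        .add("year", DataTypes.IntegerType);
StructType rowSchema = new StructType()
        .add("name", DataTypes.StringType)
        .add("address", new StructType()
                .add("state", DataTypes.StringType)
                .add("country", DataTypes.StringType))
        .add("docs", DataTypes.createArrayType(docSchema));

// Parse the string column into an array of structs, flatten it with
// explode, then select the wanted fields.
Dataset<Row> result = ds
        .select(explode(from_json(col("value"), DataTypes.createArrayType(rowSchema))).as("row"))
        .select("row.name", "row.address", "row.docs");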




Re: How to convert Array of Json rows into Dataset of specific columns in Spark 2.2.0?

Jules Damji
In reply to this post by kant kodali
You might find these blogs helpful to parse & extract data from complex structures:



Cheers 
Jules


Sent from my iPhone
Pardon the dumb thumb typos :)



Re: How to convert Array of Json rows into Dataset of specific columns in Spark 2.2.0?

kant kodali
I have the following so far:

// Needed imports:
// import static org.apache.spark.sql.functions.*;
// import static org.apache.spark.sql.types.DataTypes.StringType;
// import org.apache.spark.sql.types.ArrayType;
// import org.apache.spark.sql.types.StructType;

private StructType getSchema() {
    return new StructType()
            .add("name", StringType)
            .add("address", StringType)
            .add("docs", StringType);
}

ds.select(explode_outer(from_json(ds.col("value"), ArrayType.apply(getSchema())))
        .as("result"))
  .selectExpr("result.*");
This didn't quite work for me, so to clarify: my input string is a JSON array of documents, and I want to keep the values of the name, address, and docs columns as strings as well, except that the input array string is flattened out by the explode function.
Any suggestions would be great.
Thanks!
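
A possible reason this returns nulls, plus a hedged workaround (my reading, not confirmed in this thread): from_json needs the declared schema to match the JSON structure, so a nested object cannot be parsed directly into a StringType field. One option is to parse with a fully nested schema (like rowSchema in the earlier sketch) and then serialize the nested columns back to strings with to_json. Caveat: to_json on a struct column works in Spark 2.2, but to_json on an array column may require a newer release, so the docs line below is an assumption:

// Sketch only, not verified; rowSchema is the fully nested schema sketched earlier.
ds.select(explode_outer(from_json(ds.col("value"), DataTypes.createArrayType(rowSchema))).as("result"))
  .select(col("result.name"),
          to_json(col("result.address")).as("address"), // struct -> JSON string (Spark 2.1+)
          to_json(col("result.docs")).as("docs"));      // assumption: array support may need a newer Spark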





RE: How to convert Array of Json rows into Dataset of specific columns in Spark 2.2.0?

Jean Georges Perrin
In reply to this post by kant kodali

Something along the lines of:

Dataset<Row> df = spark.read().json(jsonDf); ?
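
Spelled out, that would be something like the sketch below (untested; it relies on the Spark 2.2 DataFrameReader.json(Dataset<String>) overload, which parses each string element and, as far as I know, turns a top-level JSON array into one row per element):

// ds is the original Dataset<String> holding the JSON text.
Dataset<Row> df = spark.read().json(ds);
df.printSchema();  // schema is inferred from the data
df.show(false);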




RE: How to convert Array of Json rows into Dataset of specific columns in Spark 2.2.0?

Hien Luu
Hi Kant,

I am not sure whether you have come up with a solution yet, but the following works for me (in Scala):

val emp_info = """
  [
    {"name": "foo", "address": {"state": "CA", "country": "USA"},
"docs":[{"subject": "english", "year": 2016}]},
    {"name": "bar", "address": {"state": "OH", "country": "USA"},
"docs":[{"subject": "math", "year": 2017}]}
  ]"""

import org.apache.spark.sql.types._

val addressSchema = new StructType().add("state", StringType).add("country",
StringType)
val docsSchema = ArrayType(new StructType().add("subject",
StringType).add("year", IntegerType))
val employeeSchema = new StructType().add("name", StringType).add("address",
addressSchema).add("docs", docsSchema)

val empInfoSchema = ArrayType(employeeSchema)

empInfoSchema.json

val empInfoStrDF = Seq((emp_info)).toDF("emp_info_str")
empInfoStrDF.printSchema
empInfoStrDF.show(false)

val empInfoDF = empInfoStrDF.select(from_json('emp_info_str,
empInfoSchema).as("emp_info"))
empInfoDF.printSchema

empInfoDF.select(struct("*")).show(false)

empInfoDF.select("emp_info.name", "emp_info.address",
"emp_info.docs").show(false)

empInfoDF.select(explode('emp_info.getItem("name"))).show






Re: How to convert Array of Json rows into Dataset of specific columns in Spark 2.2.0?

Jules Damji
Here are a couple of tutorials that show how to extract structured nested data:


https://databricks.com/blog/2017/06/13/five-spark-sql-utility-functions-extract-explore-complex-data-types.html

Sent from my iPhone
Pardon the dumb thumb typos :)



Re: How to convert Array of Json rows into Dataset of specific columns in Spark 2.2.0?

Selvam Raman
I just followed Hien Luu's approach:

val empExplode = empInfoStrDF.select(
  explode(from_json('emp_info_str, empInfoSchema)).as("emp_info_withexplode"))

empExplode.show(false)
+-------------------------------------------+
|emp_info_withexplode                       |
+-------------------------------------------+
|[foo,[CA,USA],WrappedArray([english,2016])]|
|[bar,[OH,USA],WrappedArray([math,2017])]   |
+-------------------------------------------+

empExplode.select($"emp_info_withexplode.name").show(false)

+----+
|name|
+----+
|foo |
|bar |
+----+

empExplode.select($"emp_info_withexplode.address.state").show(false)
+-----+
|state|
+-----+
|CA   |
|OH   |
+-----+

empExplode.select($"emp_info_withexplode.docs.subject").show(false)
+---------+
|subject  |
+---------+
|[english]|
|[math]   |
+---------+

@kant kodali, is that helpful for you? If not, please let me know what changes you are expecting.







--
Selvam Raman
"லஞ்சம் தவிர்த்து நெஞ்சம் நிமிர்த்து"