pyspark + from_json(col("col_name"), schema) returns all null


salemi
Hi All,

I am using pyspark to consume messages from Kafka. I convert each incoming message to JSON and bind it to a column called decoded_data.
--------------------
| decoded_data   |
--------------------
| json message1  |
--------------------
| json message2  |
--------------------

On the stream I run the following query to explode the JSON messages into columns using a StructType schema, as follows:

.select(from_json(col("decoded_data"), schema).alias("table")).select("table.*")

The schema gets applied, but the values in every row are null:

------------------------------------------------------------
| server_info | info_a | info_b | info_c | info_d | info_e |
------------------------------------------------------------
| null        | null   | null   | null   | null   | null   |
------------------------------------------------------------
| null        | null   | null   | null   | null   | null   |
------------------------------------------------------------
| null        | null   | null   | null   | null   | null   |
------------------------------------------------------------


The JSON file looks like the following with more attributes:

{
   "server_info" : null,
   "info_a"       :  "value a",
   "info_b"       :  "(value/b a)",
   "info_c"       :  "10.10.10.10",
   "info_d"       :  null,
   "info_e"       :  10
}

The StructType is:

from pyspark.sql.types import IntegerType, StringType, StructType

schema = StructType() \
     .add("server_info", StringType()) \
     .add("info_a", StringType()) \
     .add("info_b", StringType()) \
     .add("info_c", StringType()) \
     .add("info_d", StringType()) \
     .add("info_e", IntegerType())


Any ideas what might be wrong? How do I debug this?

Thanks,
Ali

Re: pyspark + from_json(col("col_name"), schema) returns all null

salemi
I found the root cause! There was a mismatch between the StructField type and
the JSON message.
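
The thread does not say which field was mismatched, so the following is only a minimal sketch of how such a mismatch could be spotted without Spark. It uses the standard-library json module to compare the parsed values against the types the schema expects. The EXPECTED_TYPES mapping, the find_type_mismatches helper, and the sample message with "info_e" sent as a string are all hypothetical, introduced here for illustration.

```python
import json

# Expected Python type per field, mirroring the StructType from the original post.
# (Hypothetical mapping: StringType -> str, IntegerType -> int.)
EXPECTED_TYPES = {
    "server_info": str,
    "info_a": str,
    "info_b": str,
    "info_c": str,
    "info_d": str,
    "info_e": int,
}


def find_type_mismatches(raw_message, expected=EXPECTED_TYPES):
    """Return {field: (expected_type_name, actual_type_name)} for mismatched fields."""
    record = json.loads(raw_message)
    mismatches = {}
    for field, expected_type in expected.items():
        value = record.get(field)
        if value is None:
            # JSON null or a missing field: treated as nullable, nothing to compare.
            continue
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) and expected_type is not bool:
            matches = False
        else:
            matches = isinstance(value, expected_type)
        if not matches:
            mismatches[field] = (expected_type.__name__, type(value).__name__)
    return mismatches


# Invented sample: "info_e" arrives as a string although the schema says integer.
sample = '{"server_info": null, "info_a": "value a", "info_e": "10"}'
print(find_type_mismatches(sample))  # {'info_e': ('int', 'str')}
```

This check is deliberately strict and is not a model of Spark's own parser, which may coerce some values (for example, reading a number into a StringType field) and whose handling of bad records depends on the Spark version and parse mode.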


Is there a good write-up or wiki out there that describes how to debug Spark
jobs?


Thanks





--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: [hidden email]


Re: pyspark + from_json(col("col_name"), schema) returns all null

Jacek Laskowski
Hi,

Not that I'm aware of, but in your case, checking whether a JSON message fits your schema and the pipeline would have taken only pyspark and a few JSON files on disk, wouldn't it?
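
One way to read the suggestion above is to take Kafka out of the picture: write a few sample messages to disk and apply the same schema in batch mode. The sketch below assumes a local pyspark installation. The helper names write_sample_messages and parse_on_disk, the file name messages.json, and the second sample message (with "info_e" sent as a string) are hypothetical.

```python
import json
import os
import tempfile


def write_sample_messages(directory, messages):
    """Write one JSON document per line, the layout spark.read.text reads row by row."""
    path = os.path.join(directory, "messages.json")
    with open(path, "w", encoding="utf-8") as handle:
        for message in messages:
            handle.write(json.dumps(message) + "\n")
    return path


def parse_on_disk(path):
    """Apply the schema from the original post to each line in batch mode."""
    # Imported lazily so the file-writing helper works without a Spark installation.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, from_json
    from pyspark.sql.types import IntegerType, StringType, StructType

    schema = (
        StructType()
        .add("server_info", StringType())
        .add("info_a", StringType())
        .add("info_b", StringType())
        .add("info_c", StringType())
        .add("info_d", StringType())
        .add("info_e", IntegerType())
    )
    spark = (
        SparkSession.builder.master("local[1]")
        .appName("from_json_debug")
        .getOrCreate()
    )
    # spark.read.text yields a single string column named "value".
    raw = spark.read.text(path).withColumnRenamed("value", "decoded_data")
    # Keep the raw string next to the parsed columns to see which messages go null.
    return raw.select(
        "decoded_data",
        from_json(col("decoded_data"), schema).alias("table"),
    ).select("decoded_data", "table.*")


sample_dir = tempfile.mkdtemp()
sample_path = write_sample_messages(
    sample_dir,
    [
        {"server_info": None, "info_a": "value a", "info_b": "(value/b a)",
         "info_c": "10.10.10.10", "info_d": None, "info_e": 10},
        # Invented mismatch: a string where the schema expects an integer.
        {"server_info": None, "info_a": "value a", "info_b": "(value/b a)",
         "info_c": "10.10.10.10", "info_d": None, "info_e": "ten"},
    ],
)
# With pyspark installed: parse_on_disk(sample_path).show(truncate=False)
```

Depending on the Spark version and parse mode, a mismatched message may come back with only the offending field null or with the whole row null, so comparing the parsed columns against the raw decoded_data string is what narrows down the culprit.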

Regards,
Jacek Laskowski
----
Spark Structured Streaming https://bit.ly/spark-structured-streaming
Mastering Apache Spark 2 https://bit.ly/mastering-apache-spark

On Mon, Dec 11, 2017 at 12:49 AM, salemi <[hidden email]> wrote:
I found the root cause! There was a mismatch between the StructField type and
the JSON message.

Is there a good write-up or wiki out there that describes how to debug Spark
jobs?

Thanks




