[Spark SQL] [pyspark.sql]: Potential bug in toDF using nested structures

I am trying to create a DataFrame from a Python dictionary using toDF, and some of the nested fields come back as None on collect(). I have posted a sample, with its output, here: https://gist.github.com/sachdevm/04c27ec91adbe2fdbe5969f4af723642

The sample contains two snippets: one that exhibits the issue and one that works correctly. My suspicion is that when nested dictionary objects are parsed in the Row class, the data type for all values is incorrectly set to that of the first key encountered ("duration" in the first snippet), and any value that fails conversion to that type is set to None. In the second snippet in the gist, all values in the nested dictionary are strings, and all data is preserved correctly.
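To make the suspected condition easier to check without opening the gist, here is a minimal plain-Python sketch (no Spark needed). It flags nested dictionaries whose values have more than one Python type, which is the case I believe triggers the problem, and converts nested values to strings to mirror the second snippet. The helper names mixed_type_paths and stringify_nested are hypothetical, as is the sample record; only the "duration" key comes from my example. As far as I understand it, schema inference treats a nested dict as a map with a single value type, but that is my assumption, not something I have confirmed in the source.

```python
def mixed_type_paths(record, prefix=""):
    """List dotted paths of nested dicts whose non-None values differ in type."""
    paths = []
    for key, value in record.items():
        if not isinstance(value, dict):
            continue
        path = "{}.{}".format(prefix, key) if prefix else str(key)
        # Collect the Python types of all non-None values in this nested dict.
        value_types = {type(v) for v in value.values() if v is not None}
        if len(value_types) > 1:
            paths.append(path)
        # Recurse so that deeper levels of nesting are checked too.
        paths.extend(mixed_type_paths(value, path))
    return paths


def stringify_nested(record):
    """Return a copy in which every leaf value inside a nested dict is a string."""
    def convert(value):
        if isinstance(value, dict):
            return {k: convert(v) for k, v in value.items()}
        return value if value is None else str(value)

    # Top-level scalar fields are left untouched; only nested dicts are converted.
    return {k: convert(v) if isinstance(v, dict) else v for k, v in record.items()}


# Hypothetical record: only the "duration" key is taken from the post.
record = {"id": 1, "details": {"duration": 12.5, "label": "intro"}}

print(mixed_type_paths(record))   # nested dicts that may lose data on toDF()
print(stringify_nested(record))   # all-string nested values, as in the working snippet
```

Converting everything to strings is only a workaround that matches the working snippet. Declaring an explicit schema for the nested fields would presumably be the cleaner fix, but I have not tested that.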

I am using version 2.1.0: 
>>> print pyspark.__version__

Please let me know if I am missing something or if there is an issue in the code sample itself.