[Spark SQL] [pyspark.sql]: Potential bug in toDF using nested structures

msachdev
This post has NOT been accepted by the mailing list yet.
Hi, 

I am trying to create a DataFrame from a Python dictionary and ran into an issue: some of the nested fields come back as None on collect(). I have put a sample, together with its output, here: https://gist.github.com/sachdevm/04c27ec91adbe2fdbe5969f4af723642

The sample contains two snippets: one that exhibits the issue and one that works correctly. My suspicion is that when nested dictionary objects are parsed in the Row class, the data type for all values is incorrectly set to that of the first key encountered ("duration" in the example above), and any value that fails conversion to that type is set to None. In the second snippet in the gist, all values in the nested dictionary are strings and all data is preserved correctly.
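
To make the suspicion concrete, here is a minimal plain-Python sketch of the behavior I think I am seeing. It is a toy model only, not Spark's actual code: the helper names infer_value_type and coerce are made up for illustration, and apart from "duration" the keys and values are hypothetical stand-ins rather than the data from the gist. It also assumes the dict keeps insertion order (Python 3.7+), so "duration" really is the first key encountered.

```python
# Toy model of the suspected behaviour; this is NOT Spark's actual code.

def infer_value_type(nested):
    """Pick one type for the whole dict from its first non-None value."""
    for value in nested.values():
        if value is not None:
            return type(value)
    return type(None)


def coerce(nested):
    """Keep values that match the inferred type; replace the rest with None."""
    expected = infer_value_type(nested)
    return {
        key: (value if isinstance(value, expected) else None)
        for key, value in nested.items()
    }


# Hypothetical records: only the "duration" key comes from the post.
mixed = {"duration": 120, "label": "intro", "codec": "h264"}
all_strings = {"duration": "120", "label": "intro", "codec": "h264"}

mixed_result = coerce(mixed)          # non-int values are dropped to None
strings_result = coerce(all_strings)  # every value survives

print(mixed_result)
print(strings_result)
```

If this model is right, it would explain why only the snippet with mixed value types loses data, while the all-strings snippet does not.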

I am using version 2.1.0: 
>>> print pyspark.__version__
2.1.0


Please let me know if I am missing something or if there is some issue in the code sample itself.

Thanks, 
Manish