apache-spark mongodb dataframe issue


apache-spark mongodb dataframe issue

Mannat Singh

Re: apache-spark mongodb dataframe issue

Jeff Evans
As far as I know, there is generally no way to distinguish explicit null values from missing ones. (Someone please correct me if I'm wrong; I would love to be able to do this for my own reasons.) If you really must do it, and don't care about performance at all (it will be horrible), read each object as a separate batch while inferring the schema. If the inferred schema contains the column but the value is null, you know it was explicitly set to null. If the schema doesn't contain the column, you know the field was missing.
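
To make the per-object idea concrete, here is a minimal sketch in plain Python. It deliberately uses only the standard `json` module rather than Spark or the MongoDB connector, so it illustrates the logic and not the actual read path. The sample documents, the `age` field, and the `classify_field` helper are all hypothetical. The point is that each document's own set of keys plays the role of its individually inferred schema, whereas a schema inferred over the whole batch would contain `age` for every row and fill the gaps with null.

```python
import json


def classify_field(raw_document: str, field: str) -> str:
    """Classify one field of one JSON document (hypothetical helper).

    The document's own keys stand in for a schema inferred from that
    single document, as opposed to one merged across the whole batch.
    """
    doc = json.loads(raw_document)
    if field not in doc:
        # The per-document "schema" has no such column at all.
        return "missing"
    if doc[field] is None:
        # The column exists in this document, with an explicit null.
        return "explicit null"
    return "present"


# Made-up documents standing in for MongoDB records.
documents = [
    '{"_id": 1, "name": "a", "age": 30}',
    '{"_id": 2, "name": "b", "age": null}',
    '{"_id": 3, "name": "c"}',
]

results = [classify_field(d, "age") for d in documents]
print(results)  # ['present', 'explicit null', 'missing']
```

Doing the equivalent in Spark means one read and one schema inference per document, which is exactly why it performs so badly on a large collection.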

On Tue, Jun 23, 2020 at 7:34 AM Harmanat Singh <[hidden email]> wrote:

Re: apache-spark mongodb dataframe issue

Mannat Singh
Hi Jeff
Thanks for confirming the same.

I had also thought about reading every MongoDB document separately along
with its schema, and then comparing that schema against the schemas of all
the documents in the collection. For our huge database this is a horrible
approach, as you have already mentioned.

I am doing R&D on another approach and will post here if there is a
breakthrough.



