Dataset API inconsistencies


Alex Nastetsky
I am finding the Dataset API very cumbersome to use, which is unfortunate, as I was looking forward to its type-safety after coming from a DataFrame codebase.


The problem is having to continuously switch back and forth between typed and untyped semantics, which really kills productivity. In contrast, the RDD API is consistently typed and the DataFrame API is consistently untyped, so I don't have to stop and think about which semantics apply to each operation.
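To make the friction concrete, here is a minimal sketch of what I mean (the `Person` and `AgeCount` records and the data are made up for illustration; assumes a local SparkSession):

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

// Hypothetical records, just for this sketch.
case class Person(name: String, age: Long)
case class AgeCount(age: Long, count: Long)

object TypedUntypedDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.master("local[*]").appName("demo").getOrCreate()
    import spark.implicits._

    val people: Dataset[Person] = Seq(Person("Ann", 30), Person("Bob", 17)).toDS()

    // Typed: a lambda over Person, checked by the Scala compiler.
    val adults: Dataset[Person] = people.filter(_.age >= 18)

    // Untyped: groupBy/count drop down to Column expressions and return a
    // DataFrame, so the compiler no longer knows the row type.
    val counts: DataFrame = adults.groupBy($"age").count()

    // Back to typed: re-attach a schema with as[...] to regain type-safety.
    val typedCounts: Dataset[AgeCount] = counts.as[AgeCount]

    typedCounts.show()
    spark.stop()
  }
}
```

Every groupBy/agg/join in the pipeline forces that same round-trip out of and back into the typed world.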

I gave the Frameless framework (mentioned in the link) a shot, but eventually started running into oddities, sparse documentation, and limited community support, and I did not want to sink too much time into it.

At this point I'm considering just sticking with DataFrames, as I don't really consider Datasets to be usable. Has anyone had a similar experience, or had better luck?

Alex.

Re: Dataset API inconsistencies

Michael Armbrust
I wrote Datasets, and I'll say I only use them when I really need to (i.e. when it would be very cumbersome to express what I am trying to do relationally). Dataset operations are almost always going to be slower than their DataFrame equivalents, since they usually require materializing objects, whereas DataFrame operations usually generate code that operates directly on binary-encoded data.
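As a rough illustration of that difference (the `Person` record and data are made up; a sketch, not a benchmark):

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical record used only for this sketch.
case class Person(name: String, age: Long)

object MaterializationDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.master("local[*]").appName("materialize").getOrCreate()
    import spark.implicits._

    val people = Seq(Person("Ann", 30), Person("Bob", 17)).toDS()

    // Typed: each row is deserialized into a Person object, the lambda runs
    // on that JVM object, and the result is re-encoded.
    val typedBump = people.map(p => p.copy(age = p.age + 1))

    // Untyped equivalent: a Column expression the optimizer understands and
    // can code-generate against the binary row format, with no objects built.
    val untypedBump = people.withColumn("age", $"age" + 1)

    typedBump.show()
    untypedBump.show()
    spark.stop()
  }
}
```

The lambda in the typed version is opaque to the optimizer; the Column expression is not.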

We certainly could flesh out the API further (e.g. add an orderBy that takes a lambda function), but so far I have not seen a lot of demand for this, and it would be strictly slower than the DataFrame version. I worry this wouldn't actually benefit users, as it would give them a choice that looks the same but has non-obvious performance implications. If I'm in the minority with this opinion, though, we should do it.
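Concretely (hypothetical `Person` record and data; the typed variant in the comment is not in the current API, it is just the shape such an addition would take):

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical record used only for this sketch.
case class Person(name: String, age: Long)

object OrderByDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.master("local[*]").appName("orderBy").getOrCreate()
    import spark.implicits._

    val people = Seq(Person("Ann", 30), Person("Bob", 17)).toDS()

    // What exists today: sorting a Dataset means dropping to Column expressions,
    // though the result is still a typed Dataset[Person].
    val sorted = people.orderBy($"age".desc)

    // A typed variant -- NOT in the current API, shown only for shape --
    // might look like:
    //   people.orderBy((p: Person) => p.age)
    // and would have to materialize a Person per row just to extract the key.

    sorted.show()
    spark.stop()
  }
}
```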

Regarding the concerns about type-safety, I haven't really found that to be a major issue. Even though you don't get type safety from the Scala compiler, the Spark SQL analyzer checks your query before any execution begins. This opinion is perhaps biased by the fact that I do a lot of Spark SQL programming in notebooks, where the difference between "compile-time" and "runtime" is pretty minimal.
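As a small illustration of that analysis-time checking (made-up data, local SparkSession): a typo in a column name fails as soon as the expression is analyzed, before any job runs:

```scala
import org.apache.spark.sql.{AnalysisException, SparkSession}

object AnalyzerDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.master("local[*]").appName("analyzer").getOrCreate()
    import spark.implicits._

    val df = Seq(("Ann", 30), ("Bob", 17)).toDF("name", "age")

    // Datasets are analyzed eagerly, so the bad column name is rejected here,
    // at select(), not later when an action like collect() or show() runs.
    try {
      df.select($"nmae")
    } catch {
      case e: AnalysisException =>
        println(s"Caught at analysis time: ${e.getMessage}")
    }
    spark.stop()
  }
}
```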

On Wed, Jan 10, 2018 at 1:45 AM, Alex Nastetsky <[hidden email]> wrote: