Fwd: Spark API and immutability

Fwd: Spark API and immutability

Chris Thomas

The cache() method on the DataFrame API caught me out.

Having learnt that DataFrames are built on RDDs and that RDDs are immutable, when I saw the statement df.cache() in our codebase I thought, ‘This must be a bug: the result is not assigned, so the statement will have no effect.’

However, I’ve since learnt that the cache method actually mutates the DataFrame object*. The statement was valid after all. 

I understand that the underlying user data is immutable, but doesn’t mutating the DataFrame object make the API a little inconsistent and harder to reason about?
 
Regards 

Chris


* (as do the persist and rdd.setName methods; I expect there are others)
Re: Spark API and immutability

Holden Karau
So even on RDDs, cache/persist mutate the RDD object. The important thing for Spark is that the data represented by the RDD/DataFrame isn’t mutated.
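
To make the distinction concrete, here is a minimal toy model in plain Python. It is only a sketch and not Spark's implementation: ToyDataset, set_name, load and the calls counter are all hypothetical names invented for illustration. It assumes a handle that keeps a fixed recipe for its data plus some mutable bookkeeping (a cache flag and a name). Transformations return a new handle, while cache() flips a flag on the existing handle and returns that same object, which is why an unassigned call still takes effect.

```python
from typing import Callable, Optional, Tuple


class ToyDataset:
    """Hypothetical stand-in for an RDD/DataFrame handle (not a Spark class)."""

    def __init__(self, compute: Callable[[], Tuple[int, ...]]) -> None:
        self._compute = compute            # recipe for the data; never reassigned
        self._cache_requested = False      # mutable bookkeeping on the handle
        self._cached: Optional[Tuple[int, ...]] = None
        self.name: Optional[str] = None    # mutable label, like a display name

    def map(self, f: Callable[[int], int]) -> "ToyDataset":
        # Transformation: builds a NEW handle and leaves this one untouched.
        parent = self
        return ToyDataset(lambda: tuple(f(x) for x in parent.collect()))

    def cache(self) -> "ToyDataset":
        # Mutates the handle's bookkeeping and returns the same object.
        self._cache_requested = True
        return self

    def set_name(self, name: str) -> "ToyDataset":
        # Also mutates the handle in place and returns it.
        self.name = name
        return self

    def collect(self) -> Tuple[int, ...]:
        # Reuse stored rows if present; otherwise run the recipe.
        if self._cached is not None:
            return self._cached
        rows = self._compute()
        if self._cache_requested:
            self._cached = rows
        return rows


calls = {"n": 0}  # counts how often the source data is actually computed


def load() -> Tuple[int, ...]:
    calls["n"] += 1
    return (1, 2, 3)


ds = ToyDataset(load)
ds.cache()                          # result not assigned, yet the flag is set
first = ds.collect()                # runs load() once and stores the rows
second = ds.collect()               # served from the stored rows
doubled = ds.map(lambda x: x * 2)   # a new handle; ds is unchanged
```

In this sketch the rows behind ds never change. Only the handle's bookkeeping does, and anything that looks like a change to the data goes through a new handle.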

On Mon, May 25, 2020 at 10:56 AM Chris Thomas <[hidden email]> wrote:

--
Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9