Mock spark reads and writes

Mock spark reads and writes

Dark Crusader
Sorry I wasn't very clear in my last email.

I have a function like this:

def main(read_file):
    df = spark.read.csv(read_file)
    # ... some other code ...
    df.write.csv(path)

I need to write a unit test for this function.
Would Python's unittest.mock help me here?

When I googled this, I mostly found advice that we shouldn't mock these reads and writes, but that doesn't solve the problem of how I unit test helper functions or a main method that has to read and write files.

An example of the proper way to do this in Python would be really helpful.

Thanks a lot.

Re: Mock spark reads and writes

ed
Hi,

For testing things like this you have a couple of options. You could isolate all your business logic from your read/write/Spark code, which, in my experience, makes the code harder to write and manage.
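To make that first option concrete, here is a minimal sketch of what the separation could look like. The transform() helper, the column names and the reworked main() signature are all made up for illustration, and the spark argument in the test is assumed to be a local test SparkSession (one possible fixture is sketched further down):

from pyspark.sql import DataFrame
import pyspark.sql.functions as F

def transform(df: DataFrame) -> DataFrame:
    # pure business logic: DataFrame in, DataFrame out, no reads or writes
    return df.withColumn("total", F.col("price") * F.col("quantity"))

def main(spark, read_file, out_path):
    df = spark.read.csv(read_file, header=True, inferSchema=True)
    transform(df).write.csv(out_path)

def test_transform(spark):
    # no files involved, so this test stays fast
    df = spark.createDataFrame([(2.0, 3)], ["price", "quantity"])
    assert transform(df).collect()[0]["total"] == 6.0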

The other option is to accept that the tests will be slower than you would normally expect unit tests to be, and actually allow the reads and writes to happen against a local instance of Spark (one way to set up such a local session for tests is sketched after the list below).

Your tests then become:

- Write the data you need for the test
- spark-submit the test (or something similar)
- Check the results of the test
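
For that setup, a rough sketch of a shared local SparkSession as a pytest fixture might look like this (the fixture name and configuration values are just one reasonable choice):

# conftest.py -- shared local SparkSession for the whole test suite
import pytest
from pyspark.sql import SparkSession

@pytest.fixture(scope="session")
def spark():
    session = (
        SparkSession.builder
        .master("local[2]")                           # small local "cluster", no external deps
        .appName("unit-tests")
        .config("spark.sql.shuffle.partitions", "2")  # keep shuffles cheap for tiny data
        .getOrCreate()
    )
    yield session
    session.stop()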

If you have any truly isolated business logic then you can unit test that as you would do normally, but most Spark jobs are going to call Spark functions, which you either mock out (and if a Spark function is called in a mocked-out forest, does anyone hear it fail?) or allow to run and take the performance hit.
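
To answer the original unittest.mock question directly, the mocking route could look roughly like this. Here my_job is a hypothetical module containing the main() from the original post, with spark and path as module-level names, and the test assumes main() writes the same DataFrame it read:

from unittest.mock import MagicMock, patch

import my_job  # hypothetical module containing the original main()

def test_main_calls_read_and_write():
    fake_spark = MagicMock()
    with patch.object(my_job, "spark", fake_spark), \
         patch.object(my_job, "path", "/tmp/out"):
        my_job.main("in.csv")

    # proves only that the calls were wired up with the right arguments,
    # not that the job produces correct data -- the mocked-out forest problem
    fake_spark.read.csv.assert_called_once_with("in.csv")
    fake_df = fake_spark.read.csv.return_value
    fake_df.write.csv.assert_called_once_with("/tmp/out")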

Personally, I have used both approaches and I would favour the second one, allowing reads and writes to happen on a local Spark instance, as it tells you so much more than just whether functions were called in the right order and with the right parameters.



Ed

Re: Mock spark reads and writes

Jeff Evans
In reply to this post by Dark Crusader
Why do you need to mock the read/write at all? Why not have a test CSV file, invoke your function (which will perform a real Spark DataFrame read of the CSV), let it write its output, and assert on that output?
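
A minimal sketch of that style of test, assuming the local spark fixture sketched earlier in the thread and a hypothetical my_job.main() refactored to take the output path as a second argument:

import my_job  # hypothetical module under test

def test_main_end_to_end(spark, tmp_path):
    in_file = tmp_path / "input.csv"
    in_file.write_text("1,2\n3,4\n")          # tiny test CSV

    out_dir = tmp_path / "out"
    my_job.main(str(in_file), str(out_dir))   # real Spark read and write, no mocks

    rows = spark.read.csv(str(out_dir)).collect()
    assert len(rows) == 2                     # assert on the written output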
