Unit testing Spark/Scala code with Mockito


Unit testing Spark/Scala code with Mockito

Mich Talebzadeh
Hi,

I have a Spark job that reads an XML file from HDFS, processes it, and writes the data to two Hive tables: one for good records and one for exceptions.

The code itself works fine. I need to create unit tests for it with Mockito. A unit test should test functionality in isolation; side effects from other classes or the system should be eliminated for a unit test, if possible. So basically there are three classes:

  1. Class A reads the XML file and creates DF1 from it, plus DF2 on top of DF1. Test data for the XML file has already been created.
  2. Class B reads DF2 and posts the correct data, via a TempView and Spark SQL, to the underlying Hive table.
  3. Class C reads DF2 and posts the exception data, again via a TempView and Spark SQL, to the underlying Hive exception table.
I would like to know, for tests covering Class B and Class C, what Mockito pattern should be used.
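One pattern that may help here (a sketch only, not the author's actual code): factor the row-level logic out of Classes B and C into a pure function, and put the Hive write behind a small trait that a Mockito mock can stand in for. All names below (`Record`, `classify`, `HiveWriter`) are hypothetical.

```scala
// Hypothetical record type standing in for one row of DF2;
// the real schema would come from the XML source.
case class Record(id: Int, payload: String)

// Pure row-level logic: split rows into good and exception sets.
// Classes B and C would each consume one side of this split, so
// the rule itself is testable with no Spark or Hive at all.
def classify(rows: Seq[Record]): (Seq[Record], Seq[Record]) =
  rows.partition(r => r.id > 0 && r.payload.nonEmpty)

// The Hive write stays behind a small trait, so Mockito can mock
// it and a test can verify the call without a real metastore.
trait HiveWriter {
  def write(table: String, rows: Seq[Record]): Unit
}
```

With this split, most of the test surface needs neither a SparkSession nor a Hive connection.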

Thanks,

Mich




Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.

 


Re: Unit testing Spark/Scala code with Mockito

Mich Talebzadeh
On a second note, with regard to Spark reads and writes: as I understand it, unit tests are not meant to test database connections. That should be done in integration tests, which check that all the parts work together. Unit tests are only meant to test the functional logic, not Spark's ability to read from a database.

I would have thought that if specific connectivity through a third-party tool is required (in my case, reading the XML file using the Databricks jar), then this should be exercised in the Read-Evaluate-Print-Loop (REPL) environment of the Spark shell, by writing some code to quickly establish whether the API successfully reads from the XML file.
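Such a spark-shell check might look like the following (a sketch; the package version, HDFS path, and `rowTag` value are placeholders, and `spark` is the session spark-shell provides):

```scala
// Launched with e.g.: spark-shell --packages com.databricks:spark-xml_2.11:0.9.0
val df1 = spark.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "record")                  // placeholder row tag
  .load("hdfs://namenode/path/to/test.xml")    // placeholder path

df1.printSchema()                // quick check the XML parsed as expected
df1.show(5, truncate = false)    // eyeball a few rows
```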

Does this assertion sound correct?

thanks,

Mich



LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

 




 




 


Re: Unit testing Spark/Scala code with Mockito

ZHANG Wei
AFAICT, it depends on the testing goal: Unit Test, Integration Test, or E2E Test.

For a Unit Test, you mostly test an individual class or its methods.
Mockito can help mock and verify dependent instances or methods.
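As a sketch of that mock-and-verify step (assumes mockito-core on the test classpath; `HiveWriter` and the table name are hypothetical stand-ins for the real dependency):

```scala
import org.mockito.Mockito.{mock, verify}

// Hypothetical dependency that the class under test writes through.
trait HiveWriter {
  def write(table: String, rows: Seq[String]): Unit
}

val writer = mock(classOf[HiveWriter])

// The real test would inject `writer` into Class B and call its
// method; here the call is inlined for brevity.
writer.write("good_table", Seq("row1"))

// Verify the dependent call happened, with the expected arguments.
verify(writer).write("good_table", Seq("row1"))
```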

For an Integration Test, some Spark testing helper methods can set up the
environment, such as `runInterpreter`[1] for running code in the REPL. The
data source can be mocked with `Seq(...).toDS()` or by reading a local file;
there is no need to access a Hive service.
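The `Seq(...).toDS()` idea might look like this in a test body (a sketch; assumes spark-sql on the test classpath, with placeholder column and view names):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[2]")            // local mode: no cluster, no Hive service
  .appName("df2-unit-test")
  .getOrCreate()
import spark.implicits._

// Mock the data source in memory instead of reading XML from HDFS.
val df2 = Seq((1, "good"), (2, "")).toDS().toDF("id", "payload")
df2.createOrReplaceTempView("df2_view")

// Run the same Spark SQL the job would run, against the temp view.
val good = spark.sql("SELECT * FROM df2_view WHERE payload <> ''")
assert(good.count() == 1)

spark.stop()
```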

For an E2E Test, HDFS and Hive (normally, a local mini version) have
to be set up to serve the real operations from Spark.

Just my 2 cents.

--
Cheers,
-z
[1] https://github.com/apache/spark/blob/a06768ec4d5059d1037086fe5495e5d23cde514b/repl/src/test/scala/org/apache/spark/repl/ReplSuite.scala#L49


---------------------------------------------------------------------
To unsubscribe e-mail: [hidden email]