Assertion of return value of dataframe in pytest

Assertion of return value of dataframe in pytest

Mich Talebzadeh
Hi,

In pytest I want to ensure that the composed DataFrame returns the correct result.

Example

    df2 = house_df. \
        select( \
        F.date_format('datetaken', 'yyyy').cast("Integer").alias('YEAR') \
        , 'REGIONNAME' \
        , F.round(F.avg('averageprice').over(wSpecY)).alias('AVGPRICEPERYEAR') \
        , F.round(F.avg('flatprice').over(wSpecY)).alias('AVGFLATPRICEPERYEAR') \
        , F.round(F.avg('TerracedPrice').over(wSpecY)).alias('AVGTERRACEDPRICEPERYEAR') \
        , F.round(F.avg('SemiDetachedPrice').over(wSpecY)).alias('AVGSDPRICEPRICEPERYEAR') \
        , F.round(F.avg('DetachedPrice').over(wSpecY)).alias('AVGDETACHEDPRICEPERYEAR')). \
        distinct().orderBy('datetaken', ascending=True)

Will it be enough just to run this assertion?

  assert not []

I suspect that is flawed, because the assertion never inspects df2, so an error or an empty result would still pass.
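
Or would something more explicit along these lines be needed? Just a sketch; it assumes df2 is accessible to the test (e.g. via a fixture) and takes the expected column list from the select above:

    def test_df2_is_well_formed():
        # Check the projected columns and that at least one row came back,
        # rather than asserting on an unrelated empty list
        expected_cols = ['YEAR', 'REGIONNAME', 'AVGPRICEPERYEAR',
                         'AVGFLATPRICEPERYEAR', 'AVGTERRACEDPRICEPERYEAR',
                         'AVGSDPRICEPRICEPERYEAR', 'AVGDETACHEDPRICEPERYEAR']
        assert df2.columns == expected_cols
        assert df2.count() > 0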

Thanks




Re: Assertion of return value of dataframe in pytest

Mich Talebzadeh
It appears that the following assertion works, on the basis that the result set count can be = 0 (no data) or > 0 (there is data):

assert df2.count() >= 0

However, suppose I want to write to a JDBC database from PySpark through a function (already defined in another module), as below:


import sys

def writeTableToOracle(dataFrame, mode, dataset, tableName):
    # Write a DataFrame to an Oracle table over JDBC.
    # oracle_url and config are defined at module level.
    try:
        dataFrame. \
            write. \
            format("jdbc"). \
            option("url", oracle_url). \
            option("dbtable", tableName). \
            option("user", config['OracleVariables']['oracle_user']). \
            option("password", config['OracleVariables']['oracle_password']). \
            option("driver", config['OracleVariables']['oracle_driver']). \
            mode(mode). \
            save()
    except Exception as e:
        print(f"{e}, quitting")
        sys.exit(1)


and call it in the program


from sparkutils import sparkstuff as s

s.writeTableToOracle(df2,"overwrite",config['OracleVariables']['dbschema'],config['OracleVariables']['yearlyAveragePricesAllTable'])


How can one assert in pytest that this write worked correctly?
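
For a pure unit test, one option might be to patch the DataFrameWriter so that no real Oracle connection is needed. A rough sketch only; sample_df is a made-up fixture and the schema/table names are placeholders:

    from unittest.mock import patch
    from sparkutils import sparkstuff as s

    def test_writeTableToOracle_calls_save(sample_df):
        # Patch DataFrameWriter.save so the test never touches Oracle;
        # we only check that the function drives the writer without error
        with patch("pyspark.sql.readwriter.DataFrameWriter.save") as mock_save:
            s.writeTableToOracle(sample_df, "overwrite", "MYSCHEMA", "MYTABLE")
            mock_save.assert_called_once()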


Thanks again



Re: Assertion of return value of dataframe in pytest

Marco Mistroni
Hello,
my 2 cents:
That would be an integration test writing to a 'dev' database (which you might pre-populate and clean up after your runs, so you have repeatable data).
Then either you:
1 - use plain SQL and assert that the values you stored from your dataframe are the same as what you get back from your SQL query, or
2 - since there is a dataframe.write, surely there is also a corresponding read you can use (something like the sketch below)?
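
For point 2, something along these lines might do as the read-back half of the check. A rough sketch that simply mirrors the JDBC options used on the write side:

    read_df = spark.read. \
        format("jdbc"). \
        option("url", oracle_url). \
        option("dbtable", tableName). \
        option("user", config['OracleVariables']['oracle_user']). \
        option("password", config['OracleVariables']['oracle_password']). \
        option("driver", config['OracleVariables']['oracle_driver']). \
        load()

    assert read_df.count() == df2.count()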


hth
 Marco




Re: Assertion of return value of dataframe in pytest

Mich Talebzadeh
Thanks Marco.

This is one approach:

    # Start with the dataframe we defined to write to Oracle
    df2 = house_df. \
        select( \
        F.date_format('datetaken', 'yyyy').cast("Integer").alias('YEAR') \
        , 'REGIONNAME' \
        , F.round(F.avg('averageprice').over(wSpecY)).alias('AVGPRICEPERYEAR') \
        , F.round(F.avg('flatprice').over(wSpecY)).alias('AVGFLATPRICEPERYEAR') \
        , F.round(F.avg('TerracedPrice').over(wSpecY)).alias('AVGTERRACEDPRICEPERYEAR') \
        , F.round(F.avg('SemiDetachedPrice').over(wSpecY)).alias('AVGSDPRICEPRICEPERYEAR') \
        , F.round(F.avg('DetachedPrice').over(wSpecY)).alias('AVGDETACHEDPRICEPERYEAR')). \
        distinct().orderBy('datetaken', ascending=True)
    assert df2.count() >= 0
    # Write df2 to the Oracle table with the overwrite option
    s.writeTableToOracle(df2, "overwrite", config['OracleVariables']['dbschema'], config['OracleVariables']['yearlyAveragePricesAllTable'])
    # Read the data just written to the Oracle table into the read_df dataframe
    fullyQualifiedTableName = config['OracleVariables']['dbschema'] + '.' + config['OracleVariables']['yearlyAveragePricesAllTable']
    read_df = s.loadTableFromOracleJDBC(spark, fullyQualifiedTableName)
    # Check that every row written is present in what was read back
    assert df2.subtract(read_df).count() == 0


This passes the test.
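
One caveat: subtract() only flags rows of df2 that are missing from read_df, so one could also verify the reverse direction and the row counts. Sketch:

    # Rows missing in either direction should be none, and counts should match
    assert df2.subtract(read_df).count() == 0
    assert read_df.subtract(df2).count() == 0
    assert df2.count() == read_df.count()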


