Stream to Stream joins

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
1 message Options
Reply | Threaded
Open this post in threaded view
|

Stream to Stream joins

Hamish Whittal
Hi folks,

I've got a stream coming from Kafka. It has the following schema:

 userdata : { id: INT, acctid: INT, uid: STRING, logintm: datetime }

I'm trying to count the number of logins by acctid.

I can do the count fine, but the table only has the acctid and the count. Now I wish to get all the other columns.

I have tried to do this:
origdata = uDF2.select("userdetail.*")
# This gives me all the columns.
# +---+-------+---------+--------------------+
#  | id| acctid|      uid|             logintm|
# +---+-------+---------+--------------------+
#  |014|1075627|curtis.ga|2020-08-24 13:58:...|
#  +---+-------+---------+--------------------+

logins = origdata\
               .withWatermark('logintm', '10 minutes')\
               .groupBy("acctid", "logintm")\
               .count()\
               .select('acctid', 'count')
# This gives me the count of the accountId.

Now I thought I could do this
   alldata = logins.join(origdata, "acctid")

But I get no rows returned. Am I just completely missing the mark here? The original dataframe has all the columns as expected (including the join column) and the count has the join column and the count.

If I can't do this, then how otherwise does one get all the other data WITH the count?

Thanks for your help in advance.

Hamish