In some cases BroadcastNestedLoopJoin is the only viable join method. In your example for instance you are using a non-equi join condition and BNLJ is the only method that works in that case. This is also the reason why you can't disable it using the spark.sql.autoBroadcastJoinThreshold configuration.
Such a plan is generally generated by using a NOT IN (subquery), if you are OK with slightly different NULL semantics then you could use NOT EXISTS(subquery). The latter should perform a lot better.
i want to ask a question about broadcast nestloop join? from google i know, that
left outer/semi join and right outer/semi join will use broadcast nestloop.
and in some cases, when the input data is very small, it is suitable to use. so here
how to define the input data very small? what parameter decides the threshold? I just want to disable it ( i found that set spark.sql.autoBroadcastJoinThreshold= -1 is no work for sql:select a.key1 from testdata1 as a where a.key1 not in (select key3 from testdata3) )
explain cost select a.key1 from testdata1 as a where a.key1 not in (select key3 from testdata3);