Rishi's approach will work, but it's worth mentioning that because all of the data goes into only two groups, the resulting data will be processed by only two tasks, so you lose almost all parallelism. Presumably you're processing a lot of data, since you only want to do one pass, so I doubt that would actually be helpful.
Unfortunately, I don't think there is currently a better approach than doing two passes. With more information about the downstream processing there may be alternatives, but in general I think you're stuck.
E.g., here's a slight variation on Rishi's proposal that may or may not work:
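Something along these lines, sketched in plain Python (the `matches` predicate, the 500-way bucket count, and the dict-based grouping are all stand-ins; a real pipeline would use its framework's own keying and groupBy machinery):

```python
from collections import defaultdict

N_SUBGROUPS = 500  # assumed fan-out; tune to your cluster's parallelism

def matches(record):
    # Hypothetical predicate -- replace with your actual filter condition.
    return record % 2 == 0

def compound_key(record):
    # First component: which of the two logical groups the record belongs to.
    # Second component: spread each side across N_SUBGROUPS buckets so the
    # downstream tasks stay balanced instead of collapsing to two.
    return (matches(record), hash(record) % N_SUBGROUPS)

def group_by_compound_key(records):
    # Single pass over the input, grouping by the compound key.
    groups = defaultdict(list)
    for record in records:
        groups[compound_key(record)].append(record)
    return groups
```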
which splits the data by a compound key -- first a label of whether or not the record matches, and then a subdivision into another 500 groups. This yields nicely balanced tasks within each group, but it also forces a shuffle of all the data, which can be pretty expensive. You might be better off just doing two passes over the raw data.