Jeroen Miller
On 28 Dec 2017, at 19:25, Patrick Alwell <[hidden email]> wrote:
> You are using groupByKey(); have you thought of an alternative like aggregateByKey() or combineByKey() to reduce shuffling?

I am aware of this indeed. I do have a groupByKey() that is difficult to avoid, but the problem occurs afterwards.
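
For readers following the thread, here is a minimal plain-Python model (no Spark required) of why aggregateByKey() tends to shuffle less than groupByKey(). It is only a sketch: the partitions are plain lists, the function name is hypothetical, and the "shuffled" count is just the number of records that would leave each partition, not anything measured on a real cluster.

```python
def aggregate_by_key(partitions, zero, seq_op, comb_op):
    """Toy model of aggregateByKey: fold values per key inside each
    partition first, then merge the per-partition partial results.

    Returns (result, shuffled), where `shuffled` counts the partial
    records that would cross the shuffle boundary.
    Assumes `zero` is immutable (Spark copies the zero value per key).
    """
    partials = []
    for part in partitions:
        acc = {}
        for key, value in part:
            # seq_op merges one value into the partition-local accumulator
            acc[key] = seq_op(acc.get(key, zero), value)
        partials.append(acc)

    # At most one record per key per partition is sent over the network
    shuffled = sum(len(acc) for acc in partials)

    result = {}
    for acc in partials:
        for key, partial in acc.items():
            # comb_op merges accumulators coming from different partitions
            result[key] = comb_op(result[key], partial) if key in result else partial
    return result, shuffled


# Two made-up partitions of (key, value) pairs
partitions = [
    [("a", 1), ("b", 2), ("a", 3)],
    [("a", 4), ("b", 5), ("b", 6)],
]

totals, shuffled = aggregate_by_key(
    partitions, 0, lambda acc, v: acc + v, lambda x, y: x + y
)

# A groupByKey-style shuffle moves every input record
group_by_key_shuffled = sum(len(p) for p in partitions)
```

In this toy input, the pre-aggregated version moves 4 partial records where a plain group would move all 6. The gain only exists when the per-key result is smaller than the list of values, which is exactly why some groupByKey() calls are hard to replace.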

> Dynamic allocation is great, but sometimes I’ve found explicitly setting the number of executors, cores per executor, and memory per executor to be a better alternative.

I will try with dynamic allocation off.
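
A minimal sketch of what a fixed allocation could look like on YARN, written as a small Python helper that assembles the spark-submit arguments. The helper name, the numbers, and the script name my_job.py are placeholders for illustration, not values taken from this job.

```python
def fixed_allocation_args(num_executors, executor_cores, executor_memory):
    """Build spark-submit arguments that pin executor resources
    instead of relying on dynamic allocation."""
    return [
        "--conf", "spark.dynamicAllocation.enabled=false",
        "--num-executors", str(num_executors),
        "--executor-cores", str(executor_cores),
        "--executor-memory", executor_memory,
    ]


# Placeholder sizing only; the right values depend on the cluster
args = fixed_allocation_args(10, 4, "8g")

# my_job.py is a hypothetical application script
command = ["spark-submit", *args, "my_job.py"]
print(" ".join(command))
```

Keeping the sizing in one place makes it easy to rerun the same job with a few different executor shapes and compare the results.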

> Take a look at the YARN logs as well for the particular executor in question. Executors can run multiple tasks and will often fail if they have more tasks than available threads.

The trouble is that there is nothing significant in the logs (read: nothing that is clear enough for me to understand!). Is there any particular message I could grep for?
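
As a starting point, here is a rough sketch of a filter for aggregated YARN logs. The pattern list is an assumption: it collects strings that commonly accompany lost executors in Spark on YARN, is certainly not exhaustive, and may not match what this particular job produces. The function name is hypothetical.

```python
import re

# Assumed, non-exhaustive list of strings that often appear
# around executor failures in Spark-on-YARN logs
PATTERNS = [
    r"Container killed by YARN for exceeding memory limits",
    r"java\.lang\.OutOfMemoryError",
    r"ExecutorLostFailure",
    r"FetchFailedException",
    r"Lost executor",
]
_REGEX = re.compile("|".join(PATTERNS))


def suspicious_lines(lines):
    """Yield (line_number, line) for every line matching a pattern."""
    for number, line in enumerate(lines, start=1):
        if _REGEX.search(line):
            yield number, line.rstrip("\n")
```

One way to use it would be to save the output of `yarn logs -applicationId <application id>` to a file, open that file, and pass the file object to suspicious_lines(). The line numbers then point back into the full log for context.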

> [...] https://spark.apache.org/docs/latest/tuning.html#level-of-parallelism
> [...] https://spark.apache.org/docs/latest/hardware-provisioning.html

Thanks for the pointers -- will have a look!


