We are running Spark 2.3 / Python 3.5.2. For one job we run the following code (note that the input txt files are just a simplified example; in fact these are large binary files,
and sc.binaryFiles(...) runs out of memory loading the content, therefore only the filenames are parallelized and the executors open/read the content):
18/05/02 10:17:18 INFO TaskSchedulerImpl: Removed TaskSet 3.0, whose tasks have all completed, from pool