Best practice for retrieving big data from RDD to local machine

Best practice for retrieving big data from RDD to local machine

Egor Pahomov
Hello. I have a large RDD (1 GB) on a YARN cluster. The local machine that uses this cluster has only 512 MB of memory. I'd like to iterate over the values of the resulting RDD on my local machine. I can't use collect(), because it would create an array locally that is larger than my heap. I need an iterative approach. There is the iterator() method, but it requires additional information that I can't provide. (http://stackoverflow.com/questions/21698443/best-practice-for-retrieving-big-data-from-rdd-to-local-machine)

--
Sincerely yours
Egor Pakhomov
Scala Developer, Yandex
Re: Best practice for retrieving big data from RDD to local machine

Andrew Ash
Hi Egor,

It sounds like you should vote for https://spark-project.atlassian.net/browse/SPARK-914, which proposes making an RDD iterable from the driver.
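
To illustrate what "iterable from the driver" could mean in practice, here is a minimal sketch in plain Python, not Spark code. It assumes the data can be fetched one partition at a time; the names iterate_by_partition and fetch_partition are hypothetical and stand in for whatever per-partition job the cluster would run. The point is only that the local process never holds more than one partition at once.

```python
from typing import Callable, Iterator, List, TypeVar

T = TypeVar("T")


def iterate_by_partition(
    num_partitions: int,
    fetch_partition: Callable[[int], List[T]],
) -> Iterator[T]:
    """Yield every element while holding at most one partition locally."""
    for index in range(num_partitions):
        # Only this partition's elements are materialised locally; the
        # previous list can be garbage-collected once the loop moves on.
        for value in fetch_partition(index):
            yield value


# Stand-in for a distributed dataset: three "partitions" in a plain list.
# On a real cluster, fetch_partition would run a job against one partition
# (an assumption made for this sketch).
partitions = [[1, 2, 3], [4, 5], [6]]
result = list(iterate_by_partition(len(partitions), lambda i: partitions[i]))
print(result)
```

With this approach the local heap only has to fit the largest single partition, so repartitioning into smaller pieces before iterating would lower the requirement. Current Spark releases provide RDD.toLocalIterator, which is documented as consuming as much memory as the largest partition; whether it is available depends on the Spark version in use.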


On Wed, Feb 12, 2014 at 1:07 AM, Egor Pahomov <[hidden email]> wrote:
Hello. I have a large RDD (1 GB) on a YARN cluster. The local machine that uses this cluster has only 512 MB of memory. I'd like to iterate over the values of the resulting RDD on my local machine. I can't use collect(), because it would create an array locally that is larger than my heap. I need an iterative approach. There is the iterator() method, but it requires additional information that I can't provide. (http://stackoverflow.com/questions/21698443/best-practice-for-retrieving-big-data-from-rdd-to-local-machine)
