Process Million Binary Files


Process Million Binary Files

Joel D
Hi,

I need to process millions of PDFs in HDFS using Spark. First I’m trying with some 40k files. I’m using the binaryFiles API, with which I’m facing a couple of issues:

1. It creates only 4 tasks, and I can’t seem to increase the parallelism there.
2. It took 2276 seconds, which means that for millions of files it will take ages to complete. I’m also expecting it to fail for millions of records with some timeout or GC overhead exception.

val files = sparkSession.sparkContext.binaryFiles(filePath, 200).cache

val fileContentRdd = files.map(file => myFunc(file))



Do you have any guidance on how I can process millions of files using the binaryFiles API?

How can I increase the number of tasks/parallelism during the creation of the files RDD?

Thanks


Re: Process Million Binary Files

Jörn Franke
I believe your use case would be better covered by writing your own data source for reading PDF files.

On Big Data platforms in general you have the issue that individual PDF files are very small and there are a lot of them - this is not very efficient for those platforms. That could also be one source of your performance problems (not necessarily the parallelism). You would need to make 1 million requests to the namenode (this could also be interpreted as a denial-of-service attack). Historically, Hadoop Archives were introduced to address this problem: https://hadoop.apache.org/docs/r1.2.1/hadoop_archives.html
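For illustration, packing a directory of small PDFs into a single Hadoop Archive uses the standard `hadoop archive` tool (the paths here are hypothetical):

```
# Pack many small PDF files into one HAR, reducing namenode metadata pressure.
# -p gives the parent directory relative to which sources are resolved.
hadoop archive -archiveName pdfs.har -p /data/pdfs /data/archived

# The archive's contents can then be listed through the har:// scheme:
hdfs dfs -ls har:///data/archived/pdfs.har
```

The files inside the archive stay readable through the `har://` filesystem, so downstream jobs can consume them without a million separate namenode lookups.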

You can also try storing them first in HBase or, in the future, on Hadoop Ozone. That could enable higher parallelism "out of the box".
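As a stopgap within Spark itself: the second argument to binaryFiles is only a minimum-partitions suggestion, so an explicit repartition after the read can spread the per-file work across more tasks (this sketch reuses the sparkSession, filePath, and myFunc names from the original post; it does not fix the namenode pressure described above):

```scala
// Sketch: binaryFiles packs small files into few partitions, and its
// second argument is only a hint, so force a repartition afterwards.
val files = sparkSession.sparkContext
  .binaryFiles(filePath)  // RDD[(String, PortableDataStream)]
  .repartition(200)       // spread the 40k files over 200 tasks

val fileContentRdd = files.map(file => myFunc(file))
```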

On 10.10.2018, at 23:56, Joel D <[hidden email]> wrote:



Re: Process Million Binary Files

Nicolas Paris-2
In reply to this post by Joel D
Hi Joel

I built such a pipeline to transform PDF -> text:
https://github.com/EDS-APHP/SparkPdfExtractor
You can take a look.

It transforms 20M PDFs in 2 hours on a 5-node Spark cluster.
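The repository above has the real implementation; as a rough illustration only, the core of such a PDF-to-text job using Apache PDFBox could look like the sketch below (the path and partition count are assumptions, not taken from that code):

```scala
import org.apache.pdfbox.pdmodel.PDDocument
import org.apache.pdfbox.text.PDFTextStripper

// Sketch: extract text from each PDF read via binaryFiles.
// mapPartitions builds the stripper once per task rather than per file.
val texts = sparkSession.sparkContext
  .binaryFiles("hdfs:///data/pdfs")  // hypothetical input path
  .repartition(200)
  .mapPartitions { iter =>
    val stripper = new PDFTextStripper()
    iter.map { case (path, stream) =>
      val doc = PDDocument.load(stream.toArray())  // bytes -> parsed PDF
      try (path, stripper.getText(doc)) finally doc.close()
    }
  }
```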

On 2018-10-10 at 23:56, Joel D wrote:

