Best practices for dealing with a large number of PDF files


Best practices for dealing with a large number of PDF files

unk1102
Hi, I need guidance on dealing with a large number of PDF files when using Hadoop
and Spark. Should I store them as binary files using sc.binaryFiles and then convert
them to text with PDF parsers like Apache Tika or PDFBox, or should I convert them
to text with those parsers up front and store the text files? With the latter I am
losing colors, formatting, etc. Please guide me.
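For concreteness, a minimal sketch of the sc.binaryFiles route the question describes, using PDFBox for extraction; the paths, output layout, and the choice of PDFBox over Tika are illustrative assumptions, not a recommendation from the thread:

import org.apache.pdfbox.pdmodel.PDDocument
import org.apache.pdfbox.text.PDFTextStripper
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("pdf-to-text").getOrCreate()
val sc = spark.sparkContext

// binaryFiles yields one (path, PortableDataStream) pair per file
val texts = sc.binaryFiles("hdfs:///input/pdfs/*.pdf").map {
  case (path, stream) =>
    val doc = PDDocument.load(stream.toArray()) // parse the PDF bytes
    try (path, new PDFTextStripper().getText(doc))
    finally doc.close()
}

// one tab-separated line per document, newlines flattened
texts.map { case (p, t) => p + "\t" + t.replace('\n', ' ') }
  .saveAsTextFile("hdfs:///output/pdf-text")

Note that pointing binaryFiles at tens of millions of individual files runs into the HDFS small-files problem, which is exactly what the first reply below addresses.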





Re: Best practices for dealing with a large number of PDF files

Nicolas Paris
Hi,

The problem is the number of files on Hadoop: HDFS copes badly with tens of millions of small files.

I deal with 50M PDF files. What I did is put them into an Avro table on HDFS, as a binary column.

Then I read it with Spark and push each binary into PDFBox.

Transforming 50M PDFs into text took 2 hours on a cluster of 5 machines.

About colors and formatting, I guess PDFBox is able to get that information, and then
maybe you could add HTML tags to your text output.
That's some extra work indeed.
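A minimal sketch of this pipeline, assuming an Avro table with a string "path" column and a binary "content" column; the column names, paths, and data-source name are assumptions, not Nicolas's actual code:

import org.apache.pdfbox.pdmodel.PDDocument
import org.apache.pdfbox.text.PDFTextStripper
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("avro-pdfs-to-text").getOrCreate()
import spark.implicits._

// Requires the spark-avro package ("avro" in Spark 2.4+,
// "com.databricks.spark.avro" in older versions)
val pdfs = spark.read.format("avro").load("hdfs:///data/pdfs_avro/")

val texts = pdfs.select($"path", $"content").as[(String, Array[Byte])]
  .map { case (path, bytes) =>
    val doc = PDDocument.load(bytes) // PDFBox parses the binary column
    try (path, new PDFTextStripper().getText(doc))
    finally doc.close()
  }

// extracted text can contain newlines; the CSV writer quotes such fields
texts.toDF("path", "text").write.csv("hdfs:///data/pdf_text_csv/")

Because the 50M PDFs travel inside a handful of large Avro files, the job avoids the small-files problem entirely and parallelizes over Avro blocks rather than individual files.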







Re: Best practices for dealing with a large number of PDF files

Deepak Sharma
Is there any open-source code base to refer to for this kind of use case?

Thanks
Deepak




Re: Best practices for dealing with a large number of PDF files

unk1102
In reply to this post by Nicolas Paris
Hi Nicolas, thanks a lot for the reply. Do you have any sample code somewhere?
Do you just keep the PDFs in Avro as binary all the time? How often do you parse
them into text using PDFBox? Is it on an on-demand basis, or do you always parse
to text and keep the PDF binary in Avro merely as an interim state?





Re: Best practices for dealing with a large number of PDF files

Nicolas Paris
2018-04-23 18:59 GMT+02:00 unk1102 <[hidden email]>:
> Hi Nicolas, thanks a lot for the reply. Do you have any sample code somewhere?

I have some open-source code. I could find time to push it to GitHub if needed.

 
> Do you just keep the PDFs in Avro as binary all the time?

Yes, I store them. Actually, I did that once for the 50M PDFs, and I now do it daily
for the 100K new ones; each run is archived on HDFS so that I can query them all with Hive in a table backed by multiple Avro files.
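A hedged sketch of how such a table might be declared so every archived Avro run under one directory is queryable at once; it assumes a Hive-enabled SparkSession, and the table name, column names, and location are illustrative:

// Illustrative DDL only: expose all archived Avro runs under one
// directory as a single external Hive table.
spark.sql("""
  CREATE EXTERNAL TABLE IF NOT EXISTS pdf_archive (
    path STRING,
    content BINARY
  )
  STORED AS AVRO
  LOCATION 'hdfs:///data/pdfs_avro/'
""")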

 
> How often do you parse them into text using PDFBox?

Each time I improve my PDFBox extractor program, say... maybe once a year.

 
> Is it on an on-demand basis, or do you always parse to text and keep the PDF
> binary in Avro merely as an interim state?


It can be both. Also, I store them in an ORC file for another use case, with a web service
on top of it to share the PDFs. That table is 4 TB and contains the 50M PDFs. It gets MERGEd
every day with the new 100K PDFs, thanks to Hive MERGE and ORC ACID capabilities.
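For illustration, a hedged sketch of such a daily merge. Hive ACID MERGE (Hive 2.2+, transactional ORC target table) runs in Hive itself rather than Spark, so this submits it over the HiveServer2 JDBC driver; the connection URL, table names, and join key are all assumptions:

import java.sql.DriverManager

// Submit the ACID MERGE to HiveServer2; pdf_daily holds the new 100K rows.
val conn = DriverManager.getConnection("jdbc:hive2://hiveserver:10000/default")
try {
  conn.createStatement().execute("""
    MERGE INTO pdf_store AS t
    USING pdf_daily AS s
    ON t.path = s.path
    WHEN MATCHED THEN UPDATE SET content = s.content
    WHEN NOT MATCHED THEN INSERT VALUES (s.path, s.content)
  """)
} finally conn.close()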





Re: Best practices for dealing with a large number of PDF files

unk1102
Hi Nicolas, thanks a lot for the guidance, it was very useful information. If you can
push that code to GitHub and share the URL, it would be a great help; looking forward
to it. If you can find time to push it early, that would be an even greater help, as
I have to finish a POC on this use case ASAP.





Re: Best practices for dealing with a large number of PDF files

Deepak Sharma
Yes, Nicolas.
It would be a great help if you could push the code to GitHub and share the URL.

Thanks
Deepak



Re: Best practices for dealing with a large number of PDF files

Nicolas Paris
Sure, then let me recap the steps:
1. load the PDFs from a local folder into Avro on HDFS (a minimal sketch of this step follows below)
2. load the Avro in Spark as an RDD
3. apply PDFBox to each PDF and return its content as a string
4. write the result as one huge CSV file

That's some work for me to push, guys. I should find some time within 7 days, however.

@unk1102: this won't cover the colors and formatting aspects, so you could play with PDFBox until I release
the other parts.
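A minimal local sketch of step 1 using the Avro Java API; the schema, field names, and paths are illustrative assumptions, not the code that will be published:

import java.io.File
import java.nio.ByteBuffer
import java.nio.file.Files
import org.apache.avro.SchemaBuilder
import org.apache.avro.file.DataFileWriter
import org.apache.avro.generic.{GenericData, GenericDatumWriter, GenericRecord}

// One record per PDF: its path plus the raw bytes as an Avro "bytes" field.
val schema = SchemaBuilder.record("Pdf").fields()
  .requiredString("path")
  .requiredBytes("content")
  .endRecord()

val writer = new DataFileWriter[GenericRecord](
  new GenericDatumWriter[GenericRecord](schema))
writer.create(schema, new File("pdfs.avro"))

for (f <- new File("/data/pdfs").listFiles if f.getName.endsWith(".pdf")) {
  val rec = new GenericData.Record(schema)
  rec.put("path", f.getPath)
  rec.put("content", ByteBuffer.wrap(Files.readAllBytes(f.toPath)))
  writer.append(rec)
}
writer.close()
// then, for example: hdfs dfs -put pdfs.avro /data/pdfs_avro/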

Cheers




Re: Best practices for dealing with a large number of PDF files

unk1102
Thanks a lot, Nicolas, I really appreciate it.





Re: Best practices for dealing with a large number of PDF files

Nicolas Paris
Guys,
please open issues for any questions or improvement ideas.

Enjoy

Cheers
