Text from pdf spark

Joel D

I'm trying to extract text from PDF files in HDFS using PDFBox.

However, it throws an error:

"Exception in thread "main" org.apache.spark.SparkException: ...
java.io.FileNotFoundException: /nnAlias:8020/tmp/sample.pdf (No such file or directory)"

What am I missing? Should I be working with PortableDataStream instead of the string part of:

val files: RDD[(String, PortableDataStream)]?

def pdfRead(fileNameFromRDD: (String, PortableDataStream), sparkSession: SparkSession) = {
  val file: File = new File(fileNameFromRDD._1.drop(5))
  val document = PDDocument.load(file) // It throws an error here.

  if (!document.isEncrypted()) {
    val stripper = new PDFTextStripper()
    val text = stripper.getText(document)
    println("Text:" + text)
  }

  document.close()
}

// This is where I call the above PDF-to-text converter method.
val files = sparkSession.sparkContext.binaryFiles("hdfs://nnAlias:8020/tmp/sample.pdf")
files.foreach(println)
files.foreach(f => println(f._1))
files.foreach(fileStream => pdfRead(fileStream, sparkSession))


Thanks.







Re: Text from pdf spark

kathleen li
The error message is “file not found”.
Are you able to access the file with the following command line, as the user who submitted the job?
hdfs dfs -ls /tmp/sample.pdf
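
One possible reading of the FileNotFoundException, sketched under the assumption that the key returned by binaryFiles is the full URI passed in the post: drop(5) removes only "hdfs:", and java.io.File then looks for the remainder on the local filesystem of the JVM, which is not the namespace that hdfs dfs -ls checks. The names PathCheck, stripped and hdfsPath below are hypothetical, chosen only for illustration.

```scala
import java.io.File
import java.net.URI

object PathCheck {
  // Mirrors the post's `fileNameFromRDD._1.drop(5)`: removes the first five characters.
  def stripped(key: String): String = key.drop(5)

  // Parses the key as a URI and returns only its path component.
  def hdfsPath(key: String): String = new URI(key).getPath

  def main(args: Array[String]): Unit = {
    // Assumption: this is the key of the (String, PortableDataStream) pair.
    val key = "hdfs://nnAlias:8020/tmp/sample.pdf"

    // Only "hdfs:" is removed, so the NameNode authority stays in the string.
    println(stripped(key)) // prints //nnAlias:8020/tmp/sample.pdf

    // java.io.File resolves this against the local disk, not HDFS.
    // On Unix-like systems it also collapses the duplicate leading slash,
    // which matches the path shown in the error message.
    val local = new File(stripped(key))
    println(local.getPath)
    println(local.exists())

    // The path that `hdfs dfs -ls` was given is the URI's path component.
    println(hdfsPath(key)) // prints /tmp/sample.pdf
  }
}
```

If this reading holds, the CLI check can succeed while the job still fails, because the two look in different filesystems.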


On Sep 28, 2018, at 12:10 PM, Joel D <[hidden email]> wrote:








Re: Text from pdf spark

Joel D
Yes, I can access the file using the CLI.

On Fri, Sep 28, 2018 at 1:24 PM kathleen li <[hidden email]> wrote:
The error message is “file not found”.
Are you able to access the file with the following command line, as the user who submitted the job?
hdfs dfs -ls /tmp/sample.pdf
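
Since the file is reachable through the HDFS CLI, one option worth trying is to read the PDF through the PortableDataStream half of the pair rather than building a java.io.File from the string half. The following is a minimal sketch, not a confirmed fix: it assumes PDFBox 2.x (where PDDocument.load accepts an InputStream and PDFTextStripper lives in org.apache.pdfbox.text), and the object name PdfTextSketch and the app name are hypothetical.

```scala
import org.apache.pdfbox.pdmodel.PDDocument
import org.apache.pdfbox.text.PDFTextStripper
import org.apache.spark.input.PortableDataStream
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object PdfTextSketch {
  // Reads one PDF through the stream Spark provides, so no local File is involved.
  // Returns None for encrypted documents, mirroring the isEncrypted check in the post.
  def pdfRead(stream: PortableDataStream): Option[String] = {
    val in = stream.open() // DataInputStream over the file's bytes
    try {
      val document = PDDocument.load(in) // assumption: PDFBox 2.x load(InputStream)
      try {
        if (!document.isEncrypted()) Some(new PDFTextStripper().getText(document))
        else None
      } finally document.close()
    } finally in.close()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("pdf-text-sketch").getOrCreate()

    // Same call as in the post; each element is (path, stream).
    val files: RDD[(String, PortableDataStream)] =
      spark.sparkContext.binaryFiles("hdfs://nnAlias:8020/tmp/sample.pdf")

    // Extraction runs on the executors; collect() brings the text back to the driver
    // so it prints in the driver's output. Suitable only for a small number of files.
    files.mapValues(pdfRead).collect().foreach { case (path, text) =>
      println(s"$path: ${text.getOrElse("<encrypted, skipped>")}")
    }

    spark.stop()
  }
}
```

With this shape the string key is used only as a label, so no prefix stripping is needed.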
