How to access line fileName in loading file using the textFile method

Soheil Pourbafrani
Hi, my text data is stored as text files. In the processing logic, I need to know which file each word comes from. Specifically, I need to tokenize the words and create <fileName, word> pairs. The naive solution is to call sc.textFile for each file, keep the file name in a variable, and create the pairs, but that is not efficient, and I got a StackOverflowError as the dataset grew.

So my question is: supposing all the files are in one directory and I read them using sc.textFile("path/*"), how can I tell which file each line came from?

Is it possible (and needed) to customize the textFile method?
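
For readers following along, the naive per-file approach described above might look roughly like the sketch below. It is an illustration only, not the poster's actual code: the function name `naive_pairs` and the `file_paths` argument are hypothetical, and `sc` is assumed to be an existing SparkContext. It collects the per-file RDDs in a list and combines them with a single `sc.union` call. A long chain of pairwise `union` calls instead builds a deeply nested lineage, which is one known way to hit a StackOverflowError, though the thread does not say whether that was the cause here.

```python
import os


def naive_pairs(sc, file_paths):
    """Read each file separately and tag every word with its file name.

    `sc` is assumed to be an existing SparkContext; `file_paths` is a
    hypothetical list of input paths supplied by the caller.
    """
    per_file = []
    for path in file_paths:
        name = os.path.basename(path)
        # Bind `name` as a default argument so each lambda keeps its own
        # file name instead of the value from the last loop iteration.
        pairs = sc.textFile(path).flatMap(
            lambda line, name=name: [(name, word) for word in line.split()]
        )
        per_file.append(pairs)
    # One flat union over all per-file RDDs, rather than a nested chain.
    return sc.union(per_file)
```

This still creates one RDD per input file, so it does not remove the inefficiency the question describes.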

Re: How to access line fileName in loading file using the textFile method

Jörn Franke
You can create your own data source that does exactly this.

Why is the file name important if the file content is the same?

> On 24. Sep 2018, at 13:53, Soheil Pourbafrani <[hidden email]> wrote:
>
> So my question is: supposing all the files are in one directory and I read them using sc.textFile("path/*"), how can I tell which file each line came from?
>
> Is it possible (and needed) to customize the textFile method?

---------------------------------------------------------------------
To unsubscribe e-mail: [hidden email]


Re: How to access line fileName in loading file using the textFile method

Maxim Gekk
In reply to this post by Soheil Pourbafrani
> So my question is: supposing all the files are in one directory and I read them using sc.textFile("path/*"), how can I tell which file each line came from?


--
Maxim Gekk
Technical Solutions Lead
Databricks Inc.


Re: How to access line fileName in loading file using the textFile method

vermanurag
In reply to this post by Soheil Pourbafrani
Spark has sc.wholeTextFiles(), which returns an RDD of tuples. The first
element of each tuple is the file path and the second element is the file content.
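
A minimal PySpark sketch of this suggestion, assuming an existing SparkContext `sc` and plain whitespace tokenization; the helper names `tokenize_file` and `file_word_pairs` are made up for illustration. Since `wholeTextFiles` yields the full path of each file, the sketch keeps only the base name.

```python
import os


def tokenize_file(path_and_content):
    """Turn one (path, content) record into a list of (fileName, word) pairs."""
    path, content = path_and_content
    # Keep only the last path component, e.g. "a.txt" from "file:/data/a.txt".
    file_name = os.path.basename(path)
    # str.split() with no arguments splits on any whitespace, including newlines.
    return [(file_name, word) for word in content.split()]


def file_word_pairs(sc, path):
    """Build an RDD of (fileName, word) pairs for every file matching `path`.

    `sc` is assumed to be an existing SparkContext.
    """
    # Each record is a (file path, whole file content) tuple.
    return sc.wholeTextFiles(path).flatMap(tokenize_file)
```

With a live context, `file_word_pairs(sc, "path/*").take(5)` would return the first few pairs. Note that `wholeTextFiles` reads each file as a single record, so it is best suited to many small files rather than a few very large ones.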



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
