Re: Read or save specific blocks of a file

Re: Read or save specific blocks of a file

Zois Theodoros
Hello Madhav,

What I did is pretty straightforward. Let's say that your HDFS block size is 128 MB and you store a 256 MB file named Test.csv in HDFS.

First, run: `hdfs fsck Test.csv -locations -blocks -files`. It returns some very useful information, including the list of blocks. Let's say you want to read the first block (block 0). On the right side of the line that corresponds to block 0 you can find the IP of the machine that holds this block on its local file system, as well as the blockName (the block pool ID, e.g. BP-1737920335-xxx.xxx.x.x-1510660262864) and the blockID (e.g. blk_1073760915_20091) that will help you recognize it later. So what you need from fsck is the blockName, the blockID and the IP of the machine that holds the block you are interested in.
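
Roughly, that step looks like the sketch below (the HDFS path is just an example, and the exact fsck output format can differ between Hadoop versions); the per-block lines normally carry the blk_ prefix, so you can filter the output down to them:

```
# List the blocks of Test.csv together with the DataNodes that hold them.
# /user/me/Test.csv is a placeholder path; the block lines in the output
# usually show the block index, then BP-.../blk_..., then the DataNode
# addresses on the right.
hdfs fsck /user/me/Test.csv -files -blocks -locations | grep 'blk_'
```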

Once you have these, you have everything you need. All you have to do is connect to that IP and execute: `find /data/hdfs-data/datanode/current/blockName/current/finalized/subdir0/ -name blockID`. That command returns the full local path of the file that holds the contents of Test.csv corresponding to that one HDFS block.
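
For concreteness, here is a rough sketch using the example IDs from above. The local data directory (/data/hdfs-data/datanode) depends on your dfs.datanode.data.dir setting, the block may sit one or more subdir levels deeper than subdir0 (so searching from finalized/ is safer), and the on-disk data file is normally named blk_<id> without the generation stamp, with a blk_<id>_<genstamp>.meta checksum file next to it:

```
# <datanode-ip> and the data directory are placeholders; adjust both to
# your cluster. The wildcard matches the data file (blk_1073760915) and
# its checksum file (blk_1073760915_20091.meta).
ssh user@<datanode-ip> \
  "find /data/hdfs-data/datanode/current/BP-1737920335-xxx.xxx.x.x-1510660262864/current/finalized/ \
        -name 'blk_1073760915*'"
```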

What I do after I get the full path is copy the file, remove the last line (because there is a good chance that the last line is cut off at the block boundary and continues in the next block) and store it back in HDFS under the desired name. Then I can access one block of the file Test.csv from HDFS. That's all; if you need any further information, do not hesitate to contact me.
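
A rough sketch of that last step, assuming BLOCK_PATH holds the path returned by the find command above and the HDFS destination path is just an example:

```
# Copy the block's local data file somewhere writable.
cp "$BLOCK_PATH" /tmp/Test_block0.csv

# Drop the last line, which is most likely cut off at the block boundary.
sed -i '$d' /tmp/Test_block0.csv

# Upload the trimmed block back to HDFS under a new name.
hdfs dfs -put /tmp/Test_block0.csv /user/me/Test_block0.csv
```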

- Thodoris


On Thu, 2018-05-03 at 14:47 +0530, Madhav A wrote:
Thodoris,

I certainly would be interested in knowing how you were able to identify individual blocks and read from them. My understanding was that the HDFS protocol abstracts this away from consumers to prevent potential data corruption issues. I would appreciate it if you could share some details of your approach.

Thanks!
madhav

On Wed, May 2, 2018 at 3:34 AM, Thodoris Zois <[hidden email]> wrote:
That’s what I did :) If you need further information I can post my solution.

- Thodoris

On 30 Apr 2018, at 22:23, David Quiroga <[hidden email]> wrote:

There might be a better way... but I wonder if it might be possible to access the node where the block is stored and read it from the local file system rather than from HDFS.

On Mon, Apr 23, 2018 at 11:05 AM, Thodoris Zois <[hidden email]> wrote:
Hello list,

I have a file on HDFS that is divided into 10 blocks (partitions).

Is there any way to retrieve data from a specific block (e.g. using the blockID)?

Apart from that, is there any option to write the contents of each block (or of one block) into separate files?

Thank you very much,
Thodoris





Re: Read or save specific blocks of a file

ayan guha
Is this a recommended way of reading data in the long run? I think it might be better to write, or look for, an InputFormat that supports this need.

By the way, a block is designed to be an HDFS-internal representation that enables certain features. It would be interesting to understand the use case where a client application really needs to know about it; without that context it sounds like a questionable design.

Best
Ayan



--
Best Regards,
Ayan Guha