What I did is fairly straightforward. Say your HDFS block size is 128 MB and you store a 256 MB file named Test.csv in HDFS.
First, run: `hdfs fsck Test.csv -locations -blocks -files`. It returns a lot of useful information, including the list of blocks. Say you want to read the first block (block 0). On the line that corresponds to block 0 you will find the IP of the machine that holds this block in its local file system, as well as the blockName (the block pool ID, e.g. BP-1737920335-xxx.xxx.x.x-1510660262864) and the blockID (e.g. blk_1073760915_20091), which will help you locate it later. So what you need from fsck is the blockName, the blockID, and the IP of the machine that holds the block you are interested in.
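The three fields above can be pulled out of the fsck output with standard text tools. A small sketch, using a hypothetical sample line so it can be tried without a cluster (the exact fsck line format varies across Hadoop versions, so treat the parsing as an assumption to adapt):

```shell
# Hypothetical fsck output line for block 0; on a real cluster you would
# instead pipe the output of: hdfs fsck Test.csv -locations -blocks -files
FSCK_LINE='0. BP-1737920335-xxx.xxx.x.x-1510660262864:blk_1073760915_20091 len=134217728 Live_repl=3 [DatanodeInfoWithStorage[10.0.1.5:50010,DS-abc,DISK]]'

# Block pool ID (what the post calls blockName): the BP-... token before the colon.
BLOCK_POOL=$(echo "$FSCK_LINE" | sed -n 's/.* \(BP-[^:]*\):.*/\1/p')

# Block ID: the blk_... token, without the trailing generation stamp.
BLOCK_ID=$(echo "$FSCK_LINE" | sed -n 's/.*:\(blk_[0-9]*\)_.*/\1/p')

# DataNode IP: the address inside DatanodeInfoWithStorage[...].
NODE_IP=$(echo "$FSCK_LINE" | sed -n 's/.*DatanodeInfoWithStorage\[\([0-9.]*\):.*/\1/p')

echo "$BLOCK_POOL $BLOCK_ID $NODE_IP"
```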
Once you have these, you have everything you need. Connect to that IP and run: `find /data/hdfs-data/datanode/current/blockName/current/finalized/subdir0/ -name blockID` (substituting the blockName and blockID you got from fsck). The command returns the full local path of the file holding the contents of Test.csv that correspond to that one HDFS block.
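The same find can be tried against a local stand-in for the DataNode directory layout; on a real node the root would be your cluster's `dfs.datanode.data.dir` (here `/data/hdfs-data/datanode`), and the subdir nesting may be deeper than one level:

```shell
# Mimic the DataNode layout: <data.dir>/current/<blockPool>/current/finalized/subdirN/...
ROOT=/tmp/datanode-demo
mkdir -p "$ROOT/current/BP-1737920335-demo/current/finalized/subdir0/subdir0"
touch "$ROOT/current/BP-1737920335-demo/current/finalized/subdir0/subdir0/blk_1073760915"

# Search the finalized tree for the block file by its blockID.
find "$ROOT/current/BP-1737920335-demo/current/finalized/" -name 'blk_1073760915'
```

The block file itself is the raw byte range of Test.csv, so for a plain CSV it is directly readable.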
Once I have the full path, I copy the file, remove its last line (because there is a good chance the last line is cut off and continues in the next block), and store it back in HDFS under the desired name. Then I can access a single block of Test.csv through HDFS. That's all; if you need any further information, do not hesitate to contact me.
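The copy/trim/re-upload step above can be sketched as follows. The paths are hypothetical, the demo data is a stand-in for the raw block file, and `head -n -1` (print all but the last line) is a GNU coreutils extension:

```shell
# Local copy of the raw block file (stand-in content; the real file is the
# path returned by find on the DataNode).
BLOCK_COPY=/tmp/blk_1073760915
printf 'row1\nrow2\nrow3-possibly-cut-off' > "$BLOCK_COPY"

# Drop the last line, which may continue in the next block.
head -n -1 "$BLOCK_COPY" > /tmp/Test_block0.csv
cat /tmp/Test_block0.csv

# Store it back in HDFS under the desired name (needs a running cluster):
# hdfs dfs -put /tmp/Test_block0.csv /user/me/Test_block0.csv
```

Note this drops the partial last record entirely; the matching first line of the *next* block is likewise incomplete, which is why splitting files by raw blocks needs this kind of cleanup.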
On Thu, 2018-05-03 at 14:47 +0530, Madhav A wrote:
Is this a recommended way of reading data in the long run? I think it might be better to write, or look for, an InputFormat which supports the need.
Btw, a block is designed to be an HDFS-internal representation that enables certain features. It would be interesting to understand the use case where the client app really needs to know about it; it sounds like a questionable design without that context.
On Fri, 4 May 2018 at 1:46 am, Thodoris Zois <[hidden email]> wrote: