Beginners Hadoop question


Beginners Hadoop question

goi.cto@gmail.com
Hi,

I am sorry for the beginner's question, but...
I have Spark Java code that reads a file (c:\my-input.csv), processes it, and writes an output file (my-output.csv).
Now I want to run it on Hadoop in a distributed environment.
1) Should my input be one big file or several smaller files?
2) If we use smaller files, how does my code need to change to process all of the input files?

Will Hadoop just copy the files to different servers, or will it also split their content among servers?

Any example will be great!
--
Eran | CTO 

Re: Beginners Hadoop question

Alonso
Hi, I am a beginner too, but from what I have learned, Hadoop works better with big files, at least 64 MB or 128 MB (the typical HDFS block sizes), or even larger. I think you need to aggregate all the files into one new big file. Then you copy it to HDFS using this command:

hadoop fs -put MYFILE /YOUR_ROUTE_ON_HDFS/MYFILE

hadoop fs -put just copies MYFILE into the Hadoop Distributed File System (HDFS); HDFS then splits it into blocks and distributes those blocks across the cluster's nodes.
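To make the two steps concrete, here is a minimal sketch of merging small inputs and uploading the result. The file names and the HDFS path are placeholders, and the hadoop fs line assumes a configured Hadoop client, so it is shown commented out:

```shell
# Merge many small input CSVs into one big file;
# HDFS works best with files of at least one block (64 MB or 128 MB by default).
cat my-input-part-*.csv > my-input.csv

# Copy the merged file into HDFS (requires a configured Hadoop client;
# the destination path is a placeholder):
# hadoop fs -put my-input.csv /YOUR_ROUTE_ON_HDFS/my-input.csv
```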

May I recommend what I have done? Go to BigDataUniversity.com and take the Hadoop Fundamentals I course. It is free and very well documented.

Regards

Alonso Isidoro Roman.

My favorite quotes (today):
"If debugging is the process of removing software bugs, then programming must be the process of putting them in..."
  - Edsger Dijkstra

"If you pay peanuts you get monkeys"





Re: Beginners Hadoop question

Mohit Singh
I am not sure I understand your question correctly.
If you are trying to use Hadoop (as in the MapReduce programming model), then you would basically have to use the Hadoop APIs to solve your problem.
But if you have data stored in HDFS and you want to use Spark to process it, then just specify the input path as sc.textFile("hdfs://..."), where sc is your SparkContext.

Take a look at these examples: http://spark.incubator.apache.org/examples.html
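To make that suggestion concrete, here is a rough, hedged sketch of what the Java side could look like. The HDFS URL, app name, and the per-line map step are all placeholders for your own code, and running it requires a Spark installation and a reachable HDFS cluster:

```java
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;

public class MyJob {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext("local", "my-job");

        // textFile accepts a single file, a directory, or a glob, so the same
        // code works whether the input is one big file or many small ones:
        JavaRDD<String> lines = sc.textFile("hdfs://namenode:9000/input/*.csv");

        // Your per-line processing goes here (placeholder shown):
        JavaRDD<String> processed = lines.map(new Function<String, String>() {
            public String call(String line) {
                return line.toUpperCase();
            }
        });

        // Output is written as a directory of part files, one per partition:
        processed.saveAsTextFile("hdfs://namenode:9000/my-output");
        sc.stop();
    }
}
```

Note that saveAsTextFile produces a directory of part-NNNNN files rather than a single my-output.csv, which is the distributed counterpart of your current single output file.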



--
Mohit

"When you want success as badly as you want the air, then you will get it. There is no other secret of success."
-Socrates

Re: Beginners Hadoop question

goi.cto@gmail.com
In reply to this post by Alonso
Thanks. I will try it!


--
Eran | CTO