Reading Tweets (JSON) in a file into RDD Spark

4 messages

Reading Tweets (JSON) in a file into RDD Spark

ssimanta
I'm new to Spark. 

I have a bunch of files (in HDFS) that contain tweets in JSON format. 
I want to read and parse these into an RDD so that I can do some interactive processing on these tweets. 

Has anyone done something like this before? Any examples? 

I thought I would ask before implementing one myself from scratch. 

Thanks
-Soumya


Re: Reading Tweets (JSON) in a file into RDD Spark

Akhil Das
If those files aren't going to grow, then you can use a simple textFile and do all your processing. 
Sample code is below:

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

object SimpleApp {

  def main(args: Array[String]) {

    // Spark 0.8-style constructor: master, app name, Spark home, and jars to ship
    val sc = new SparkContext("local", "Simple HDFS App",
      "/home/akhld/mobi/spark-streaming/spark-0.8.0-incubating",
      List("target/scala-2.9.3/simple-project_2.9.3-1.0.jar"))

    // Read the HDFS file line by line and print the first 10 lines
    val textFile = sc.textFile("hdfs://127.0.0.1:54310/akhld/tweet1.json")
    textFile.take(10).foreach(println)

  }
}

If they are growing, then I think you might want to use textFileStream or fileStream (on a StreamingContext), which will take care of processing new files as they arrive.
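Note that each line in the RDD is still raw JSON, so the tweets need parsing. Below is a minimal sketch in plain Scala (no Spark dependency) of pulling the "text" field out of a newline-delimited tweet line; `extractText` and the sample line are hypothetical, and a real job would use a proper JSON library rather than a regex:

```scala
// A minimal sketch of parsing newline-delimited tweet JSON, assuming one
// tweet object per line. `extractText` is a hypothetical helper; a real job
// would use a JSON library instead of a regex.
object TweetParseSketch {
  // Matches "text":"..." allowing escaped characters inside the value.
  private val TextField = """"text"\s*:\s*"((?:\\.|[^"\\])*)"""".r

  def extractText(line: String): Option[String] =
    TextField.findFirstMatchIn(line).map(_.group(1))

  def main(args: Array[String]): Unit = {
    val line = """{"id":1,"text":"hello spark","user":{"screen_name":"soumya"}}"""
    println(extractText(line).getOrElse("<no text>")) // prints: hello spark
    // Inside Spark this would become: textFile.flatMap(extractText)
  }
}
```

With an RDD of lines, applying this per line with flatMap drops any lines that don't parse, leaving an RDD of tweet texts.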

    
-
AkhilDas
CodeBreach.in

Re: Reading Tweets (JSON) in a file into RDD Spark

ssimanta
Thanks Akhil. 

In the above example, are you assuming that there is one tweet per line (i.e., tweets are newline-separated)? 

On an unrelated note, can you send pointers on how to run this standalone example? So far I've only played with the interactive spark-shell and have yet to run a standalone Scala program in cluster mode. 







Re: Reading Tweets (JSON) in a file into RDD Spark

Akhil Das
Yes, Soumya, the file contents are newline-separated (one tweet per line).

You can run that program in 4 steps (assuming you already have Spark/Hadoop up and running):

1. Copy the code and paste it as SimpleApp.scala
2. Create an sbt build file with all the dependencies, pasted below
3. Run sbt package
4. Then run sbt run

simple.sbt

name := "Simple Project"

version := "1.0"

scalaVersion := "2.9.3"

libraryDependencies += "org.apache.spark" %% "spark-streaming" % "0.8.0-incubating"

resolvers ++= Seq("Akka Repository" at "http://repo.akka.io/releases/","Spray Repository" at "http://repo.spray.cc/")


-
AkhilDas
CodeBreach.in