Scala RDD[MyCaseClass] to Dataset[MyCaseClass] performance

Ivan Petrov
Hi!
I'm trying to understand the cost of RDD-to-Dataset conversion.
It takes me 60 minutes to create an RDD[MyCaseClass] with 500,000,000,000 records.
It takes around 15 minutes to convert them to a Dataset[MyCaseClass].
The schema of MyCaseClass is:
str01: String,
str02: String,
str03: String,
str04: String,
long01: Long,
long02: Long,
double01: Double,
map: Map[String, Double]

What can I do to make it run faster?
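
For reference, the schema listed in the question could be written as the following case class. This is only a sketch: the field names and types are taken from the post, while the wrapper object `SchemaSketch` and the sample values are made up for illustration.

```scala
// Field names and types come from the post; everything else is illustrative.
final case class MyCaseClass(
  str01: String,
  str02: String,
  str03: String,
  str04: String,
  long01: Long,
  long02: Long,
  double01: Double,
  map: Map[String, Double]
)

object SchemaSketch {
  // A hypothetical sample record, just to show the shape of one row.
  val sample: MyCaseClass =
    MyCaseClass("a", "b", "c", "d", 1L, 2L, 3.0, Map("k" -> 0.5))

  def main(args: Array[String]): Unit = println(sample)
}
```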

Re: Scala RDD[MyCaseClass] to Dataset[MyCaseClass] performance

srowen
Wouldn't toDS() do this without conversion?
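
A minimal sketch of what a toDS() call looks like end to end, assuming a local SparkSession and a cut-down hypothetical record type `Rec` (three of the fields from the schema in the question). The names `Rec` and `ToDsSketch` are not from the thread. toDS() itself is lazy and only wraps the RDD in a query plan; the objects are encoded into Spark's internal row format when an action runs, so that is where any conversion time would be expected to show up.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical, cut-down version of the record type from the question.
// Defined at top level so Spark can derive an encoder for it.
final case class Rec(str01: String, long01: Long, map: Map[String, Double])

object ToDsSketch {
  def run(): Long = {
    val spark = SparkSession.builder()
      .master("local[2]")          // assumption: local mode, for illustration only
      .appName("toDS-sketch")
      .getOrCreate()
    import spark.implicits._       // brings toDS() and the case-class encoder into scope

    val rdd = spark.sparkContext.parallelize(Seq(
      Rec("a", 1L, Map("x" -> 1.0)),
      Rec("b", 2L, Map("y" -> 2.0))
    ))

    // Lazy: this only builds a logical plan on top of the RDD.
    val ds: Dataset[Rec] = rdd.toDS()

    // Prints the physical plan; it should include a step that serializes
    // the JVM objects into Spark's internal row format.
    ds.explain()

    // The action triggers the actual job.
    val n = ds.count()
    spark.stop()
    n
  }

  def main(args: Array[String]): Unit = println(run())
}
```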

On Mon, Jul 13, 2020 at 5:25 PM Ivan Petrov <[hidden email]> wrote:

>
> Hi!
> I'm trying to understand the cost of RDD-to-Dataset conversion.
> It takes me 60 minutes to create an RDD[MyCaseClass] with 500,000,000,000 records.
> It takes around 15 minutes to convert them to a Dataset[MyCaseClass].
> The schema of MyCaseClass is:
> str01: String,
> str02: String,
> str03: String,
> str04: String,
> long01: Long,
> long02: Long,
> double01: Double,
> map: Map[String, Double]
>
> What can I do to make it run faster?

Re: Scala RDD[MyCaseClass] to Dataset[MyCaseClass] performance

Ivan Petrov
What do you mean by "without conversion"?

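import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Dataset
import spark.implicits._ // assumes a SparkSession named `spark` is in scope; needed for toDS()
def flatten(rdd: RDD[NestedStructure]): Dataset[MyCaseClass] = {
    // flatten(nestedElement) is presumably a separate per-element overload returning List[MyCaseClass]
    rdd.flatMap { nestedElement => flatten(nestedElement) }
      .toDS()
}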
Can it be better?
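
A self-contained variant of the snippet above, as a sketch only. `Nested` and `Flat` are hypothetical stand-ins for NestedStructure and MyCaseClass, and the local SparkSession is an assumption for illustration. It shows the two flatten overloads side by side: one per element, one for the whole RDD.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical stand-ins for the types used in the post.
final case class Flat(key: String, value: Double)
final case class Nested(key: String, values: List[Double])

object FlattenSketch {
  // Per-element overload: one nested record becomes several flat ones.
  def flatten(n: Nested): List[Flat] = n.values.map(v => Flat(n.key, v))

  // RDD overload, same shape as the function in the post.
  def flatten(rdd: RDD[Nested])(implicit spark: SparkSession): Dataset[Flat] = {
    import spark.implicits._ // provides toDS() and the encoder for Flat
    rdd.flatMap(n => flatten(n)).toDS()
  }

  def run(): Long = {
    implicit val spark: SparkSession = SparkSession.builder()
      .master("local[2]")    // assumption: local mode, for illustration only
      .appName("flatten-sketch")
      .getOrCreate()

    val rdd = spark.sparkContext.parallelize(Seq(
      Nested("a", List(1.0, 2.0)),
      Nested("b", List(3.0))
    ))

    // count() is the action that actually runs flatMap and the encoding step.
    val n = flatten(rdd).count()
    spark.stop()
    n
  }

  def main(args: Array[String]): Unit = println(run())
}
```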

On Tue, Jul 14, 2020 at 01:13, Sean Owen <[hidden email]> wrote:
> Wouldn't toDS() do this without conversion?
