How to generate a large dataset in parallel


How to generate a large dataset in parallel

lk_spark
Hi all,
    I want to generate some test data containing about one hundred million rows.
    I created a dataset with ten rows and call df.union on it in a 'for' loop, but this causes the operation to happen only on the driver node.
    How can I do it on the whole cluster?
 
2018-12-14

lk_spark

Re: How to generate a large dataset in parallel

jgp
Do you just want to generate some data in Spark, or do you want to ingest a large dataset from outside Spark? What's the ultimate goal you're pursuing?

jg



Re: Re: How to generate a large dataset in parallel

lk_spark
I just want to generate some data in Spark.
 
2018-12-14
lk_spark

From: Jean Georges Perrin <[hidden email]>
Sent: 2018-12-14 11:10
Subject: Re: How to generate a large dataset in parallel
To: "lk_spark"<[hidden email]>
Cc: "user.spark"<[hidden email]>
 

Re: How to generate a large dataset in parallel

15313776907
In reply to this post by lk_spark

I have the same problem and hope it can be solved here. Thank you.

Re: Re: How to generate a large dataset in parallel

lk_spark
Sorry, what I can do for now is something like this:
 
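// Read the small seed dataset and collapse it into a single partition.
var df5 = spark.read.parquet("/user/devuser/testdata/df1").coalesce(1)
// Union five copies of df5, so each run of this line multiplies the row count by 5.
df5 = df5.union(df5).union(df5).union(df5).union(df5)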
 
2018-12-14
lk_spark

From: 15313776907 <[hidden email]>
Sent: 2018-12-14 16:39
Subject: Re: How to generate a large dataset in parallel
 

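For the question that opened this thread, one possible approach (a minimal sketch, not something proposed by any poster here) is to let Spark create the rows itself with spark.range, which splits an id range across partitions so that each executor builds its own slice. The same trick can multiply an existing ten-row dataset: cross-join it with a distributed range instead of calling union in a loop. The object name, the helper names generate and replicate, the column names name, score and copy_id, and the local[*] master are all assumptions made for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, concat, lit, rand}

object GenerateTestData {

  // Build `numRows` synthetic rows. spark.range creates the "id" column
  // already split into `numPartitions` partitions, so the derived columns
  // are computed on the executors rather than on the driver.
  def generate(spark: SparkSession, numRows: Long, numPartitions: Int): DataFrame =
    spark.range(0L, numRows, 1L, numPartitions)
      .withColumn("name", concat(lit("user_"), col("id").cast("string")))
      .withColumn("score", rand(42L) * 100) // pseudo-random double in [0, 100)

  // Replicate a small seed DataFrame `copies` times without a driver-side
  // loop: every seed row is paired with every value of a distributed range.
  def replicate(spark: SparkSession, seed: DataFrame, copies: Long, numPartitions: Int): DataFrame = {
    val copyIds = spark.range(0L, copies, 1L, numPartitions).withColumnRenamed("id", "copy_id")
    seed.crossJoin(copyIds).drop("copy_id")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("generate-test-data")
      .master("local[*]") // assumption: local run for illustration; omit when using spark-submit
      .getOrCreate()

    // Small default so the sketch runs quickly; pass 100000000 as the first
    // argument to produce the one hundred million rows asked about above.
    val numRows = if (args.nonEmpty) args(0).toLong else 1000L
    val df = generate(spark, numRows, numPartitions = 8)
    println(s"rows = ${df.count()}, partitions = ${df.rdd.getNumPartitions}")

    spark.stop()
  }
}
```

To scale the ten-row dataset from the original post to one hundred million rows, replicate(spark, seedDf, 10000000L, 200) would be the corresponding call, where seedDf stands for that ten-row DataFrame and 200 partitions is an arbitrary starting point to tune for the cluster.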