parition by multiple columns/keys

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
4 messages Options
Reply | Threaded
Open this post in threaded view
|

parition by multiple columns/keys

Imran Rajjad
Hi,

I have a set of rows that are a result of a groupBy(col1,col2,col3).count().

Is it possible to map rows belong to unique combination inside an iterator?

e.g

col1 col2 col3
a      1      a1
a      1      a2
b      2      b1
b      2      b2

how can I separate rows with col1 and col2 = (a,1) and (b,2)?

regards,
Imran

--
I.R
Reply | Threaded
Open this post in threaded view
|

Re: parition by multiple columns/keys

ayan guha
How or what you want to achieve? Ie are planning to do some aggregation on group by c1,c2?

On Wed, 18 Oct 2017 at 4:13 pm, Imran Rajjad <[hidden email]> wrote:
Hi,

I have a set of rows that are a result of a groupBy(col1,col2,col3).count().

Is it possible to map rows belong to unique combination inside an iterator?

e.g

col1 col2 col3
a      1      a1
a      1      a2
b      2      b1
b      2      b2

how can I separate rows with col1 and col2 = (a,1) and (b,2)?

regards,
Imran

--
I.R
--
Best Regards,
Ayan Guha
Reply | Threaded
Open this post in threaded view
|

Re: parition by multiple columns/keys

Imran Rajjad
yes..I think I figured out something like below

Serialized Java Class
-----------------
public class MyMapPartition implements Serializable,MapPartitionsFunction{
 @Override
 public Iterator call(Iterator iter) throws Exception {
  ArrayList<Row> list = new ArrayList<Row>();
  // ArrayNode array = mapper.createArrayNode();
  Row row=null;
  System.out.println("--------");
  while(iter.hasNext()){
   
   row=(Row) iter.next();
   System.out.println(row);
   list.add(row);
  }
  System.out.println(">>>>");
  return list.iterator();
 }
}

Unit Test
-----------
JavaRDD<Row> rdd = jsc.parallelize(Arrays.asList(RowFactory.create(11L,21L,1L)
              ,RowFactory.create(11L,22L,2L)
              ,RowFactory.create(11L,22L,1L)
              ,RowFactory.create(12L,23L,3L)
              ,RowFactory.create(12L,24L,3L)
              ,RowFactory.create(12L,22L,4L)
              ,RowFactory.create(13L,22L,4L)
              ,RowFactory.create(14L,22L,4L)
    ));
  StructType structType = new StructType();
  structType = structType.add("a", DataTypes.LongType, false)  
        .add("b", DataTypes.LongType, false)  
        .add("c", DataTypes.LongType, false);  
  ExpressionEncoder<Row> encoder = RowEncoder.apply(structType);
  

  Dataset<Row> ds = spark.createDataFrame(rdd, encoder.schema());
  ds.show();
  
  MyMapPartition mp = new MyMapPartition ();
//Iterator<Row>
  //.repartition(new Column("a"),new Column("b"))
   Dataset<Row> grouped = ds.groupBy("a", "b","c")
    .count()
    .repartition(new Column("a"),new Column("b"))
    .mapPartitions(mp,encoder);
    
  grouped.count();

---------------

output
--------
--------
[12,23,3,1]
>>>>
--------
[14,22,4,1]
>>>>
--------
[12,24,3,1]
>>>>
--------
[12,22,4,1]
>>>>
--------
[11,22,1,1]
[11,22,2,1]
>>>>
--------
[11,21,1,1]
>>>>
--------
[13,22,4,1]
>>>>


On Wed, Oct 18, 2017 at 10:29 AM, ayan guha <[hidden email]> wrote:
How or what you want to achieve? Ie are planning to do some aggregation on group by c1,c2?

On Wed, 18 Oct 2017 at 4:13 pm, Imran Rajjad <[hidden email]> wrote:
Hi,

I have a set of rows that are a result of a groupBy(col1,col2,col3).count().

Is it possible to map rows belong to unique combination inside an iterator?

e.g

col1 col2 col3
a      1      a1
a      1      a2
b      2      b1
b      2      b2

how can I separate rows with col1 and col2 = (a,1) and (b,2)?

regards,
Imran

--
I.R
--
Best Regards,
Ayan Guha



--
I.R
Reply | Threaded
Open this post in threaded view
|

Re: parition by multiple columns/keys

Imran Rajjad
strangely this is working only for very small dataset of rows.. for very large datasets apparently the partitioning is not working. is there a limit to the number of columns or rows when repartitioning according to multiple columns?

regards,
Imran

On Wed, Oct 18, 2017 at 11:00 AM, Imran Rajjad <[hidden email]> wrote:
yes..I think I figured out something like below

Serialized Java Class
-----------------
public class MyMapPartition implements Serializable,MapPartitionsFunction{
 @Override
 public Iterator call(Iterator iter) throws Exception {
  ArrayList<Row> list = new ArrayList<Row>();
  // ArrayNode array = mapper.createArrayNode();
  Row row=null;
  System.out.println("--------");
  while(iter.hasNext()){
   
   row=(Row) iter.next();
   System.out.println(row);
   list.add(row);
  }
  System.out.println(">>>>");
  return list.iterator();
 }
}

Unit Test
-----------
JavaRDD<Row> rdd = jsc.parallelize(Arrays.asList(RowFactory.create(11L,21L,1L)
              ,RowFactory.create(11L,22L,2L)
              ,RowFactory.create(11L,22L,1L)
              ,RowFactory.create(12L,23L,3L)
              ,RowFactory.create(12L,24L,3L)
              ,RowFactory.create(12L,22L,4L)
              ,RowFactory.create(13L,22L,4L)
              ,RowFactory.create(14L,22L,4L)
    ));
  StructType structType = new StructType();
  structType = structType.add("a", DataTypes.LongType, false)  
        .add("b", DataTypes.LongType, false)  
        .add("c", DataTypes.LongType, false);  
  ExpressionEncoder<Row> encoder = RowEncoder.apply(structType);
  

  Dataset<Row> ds = spark.createDataFrame(rdd, encoder.schema());
  ds.show();
  
  MyMapPartition mp = new MyMapPartition ();
//Iterator<Row>
  //.repartition(new Column("a"),new Column("b"))
   Dataset<Row> grouped = ds.groupBy("a", "b","c")
    .count()
    .repartition(new Column("a"),new Column("b"))
    .mapPartitions(mp,encoder);
    
  grouped.count();

---------------

output
--------
--------
[12,23,3,1]
>>>>
--------
[14,22,4,1]
>>>>
--------
[12,24,3,1]
>>>>
--------
[12,22,4,1]
>>>>
--------
[11,22,1,1]
[11,22,2,1]
>>>>
--------
[11,21,1,1]
>>>>
--------
[13,22,4,1]
>>>>


On Wed, Oct 18, 2017 at 10:29 AM, ayan guha <[hidden email]> wrote:
How or what you want to achieve? Ie are planning to do some aggregation on group by c1,c2?

On Wed, 18 Oct 2017 at 4:13 pm, Imran Rajjad <[hidden email]> wrote:
Hi,

I have a set of rows that are a result of a groupBy(col1,col2,col3).count().

Is it possible to map rows belong to unique combination inside an iterator?

e.g

col1 col2 col3
a      1      a1
a      1      a2
b      2      b1
b      2      b2

how can I separate rows with col1 and col2 = (a,1) and (b,2)?

regards,
Imran

--
I.R
--
Best Regards,
Ayan Guha



--
I.R



--
I.R