Need to convert Dataset<Row> to HashMap<String, String>

Need to convert Dataset<Row> to HashMap<String, String>

rishmanisation
I am writing a data-profiling application that needs to iterate over a large
.gz file (imported as a Dataset<Row>). For each column I build a hashmap whose
keys are the distinct values in that column and whose values are the number of
times each occurs. There is one hashmap per column, and they are all added to
a JSON at the end.
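For context, the surrounding structure is roughly this (simplified sketch;
Gson is just an illustrative serializer here, not necessarily what I actually
use):

import java.util.HashMap;
import java.util.Map;

// One frequency map per column, keyed by column name; the whole
// structure is serialized to JSON once all columns are profiled.
Map<String, Map<String, String>> profile = new HashMap<>();
// ... profile.put(columnName, myHashMap) for each column ...
String json = new com.google.gson.Gson().toJson(profile);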

For now, I am using the following logic to generate the hashmap for a
column:

Dataset<Row> freq = df
        .groupBy(columnName)
        .count();

HashMap<String, String> myHashMap = new HashMap<>();

Iterator<Row> rowIterator = freq.toLocalIterator();
while (rowIterator.hasNext()) {
    Row currRow = rowIterator.next();
    // Row.toString() yields "[value,count]"; strip the brackets and split
    // (note this breaks if the value itself contains a comma).
    String rowString = currRow.toString();
    String[] contents = rowString.substring(1, rowString.length() - 1).split(",");
    // Store the value's frequency as a percentage of all rows.
    double percent = Long.valueOf(contents[1]) * 100.0 / numOfRows;
    myHashMap.put(contents[0], Double.toString(percent));
}

I have also tried converting to an RDD and using the collectAsMap() function
(sketched below), but both approaches take a very long time (about 5 minutes
per column, where each column has approx. 30 million rows). Is there a more
efficient way to achieve the same result?
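For reference, the RDD variant was along these lines (paraphrased, not the
exact code):

// Collect the grouped counts into a driver-side map via a pair RDD.
java.util.Map<String, Long> counts = freq.javaRDD()
        .mapToPair(row -> new scala.Tuple2<>(String.valueOf(row.get(0)),
                row.getLong(1)))
        .collectAsMap();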



Re: Need to convert Dataset<Row> to HashMap<String, String>

Alessandro Solimando
Hi,
as a first attempt I would cache "freq", to make sure the dataset is not re-loaded at each iteration later on.

Btw, what's the original data format you are importing from?

I also suspect that an appropriate case class, rather than Row, would help, instead of converting each row to a String and parsing it "manually".
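Even without a case class, reading the Row fields directly would avoid the
string round-trip; an untested sketch, assuming the grouped column holds
strings:

while (rowIterator.hasNext()) {
    Row currRow = rowIterator.next();
    // groupBy(col).count() yields two columns: the value and its count.
    String value = currRow.isNullAt(0) ? null : currRow.getString(0);
    long count = currRow.getLong(1);
    myHashMap.put(value, Double.toString(count * 100.0 / numOfRows));
}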

Hth,
Alessandro

Re: Need to convert Dataset<Row> to HashMap<String, String>

rishmanisation
Thanks for the response! I'm not sure caching 'freq' would make sense, since
the file has multiple columns and 'freq' is different for each column.

Original data format is .gz (gzip).
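For context, the load is roughly this (assuming a gzipped CSV here; the real
code differs slightly):

// Spark decompresses .gz transparently on read, but gzip is not
// splittable, so the file initially lands in a single partition.
Dataset<Row> df = spark.read()
        .option("header", "true")  // illustrative; depends on the file
        .csv("path/to/file.gz");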

I am a newbie to Spark, so could you please give a few more details on what an
appropriate case class would look like here?

Thanks!



Re: Need to convert Dataset<Row> to HashMap<String, String>

Alessandro Solimando
Hi,
sorry, indeed you have to cache the original dataset before the groupBy (otherwise it is re-loaded from disk each time).
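A minimal sketch, assuming "df" is the parsed input from your snippet:

// Cache the parsed input once; every per-column groupBy then reuses the
// in-memory data instead of re-reading and decompressing the gzip file.
df.cache();

for (String columnName : df.columns()) {
    Dataset<Row> freq = df.groupBy(columnName).count();
    // ... build the per-column hashmap as before ...
}

df.unpersist();  // release the cached blocks when done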


Best regards,
Alessandro

Re: Need to convert Dataset<Row> to HashMap<String, String>

rishmanisation
Thanks for the help so far. I tried caching, but the operation still seems to
take forever. Any tips on how I can speed it up?

Also, I am not sure a case class would work, since different files have
different structures (I am parsing a 1 GB file right now, but there are a few
other files I also need to run this on).


