[Spark Core] Vectorizing very high-dimensional data sourced in long format


Daniel Chalef
Hello,

I have a very large long-format DataFrame (several billion rows) that I'd like to pivot and vectorize (using VectorAssembler), with the aim of then reducing dimensionality using something akin to TF-IDF. Once pivoted, the DataFrame will have ~130 million columns.

The source long-format schema looks as follows:

root
 |-- entity_id: long (nullable = false)
 |-- attribute_id: long (nullable = false)
 |-- event_count: integer (nullable = true)

Pivoting as follows fails, exhausting both executor and driver memory. I am unsure whether increasing memory limits would help here, as my sense is that pivoting and then using a VectorAssembler isn't the right approach to this problem.

from pyspark.sql import functions as F

# Pivot the long frame into one column per attribute_id (~130 million columns).
wide_frame = (
    long_frame.groupBy("entity_id")
    .pivot("attribute_id")
    .agg(F.first("event_count"))
)

Are there other Spark patterns that I should attempt in order to achieve my end goal of a vector of attributes for every entity?

Thanks, Daniel

Re: [Spark Core] Vectorizing very high-dimensional data sourced in long format

Patrick McCarthy-2
That's a very large vector. Is it sparse? Perhaps you'd have better luck performing an aggregate instead of a pivot, and assembling the vector using a UDF.
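
Something along these lines might work (just a sketch, untested at this scale; it assumes attribute_id values are already dense integer indices in [0, NUM_ATTRIBUTES), and NUM_ATTRIBUTES is a placeholder for the true vector size):

from pyspark.sql import functions as F
from pyspark.ml.linalg import SparseVector, VectorUDT

NUM_ATTRIBUTES = 130_000_000  # hypothetical: max attribute_id + 1

@F.udf(returnType=VectorUDT())
def to_sparse_vector(pairs):
    # SparseVector wants indices in ascending order, so sort by attribute_id.
    entries = sorted(
        (int(p["attribute_id"]), float(p["event_count"]))
        for p in pairs
        if p["event_count"] is not None
    )
    return SparseVector(
        NUM_ATTRIBUTES,
        [i for i, _ in entries],
        [v for _, v in entries],
    )

vectors = (
    long_frame.groupBy("entity_id")
    .agg(F.collect_list(F.struct("attribute_id", "event_count")).alias("pairs"))
    .select("entity_id", to_sparse_vector("pairs").alias("features"))
)

The upside is that nothing wider than one entity's non-zero attributes is ever materialized, rather than a 130-million-column row.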

--
Patrick McCarthy
Senior Data Scientist, Machine Learning Engineering
Dstillery
470 Park Ave South, 17th Floor, NYC 10016


Re: [Spark Core] Vectorizing very high-dimensional data sourced in long format

Daniel Chalef
Yes, the resulting matrix would be sparse. Thanks for the suggestion; I will explore ways of doing this using an agg and a UDF.
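
In case it helps anyone else, the rough plan after assembling the sparse vectors is to apply something like Spark ML's IDF for the TF-IDF-style weighting (sketch only; it assumes a vectors DataFrame with a "features" column as in your sketch):

from pyspark.ml.feature import IDF

# Treat per-entity attribute counts as term frequencies and weight them by IDF.
idf = IDF(inputCol="features", outputCol="tfidf_features")
idf_model = idf.fit(vectors)
weighted = idf_model.transform(vectors)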



Re: [Spark Core] Vectorizing very high-dimensional data sourced in long format

Kevin Pis
When you solve this Patrick's way, it may also help to salt the entity_id column with random numbers and aggregate in two stages, which can avoid the errors you saw (exhausting executor and driver memory).
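
A rough sketch of that two-stage, salted aggregation (the salt column and NUM_SALTS fan-out are just illustrative and would need tuning):

from pyspark.sql import functions as F

NUM_SALTS = 32  # hypothetical fan-out; tune to the skew in entity_id

# Stage 1: aggregate by (entity_id, salt) so no single task has to collect
# every row for a hot entity_id at once.
partial = (
    long_frame.withColumn("salt", (F.rand() * NUM_SALTS).cast("int"))
    .groupBy("entity_id", "salt")
    .agg(F.collect_list(F.struct("attribute_id", "event_count")).alias("pairs"))
)

# Stage 2: merge the per-salt partial lists back into one list per entity,
# which can then go through the sparse-vector UDF from Patrick's reply.
merged = (
    partial.groupBy("entity_id")
    .agg(F.flatten(F.collect_list("pairs")).alias("pairs"))
)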
