Ability to have CountVectorizerModel vocab as empty

Ability to have CountVectorizerModel vocab as empty

purijatin
Hello,

This is wrt https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizer.scala#L244

require(vocab.length > 0, "The vocabulary size should be > 0. Lower minDF as necessary.")

Currently, if `CountVectorizer` is trained on an empty dataset, it fails with the exception below. But it is a perfectly valid use case to send it empty data (or to have minDF filter everything out). HashingTF works fine in such scenarios; CountVectorizer doesn't.

Can we remove this constraint? Happy to send a pull request.
java.lang.IllegalArgumentException: requirement failed: The vocabulary size should be > 0. Lower minDF as necessary.
	at scala.Predef$.require(Predef.scala:224)
	at org.apache.spark.ml.feature.CountVectorizer.fit(CountVectorizer.scala:236)
	at org.apache.spark.ml.feature.CountVectorizer.fit(CountVectorizer.scala:149)
	at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:153)
	at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:149)
	at scala.collection.Iterator$class.foreach(Iterator.scala:891)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
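
A minimal reproduction of the difference, as a sketch (assumes a local SparkSession; the column names and the empty dataset are just for illustration):

import org.apache.spark.ml.feature.{CountVectorizer, HashingTF}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("empty-vocab-repro").getOrCreate()
import spark.implicits._

// An empty dataset whose column already has the expected Array[String] type.
val empty = Seq.empty[Array[String]].toDF("words")

// HashingTF is stateless, so it transforms the empty dataset without error.
new HashingTF().setInputCol("words").setOutputCol("tf").transform(empty).show()

// CountVectorizer learns its vocabulary at fit time; with no rows (or with
// minDF filtering every term out) the vocabulary is empty and fit() throws
// the IllegalArgumentException shown above.
new CountVectorizer().setInputCol("words").setOutputCol("cv").fit(empty)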

Re: Ability to have CountVectorizerModel vocab as empty

srowen
I think that's true. You're welcome to open a pull request / JIRA to
remove that requirement.
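
(Aside: until the requirement is relaxed, a caller-side workaround is possible. A minimal sketch follows, where the helper name fitOrFallback and the placeholder token "__empty__" are made up for illustration, and the estimator's input/output columns are assumed to be set already.)

import org.apache.spark.ml.feature.{CountVectorizer, CountVectorizerModel}
import org.apache.spark.sql.DataFrame

// Try the normal fit; if the vocabulary turns out empty and fit() fails,
// fall back to a model built directly from a one-token placeholder vocabulary
// so that a surrounding Pipeline can still be assembled.
def fitOrFallback(cv: CountVectorizer, df: DataFrame): CountVectorizerModel =
  try {
    cv.fit(df)
  } catch {
    case _: IllegalArgumentException =>
      new CountVectorizerModel(Array("__empty__"))
        .setInputCol(cv.getInputCol)
        .setOutputCol(cv.getOutputCol)
  }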

Re: Ability to have CountVectorizerModel vocab as empty

purijatin
Thanks Sean for the quick response.


Will send a pull request shortly.

Regards,
Jatin
