Tab delimited csv import and empty columns

Tab delimited csv import and empty columns

Stephen Coy
Hi there,

I’m trying to import a tab delimited file with:

Dataset<Row> catalogData = sparkSession
  .read()
  .option("sep", "\t")
  .option("header", "true")
  .csv(args[0])
  .cache();

This works well, except that any column that is empty is given the value null, whereas I need those values to be literal empty strings.

Is there any option combination that will achieve this?

Thanks,

Steve C

This email contains confidential information of and is the copyright of Infomedia. It must not be forwarded, amended or disclosed without consent of the sender. If you received this message by mistake, please advise the sender and delete all copies. Security of transmission on the internet cannot be guaranteed, could be infected, intercepted, or corrupted and you should ensure you have suitable antivirus protection in place. By sending us your or any third party personal details, you consent to (or confirm you have obtained consent from such third parties) to Infomedia’s privacy policy. http://www.infomedia.com.au/privacy-policy/

Re: Tab delimited csv import and empty columns

German Schiavon Matteo
Hey, 

I understand that the empty values in your CSV are "". If so, try this option:

.option("emptyValue", "\"\"")

Hope it helps 

On Thu, 30 Jul 2020 at 08:49, Stephen Coy <[hidden email]> wrote:

Re: Tab delimited csv import and empty columns

Stephen Coy
That does not work.

This is Spark 3.0 by the way.

I have been looking at the Spark unit tests, and there do not seem to be any that load a CSV text file and verify that an empty string maps to an empty string. I think that is supposed to be the default behaviour, because the “nullValue” option defaults to "".

Thanks anyway

Steve C

On 30 Jul 2020, at 10:01 pm, German Schiavon Matteo <[hidden email]> wrote:



Re: Tab delimited csv import and empty columns

srowen
Try setting nullValue to anything besides the empty string. Because its default is the empty string, empty strings become null by default.

On Fri, Jul 31, 2020 at 3:20 AM Stephen Coy <[hidden email]> wrote:


Re: Tab delimited csv import and empty columns

Vladimir Ryzhov
Would df.na.fill("") do the trick?
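Conceptually, na.fill("") replaces null entries in the DataFrame's string columns after the load. Here is a plain-Java sketch of that behaviour for a single column (a toy model for illustration, not Spark's actual implementation):

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class NaFillSketch {
    // Toy model of df.na.fill("") applied to one string column:
    // nulls become the replacement value, everything else passes through.
    static List<String> fillNulls(List<String> column, String replacement) {
        return column.stream()
                .map(v -> v == null ? replacement : v)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // A column with one empty (null) entry:
        System.out.println(fillNulls(Arrays.asList("a", null, "b"), ""));
    }
}
```

The key point is that this runs as a post-processing step over the already-parsed DataFrame, so it sidesteps the CSV parser's null handling entirely.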

On Fri, Jul 31, 2020 at 8:43 AM Sean Owen <[hidden email]> wrote:


Re: Tab delimited csv import and empty columns

Stephen Coy
In reply to this post by srowen
Hi Sean, German and others,

Setting the “nullValue” option (for parsing CSV at least) seems to be an exercise in futility.

When parsing the file, com.univocity.parsers.common.input.AbstractCharInputReader#getString contains the following logic:

String out;
if (len <= 0) {
  out = nullValue;
} else {
  out = new String(buffer, pos, len);
}

resulting in nullValue being assigned to the column value whenever the field has zero length, as with an empty string.

Later, org.apache.spark.sql.catalyst.csv.UnivocityParser#nullSafeDatum is called on the column value:

if (datum == options.nullValue || datum == null) {
  if (!nullable) {
    throw new RuntimeException(s"null value found but field $name is not nullable.")
  }
  null
} else {
  converter.apply(datum)
}

Therefore, the empty String is first converted to the nullValue, then matched against the nullValue and, bingo, we get the literal null.
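The combined effect can be sketched with a small plain-Java model of the two steps (my own simplification for illustration, not the actual univocity or Spark sources):

```java
public class EmptyColumnDemo {
    // Step 1, modelled on AbstractCharInputReader#getString:
    // a zero-length field is replaced by the configured nullValue.
    static String getString(String field, String nullValue) {
        return (field == null || field.isEmpty()) ? nullValue : field;
    }

    // Step 2, modelled on UnivocityParser#nullSafeDatum:
    // a datum equal to nullValue becomes a literal null.
    static String nullSafeDatum(String datum, String nullValue) {
        return (datum == null || datum.equals(nullValue)) ? null : datum;
    }

    static String parse(String field, String nullValue) {
        return nullSafeDatum(getString(field, nullValue), nullValue);
    }

    public static void main(String[] args) {
        // Whatever nullValue is set to, an empty field still ends up null,
        // because step 1 substitutes the very sentinel that step 2 nulls out.
        System.out.println(parse("", ""));       // null
        System.out.println(parse("", "XYZ"));    // null
        System.out.println(parse("abc", "XYZ")); // abc
    }
}
```

Under this model, changing the “nullValue” option makes no difference to empty fields, which matches the futility observed above.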

For now, the .na.fill("") addition to the code is doing the right thing for me.

Thanks for all the help.


Steve C


On 1 Aug 2020, at 1:40 am, Sean Owen <[hidden email]> wrote:
