Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature: CSV options do not fulfil our needs #16356

Open
rad-pat opened this issue Aug 30, 2024 · 2 comments
Open

Feature: CSV options do not fulfil our needs #16356

rad-pat opened this issue Aug 30, 2024 · 2 comments
Assignees
Labels
C-feature Category: feature

Comments

@rad-pat
Copy link

rad-pat commented Aug 30, 2024

We are migrating over from Postgres/Greenplum where, when importing from CSV file, we are able to specify that a blank entry represent null, but a blank quoted string is an empty string. I cannot replicate this behaviour with Databend, but would very much like to. I have tried many combinations of the EMPTY_FIELD_AS and NULL_DISPLAY but I cannot get the same as the Postgres import.
The ability to differentiate between blank entry and empty string would be very useful to us.

CSV Data:

some,value,,"",\N
some,other,,"",\N

Postgres Import

CREATE TABLE t_csv (
	col1 text null,
	col2 text null,
	"empty" text null,
	quoted text null,
	null_repr text null
);

COPY t_csv
FROM '/home/plaid/Downloads/test.csv'
WITH (
	FORMAT CSV,
	DELIMITER ',',
	QUOTE '"',
	ESCAPE '"',
	NULL ''
);

select * from t_csv;

Result

col1|col2 |empty |quoted|null_repr|
----+-----+------+------+---------+
some|value|<null>|      |\N       |
some|other|<null>|      |\N       |

Databend Import

CREATE TABLE t_csv (
	col1 string null,
	col2 string null,
	empty string null,
	quoted string null,
	null_repr string null
);
COPY INTO t_csv
FROM 'gcs://<snip>/test.csv'
CONNECTION = (
	CREDENTIAL='<snip>'
)
--PATTERN='test.csv'
FILE_FORMAT = (
	TYPE = CSV,
	FIELD_DELIMITER = ',', 
	QUOTE = '"', 
	ESCAPE = '\\', 
	NULL_DISPLAY = '\N', 
	EMPTY_FIELD_AS = string, 
	SKIP_HEADER = 0,
	COMPRESSION = AUTO 
)
FORCE = TRUE;

Result

"some","value",null,"","\N"
"some","other",null,"","\N"
col1|col2 |empty|quoted|null_repr|
----+-----+-----+------+---------+
some|value|     |      |<null>   |
some|other|     |      |<null>   |
@rad-pat rad-pat added the C-feature Category: feature label Aug 30, 2024
@youngsofun
Copy link
Member

youngsofun commented Aug 31, 2024

@rad-pat thanks for feedback, it is a known problem.

  1. The CSV spec doesn’t differentiate them
  2. the author of rust-csv, which databend depends on, sticks to the spec Handling null vs empty strings BurntSushi/rust-csv#114
  3. spark do not distinguish them after 2.0.1, https://mrpowers.medium.com/sparks-treatment-of-empty-strings-and-null-values-in-csv-files-80748893451f (it is interesting that the article state that spark <=2.0.0 read blank/missing as empty string and the quoted one as null, reverse to what we expected)

a workarounds is to dump the csv with special strings for null instead of "".

we may consider to support distinguish them too, recently python 13 seems to support it with some options, python/cpython#113732

BTW, why do you have to distinguish null and empty string in your application?

@inviscid
Copy link

We receive CSVs from upstream customer systems that we have no control over. For the most part NULL is the correct default. However, there are real world situations where a blank string has a different meaning than null.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-feature Category: feature
Projects
None yet
Development

No branches or pull requests

3 participants