pandas.errors.ParserError: Error tokenizing data. C error: Buffer overflow caught - possible malformed input file. #246

Open
sshivam95 opened this issue Jun 24, 2024 · 0 comments

When training an embedding model on a KG, I get the following traceback:

Reading with pandas.read_csv with sep ** s+ ** ...
Traceback (most recent call last):
  File "/scratch/hpc-prf-dsg/sshivam/.conda/envs/dice/bin/dicee", line 33, in <module>
    sys.exit(load_entry_point('dicee', 'console_scripts', 'dicee')())
  File "/scratch/hpc-prf-dsg/WHALE-output/dice-embeddings/dicee/scripts/run.py", line 137, in main
    Execute(get_default_arguments()).start()
  File "/scratch/hpc-prf-dsg/WHALE-output/dice-embeddings/dicee/executer.py", line 218, in start
    self.load_indexed_data() if self.is_continual_training else self.read_preprocess_index_serialize_data()
  File "/scratch/hpc-prf-dsg/WHALE-output/dice-embeddings/dicee/executer.py", line 88, in read_preprocess_index_serialize_data
    self.knowledge_graph = self.read_or_load_kg()
  File "/scratch/hpc-prf-dsg/WHALE-output/dice-embeddings/dicee/executer.py", line 53, in read_or_load_kg
    kg = KG(dataset_dir=self.args.dataset_dir,
  File "/scratch/hpc-prf-dsg/WHALE-output/dice-embeddings/dicee/knowledge_graph.py", line 74, in __init__
    ReadFromDisk(kg=self).start()
  File "/scratch/hpc-prf-dsg/WHALE-output/dice-embeddings/dicee/read_preprocess_save_load_kg/read_from_disk.py", line 28, in start
    self.kg.raw_train_set = read_from_disk(self.kg.path_single_kg,
  File "/scratch/hpc-prf-dsg/WHALE-output/dice-embeddings/dicee/read_preprocess_save_load_kg/util.py", line 125, in read_from_disk
    return read_with_pandas(data_path, read_only_few, sample_triples_ratio)
  File "/scratch/hpc-prf-dsg/WHALE-output/dice-embeddings/dicee/read_preprocess_save_load_kg/util.py", line 31, in timeit_wrapper
    result = func(*args, **kwargs)
  File "/scratch/hpc-prf-dsg/WHALE-output/dice-embeddings/dicee/read_preprocess_save_load_kg/util.py", line 83, in read_with_pandas
    df = pd.read_csv(data_path,
  File "/scratch/hpc-prf-dsg/sshivam/.conda/envs/dice/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1026, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/scratch/hpc-prf-dsg/sshivam/.conda/envs/dice/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 626, in _read
    return parser.read(nrows)
  File "/scratch/hpc-prf-dsg/sshivam/.conda/envs/dice/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1923, in read
    ) = self._engine.read(  # type: ignore[attr-defined]
  File "/scratch/hpc-prf-dsg/sshivam/.conda/envs/dice/lib/python3.10/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 234, in read
    chunks = self._reader.read_low_memory(nrows)
  File "parsers.pyx", line 838, in pandas._libs.parsers.TextReader.read_low_memory
  File "parsers.pyx", line 905, in pandas._libs.parsers.TextReader._read_rows
  File "parsers.pyx", line 874, in pandas._libs.parsers.TextReader._tokenize_rows
  File "parsers.pyx", line 891, in pandas._libs.parsers.TextReader._check_tokenize_status
  File "parsers.pyx", line 2061, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Buffer overflow caught - possible malformed input file.

Initially, I thought the input file was malformed. However, after passing engine='python' to pandas.read_csv in dicee/read_preprocess_save_load_kg/util.py, the error no longer occurs.
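
For reference, a minimal sketch of the workaround, not the actual dicee code. The helper name read_triples, the column names, and the in-memory sample data are assumptions made for illustration; only the sep value from the log line and the engine='python' argument come from this report.

```python
import io

import pandas as pd


def read_triples(source, engine="python"):
    """Hypothetical helper: read whitespace-separated triples into a DataFrame.

    engine="python" selects pandas' pure-Python parser instead of the
    default C tokenizer that raised the ParserError in the traceback.
    """
    return pd.read_csv(
        source,
        sep=r"\s+",          # same separator as in the log line
        header=None,         # assumed: the file has no header row
        names=["subject", "relation", "object"],  # assumed column names
        dtype=str,
        engine=engine,
    )


# Tiny in-memory stand-in for a KG file (illustrative data only).
sample = io.StringIO("<a> <knows> <b>\n<b> <knows> <c>\n")
df = read_triples(sample)
print(df)
```

The Python engine is generally slower than the C engine on large files, so this is a workaround rather than an explanation of why the C tokenizer fails on this input.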
