Fix zipfile.BadZipFile error when reading .xlsx files. #56

Open. 1 commit to be merged into base: main

craigastill
Contributor

Use the name of the .xlsx file when opening the workbook, because attempting to read the io.TextIOWrapper instance raises the following error:

zipfile.BadZipFile: File is not a zip file when loading an .xlsx file

Fixes #54.
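
A minimal sketch of the symptom described above, using only the standard library. It assumes that the handle is opened in text mode (as an io.TextIOWrapper) and that openpyxl hands whatever it receives to zipfile.ZipFile, which is what the traceback further down this thread shows. The archive contents and the helper names `make_sample_archive` and `try_open` are made up for illustration.

```python
import tempfile
import zipfile
from pathlib import Path


def make_sample_archive(path: Path) -> None:
    """Write a tiny ZIP archive standing in for an .xlsx (itself a ZIP container)."""
    with zipfile.ZipFile(path, "w") as archive:
        archive.writestr("xl/workbook.xml", "<workbook/>")


def try_open(source) -> str:
    """Return 'ok' if zipfile can read the source, else the exception class name."""
    try:
        with zipfile.ZipFile(source, "r") as archive:
            archive.namelist()
        return "ok"
    except zipfile.BadZipFile as exc:
        return type(exc).__name__


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.xlsx"
    make_sample_archive(sample)

    # Text-mode handle, like the io.TextIOWrapper mentioned above.
    with open(sample, "r", encoding="utf-8", errors="surrogateescape") as text_handle:
        text_result = try_open(text_handle)

    # Binary-mode handle to the same file.
    with open(sample, "rb") as binary_handle:
        binary_result = try_open(binary_handle)

    # A path, which is what passing `.name` amounts to for a local file.
    path_result = try_open(sample)

print(text_result, binary_result, path_result)
```

Only the text-mode handle is rejected; the binary handle and the path both open, which is consistent with the `.name` approach working for local files.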

craigastill commented May 18, 2023

My quick .name hack works for local files, but when running meltano invoke tap-<custom_tap> --dev to extract the same file from an S3 bucket it throws: FileNotFoundError: [Errno 2] No such file or directory: <sub-folder>/<file>.xlsx.

Edit: redacted output:

  INFO Using supplied catalog /path/to/meltano_project/.meltano/run/tap-custom-tap/tap.properties.json.
  INFO Processing 1 selected streams from Catalog
  INFO Syncing stream:account_transactions
  {"type": "SCHEMA", "stream": "account_transactions", "schema": {"properties": {"date": {"type": ["null", "string"]}, "contact": {"type": ["null", "string"]}, "description": {"type": ["null", "string"]}, "invoice_number": {"type": ["null", "string"]}, "reference": {"type": ["null", "string"]}, "debit_gbp": {"type": ["null", "string"]}, "credit_gbp": {"type": ["null", "string"]}, "gross_gbp": {"type": ["null", "string"]}, "net_gbp": {"type": ["null", "string"]}, "vat_gbp": {"type": ["null", "string"]}, "account_code": {"type": ["null", "integer"]}, "account": {"type": ["null", "string"]}, "account_type": {"type": ["null", "string"]}, "revenue_type": {"type": ["null", "string"]}, "source": {"type": ["null", "string"]}, "contact_group": {"type": ["null", "string"]}, "debit": {"type": ["null", "string"]}, "credit": {"type": ["null", "string"]}, "gross": {"type": ["null", "string"]}, "net": {"type": ["null", "string"]}, "vat": {"type": ["null", "string"]}, "vat_rate": {"type": ["null", "integer"]}, "vat_rate_name": {"type": ["null", "string"]}, "region": {"type": ["null", "string"]}, "related_account": {"type": ["null", "string"]}, "_smart_source_bucket": {"type": "string"}, "_smart_source_file": {"type": "string"}, "_smart_source_lineno": {"type": "integer"}}, "selected": true, "type": "object"}, "key_properties": []}
  INFO Loading cached SSO token for default
  INFO Found 2 files.
  INFO Checking 2 resolved objects for any that match regular expression "account_transactions/.*xlsx$" and were modified since 1970-01-01 00:00:00+00:00
  INFO Processing 1 resolved objects that met our criteria. Enable debug verbosity logging for more details.
  INFO Syncing file "account_transactions/issue_52_bad_sample_file.xlsx".
  CRITICAL [Errno 2] No such file or directory: 'account_transactions/issue_52_bad_sample_file.xlsx'
  Traceback (most recent call last):
    File "/path/to/meltano_project/.meltano/extractors/tap-custom-tap/venv/bin/tap-spreadsheets-anywhere", line 8, in <module>
      sys.exit(main())
               ^^^^^^
    File "/path/to/meltano_project/.meltano/extractors/tap-custom-tap/venv/lib/python3.11/site-packages/singer/utils.py", line 235, in wrapped
      return fnc(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^
    File "/path/to/meltano_project/.meltano/extractors/tap-custom-tap/venv/lib/python3.11/site-packages/tap_spreadsheets_anywhere/__init__.py", line 162, in main
      sync(tables_config, args.state, catalog)
    File "/path/to/meltano_project/.meltano/extractors/tap-custom-tap/venv/lib/python3.11/site-packages/tap_spreadsheets_anywhere/__init__.py", line 117, in sync
      records_streamed += file_utils.write_file(t_file['key'], table_spec, merged_schema, max_records=max_records_per_run-records_streamed)
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/path/to/meltano_project/.meltano/extractors/tap-custom-tap/venv/lib/python3.11/site-packages/tap_spreadsheets_anywhere/file_utils.py", line 46, in write_file
      iterator = tap_spreadsheets_anywhere.format_handler.get_row_iterator(table_spec, target_uri)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/path/to/meltano_project/.meltano/extractors/tap-custom-tap/venv/lib/python3.11/site-packages/tap_spreadsheets_anywhere/format_handler.py", line 164, in get_row_iterator
      iterator = tap_spreadsheets_anywhere.excel_handler.get_row_iterator(table_spec, reader)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/path/to/meltano_project/.meltano/extractors/tap-custom-tap/venv/lib/python3.11/site-packages/tap_spreadsheets_anywhere/excel_handler.py", line 72, in get_row_iterator
      workbook = openpyxl.load_workbook(file_handle.name, read_only=True)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/path/to/meltano_project/.meltano/extractors/tap-custom-tap/venv/lib/python3.11/site-packages/openpyxl/reader/excel.py", line 344, in load_workbook
      reader = ExcelReader(filename, read_only, keep_vba,
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/path/to/meltano_project/.meltano/extractors/tap-custom-tap/venv/lib/python3.11/site-packages/openpyxl/reader/excel.py", line 123, in __init__
      self.archive = _validate_archive(fn)
                     ^^^^^^^^^^^^^^^^^^^^^
    File "/path/to/meltano_project/.meltano/extractors/tap-custom-tap/venv/lib/python3.11/site-packages/openpyxl/reader/excel.py", line 95, in _validate_archive
      archive = ZipFile(filename, 'r')
                ^^^^^^^^^^^^^^^^^^^^^^
    File "/usr/local/Cellar/python@3.11/3.11.3/Frameworks/Python.framework/Versions/3.11/lib/python3.11/zipfile.py", line 1283, in __init__
      self.fp = io.open(file, filemode)
                ^^^^^^^^^^^^^^^^^^^^^^^
  FileNotFoundError: [Errno 2] No such file or directory: 'account_transactions/issue_52_bad_sample_file.xlsx'

NOTE: Without this change, I get zipfile.BadZipFile: File is not a zip file in both the local and S3 scenarios.
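
One possible direction that avoids the local path lookup, sketched here under assumptions and not what this PR does: copy the raw bytes behind the text handle into an in-memory buffer and open that instead. The helper `as_binary_stream` is hypothetical, and it assumes the handle exposes its underlying binary stream as `.buffer` (as io.TextIOWrapper does) and has not been read from yet. Whether the tap's S3 handle behaves this way has not been checked here. The sketch uses zipfile as a stand-in for openpyxl, whose load_workbook documents accepting a file-like object opened in binary mode.

```python
import io
import tempfile
import zipfile
from pathlib import Path


def as_binary_stream(text_handle) -> io.BytesIO:
    """Copy the bytes behind a text-mode handle into a seekable in-memory buffer.

    Assumption: the handle has a `.buffer` attribute holding the underlying
    binary stream and nothing has been read from it yet.
    """
    return io.BytesIO(text_handle.buffer.read())


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.xlsx"
    # Tiny ZIP archive standing in for a real workbook.
    with zipfile.ZipFile(sample, "w") as archive:
        archive.writestr("xl/workbook.xml", "<workbook/>")

    with open(sample, "r", encoding="utf-8", errors="surrogateescape") as text_handle:
        stream = as_binary_stream(text_handle)

# The buffer no longer depends on the file or on its name.
with zipfile.ZipFile(stream, "r") as archive:
    names = archive.namelist()

print(names)
```

The trade-off is that the whole file is held in memory, which may not suit large workbooks.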

@craigastill
Contributor Author

Diving in with pdb: with the code in main and my sample file in S3, this hits the exception in the built-in zipfile module (https://github.com/python/cpython/blob/3.11/Lib/zipfile.py#L247), which is thrown by:

(Pdb) fpin.seek(-sizeEndCentDir, 2)
fpin.seek(-sizeEndCentDir, 2)
*** io.UnsupportedOperation: can't do nonzero end-relative seeks

That failure then surfaces at https://github.com/python/cpython/blob/3.11/Lib/zipfile.py#L1367 as BadZipFile("File is not a zip file"). The zipfile code:

    def _RealGetContents(self):
        """Read in the table of contents for the ZIP file."""
        fp = self.fp
        try:
            endrec = _EndRecData(fp)
        except OSError:
            raise BadZipFile("File is not a zip file")
        if not endrec:
            raise BadZipFile("File is not a zip file")
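
The seek failure from the pdb session can be reproduced without S3 or a real workbook. This sketch assumes only that the handle is a text-mode file object; the 64-byte dummy file and the constant name `SIZE_END_CENT_DIR` are illustrative (22 is the size of a ZIP end-of-central-directory record without a comment).

```python
import io
import tempfile
from pathlib import Path

# Size of the end-of-central-directory record when the archive has no comment.
SIZE_END_CENT_DIR = 22

with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.bin"
    sample.write_bytes(b"x" * 64)  # contents do not matter for the seek itself

    # Text mode: a nonzero seek relative to the end of the file is refused.
    with open(sample, "r", encoding="utf-8", errors="surrogateescape") as text_handle:
        try:
            text_handle.seek(-SIZE_END_CENT_DIR, 2)
            text_error = None
        except io.UnsupportedOperation as exc:
            text_error = exc

    # Binary mode: the same seek succeeds and returns the new position.
    with open(sample, "rb") as binary_handle:
        binary_position = binary_handle.seek(-SIZE_END_CENT_DIR, 2)

# io.UnsupportedOperation derives from OSError, so an `except OSError`
# handler treats it like any other I/O failure.
caught_as_oserror = isinstance(text_error, OSError)
print(repr(text_error), caught_as_oserror, binary_position)
```

Because the exception is an OSError subclass, zipfile cannot tell "this stream cannot seek backwards from the end" apart from "this is not a ZIP archive", which would explain the misleading message.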

I was worried that my S3-sourced file was not completely streamed, but I ran the same size check on the local .xlsx file that the built-in zipfile._EndRecData(fpin) performs (https://github.com/python/cpython/blob/3.11/Lib/zipfile.py#L285-L293), i.e.:

>>> f = open("/path/to/file.xlsx", errors="surrogateescape")  # `errors` is used to match current code + avoid: `*** UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd7 in position 10: invalid continuation byte` when read.
>>> f.seek(0, 2)
>>> f.tell()
<int>  # Confirmed same for local vs S3 file pulled file during pdb session.
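
The size comparison above only uses seek(0, 2), which text mode allows, so it cannot exercise the end-relative seek that fails. A complementary check, sketched here as an assumption rather than something run in this thread, is to read the tail of the file in binary mode and look for the end-of-central-directory signature. The helper `ends_with_eocd` is hypothetical, and it only covers archives without a trailing comment.

```python
import tempfile
import zipfile
from pathlib import Path

EOCD_SIGNATURE = b"PK\x05\x06"  # end-of-central-directory marker
EOCD_SIZE = 22                  # record size when the archive has no comment


def ends_with_eocd(path: Path) -> bool:
    """True if the last 22 bytes of the file start with the EOCD signature."""
    with open(path, "rb") as handle:
        handle.seek(0, 2)
        if handle.tell() < EOCD_SIZE:
            return False
        handle.seek(-EOCD_SIZE, 2)
        return handle.read(4) == EOCD_SIGNATURE


with tempfile.TemporaryDirectory() as tmp:
    complete = Path(tmp) / "complete.xlsx"
    with zipfile.ZipFile(complete, "w") as archive:
        archive.writestr("xl/workbook.xml", "<workbook/>")

    # Simulate an incompletely streamed download by dropping the last bytes.
    truncated = Path(tmp) / "truncated.xlsx"
    truncated.write_bytes(complete.read_bytes()[:-10])

    complete_ok = ends_with_eocd(complete)
    truncated_ok = ends_with_eocd(truncated)

print(complete_ok, truncated_ok)
```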

amotl commented Dec 13, 2023

Hi there,

the same error also happens when running the test cases: test_handle_newlines_local_excel and test_smart_columns both fail with zipfile.BadZipFile: File is not a zip file. I haven't yet investigated whether this is related to your report.

With kind regards,
Andreas.
