Automatic retry with larger schema inference length if errors occur #789

Open
lars-reimann opened this issue May 18, 2024 · 0 comments
Labels
enhancement 💡 New feature or request

lars-reimann commented May 18, 2024

Is your feature request related to a problem?

The issue with lazy evaluation of data is that errors only surface when we collect the data. At that point, it's no longer possible to fix errors that were caused by earlier steps.

For example, if later rows don't match the inferred schema, an error is thrown. Users must then change their code, e.g. their call to Table.from_csv_file, and set the inference length (#749) or override parts of the schema (#754).

Ideally, we should automatically recover from such errors.

Desired solution

In Table, don't store a lazy frame directly. Instead, store a factory function that produces a lazy frame. This allows

  1. passing arguments from later steps to produce the lazy frame,
  2. trying again (with different arguments).
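A minimal sketch of the factory idea, using only the standard library. All names here (LazyTable, build, csv_factory, and the dict standing in for a lazy frame) are hypothetical and not part of the current Table API; the point is only that the table keeps a callable, so arguments can still be supplied or changed after construction.

```python
from __future__ import annotations

from collections.abc import Callable
from dataclasses import dataclass
from typing import Any

# Hypothetical stand-in for a lazy frame: just a description of the planned read.
LazyFrame = dict[str, Any]


@dataclass(frozen=True)
class LazyTable:
    """Stores a factory that produces the lazy frame, not the lazy frame itself."""

    # Assumed signature: keyword arguments in, freshly built lazy frame out.
    factory: Callable[..., LazyFrame]

    def build(self, **overrides: Any) -> LazyFrame:
        # Arguments from later steps (or from a retry) are passed to the factory here.
        return self.factory(**overrides)


def csv_factory(path: str) -> Callable[..., LazyFrame]:
    """Return a factory for one CSV file; parameter names are illustrative."""

    def make(
        *,
        infer_schema_length: int | None = 100,
        string_columns: tuple[str, ...] = (),
    ) -> LazyFrame:
        return {
            "path": path,
            "infer_schema_length": infer_schema_length,
            "string_columns": string_columns,
        }

    return make


table = LazyTable(csv_factory("data.csv"))
first_try = table.build()                             # default arguments
second_try = table.build(infer_schema_length=10_000)  # same table, different arguments
```

Because the factory is called anew each time, the second build does not depend on any state left behind by the first one.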

When the lazy frame is collected, catch relevant errors and rebuild the lazy frame

  1. with a larger schema inference length,
  2. if that fails, with some columns forced to string type.
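The two recovery steps above could be sketched roughly as follows. This is a toy, standard-library-only illustration: read_csv, SchemaMismatchError, and collect_with_retry are hypothetical names, the reader only distinguishes int from string, and building and collecting happen in one call for brevity. The growth factor and the maximum inference length are arbitrary assumptions.

```python
from __future__ import annotations

import csv
import io
from collections.abc import Callable

Rows = list[dict[str, object]]


class SchemaMismatchError(ValueError):
    """Hypothetical error: a value contradicts the inferred schema."""

    def __init__(self, column: str, value: str) -> None:
        super().__init__(f"column {column!r}: cannot parse {value!r} as int")
        self.column = column


def _is_int(value: object) -> bool:
    try:
        int(value)  # type: ignore[arg-type]
    except (TypeError, ValueError):
        return False
    return True


def read_csv(
    text: str,
    *,
    infer_schema_length: int | None,
    string_columns: frozenset[str] = frozenset(),
) -> Rows:
    """Toy reader: infer int/str per column from the first rows, then parse all rows."""
    records = list(csv.DictReader(io.StringIO(text)))
    sample = records if infer_schema_length is None else records[:infer_schema_length]
    columns = list(records[0]) if records else []
    int_columns = {
        c for c in columns
        if c not in string_columns and all(_is_int(r[c]) for r in sample)
    }
    rows: Rows = []
    for record in records:
        row: dict[str, object] = {}
        for column, value in record.items():
            if column in int_columns:
                if not _is_int(value):
                    raise SchemaMismatchError(column, value)
                row[column] = int(value)
            else:
                row[column] = value
        rows.append(row)
    return rows


def collect_with_retry(
    factory: Callable[..., Rows],
    *,
    initial_length: int = 2,
    max_length: int = 1_000,
) -> Rows:
    """Retry with a larger inference length first, then force columns to string."""
    length = initial_length
    forced: set[str] = set()
    while True:
        try:
            return factory(infer_schema_length=length, string_columns=frozenset(forced))
        except SchemaMismatchError as error:
            if length < max_length:
                length = min(length * 10, max_length)  # step 1: look at more rows
            else:
                forced.add(error.column)  # step 2: give up on typing this column


# The last row breaks the int schema inferred for "id" from the first rows.
DATA = "id,count\n1,10\n2,20\n3,30\n4,40\nx5,50\n"
rows = collect_with_retry(lambda **kwargs: read_csv(DATA, **kwargs))
```

The loop terminates because each pass through step 2 forces a column that was not forced before, and there are only finitely many columns. Only the offending column falls back to string; "count" keeps its inferred int type.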

We need to make sure that this works properly with memoization, though.

Possible alternatives (optional)

No response

Screenshots (optional)

No response

Additional Context (optional)

No response

@lars-reimann lars-reimann added the enhancement 💡 New feature or request label May 18, 2024
Projects
Status: Backlog