Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Datasette shouldn't crash running against a database with missing extensions #2388

Open
simonw opened this issue Aug 12, 2024 · 9 comments
Open

Comments

@simonw
Copy link
Owner

simonw commented Aug 12, 2024

E.g. after https://til.simonwillison.net/sqlite/sqlite-vec my https://til.simonwillison.net/tils.db needs the vec0 extension if you are going to browse the vec_tils table, but the rest of the UI should work by skipping that table when scanning tables.

@asg017
Copy link
Collaborator

asg017 commented Aug 13, 2024

This happens because of check_connection() that we call on every database connection, which calls PRAGMA table_info() on every table. Which means it fails on virtual tables that don't have their parent extension loaded.

We could ignore these class of errors with:

elif e.args[0].startswith("no such module: "):
    pass

But other parts of the codebase will still call PRAGMA table_info() at other times after connection-time, like table_column_details().

We could audit all cases of this and just return None/[] when appropriate.

@simonw simonw mentioned this issue Aug 15, 2024
3 tasks
@simonw
Copy link
Owner Author

simonw commented Aug 15, 2024

I'm going to create a minimal sqlite-vec database to commit to our repo for the tests.

@simonw
Copy link
Owner Author

simonw commented Aug 15, 2024

Creating it with sqlite-utils:

sqlite-utils install sqlite-utils-sqlite-vec
echo '{"point": "[1,2,3]"}' | sqlite-utils insert vec.db vectors -
sqlite-utils vec.db 'create virtual table vec_vectors using vec0(point float[3]);'
sqlite-utils vec.db 'insert into vec_vectors(rowid, point) select rowid, point from vectors'

That's a 48KB file:

ls -lah vec.db 
-rw-r--r--  1 simon  wheel    48K Aug 15 09:26 vec.db

Datasette cannot currently load it (without its own plugin or the --load-extension option):

datasette vec.db
Usage: datasette serve [OPTIONS] [FILES]...
Try 'datasette serve --help' for help.

Error: Connection to vec.db failed check: no such module: vec0

This works:

datasette vec.db --load-extension "$(python -c 'import sqlite_vec; print(sqlite_vec.loadable_path())')"

@simonw
Copy link
Owner Author

simonw commented Aug 15, 2024

This conflicts slightly with Datasette's existing design with respect to SpatiaLite: currently Datasette shows you a useful error if you try to load a SpatiaLite database without activating the extension:

datasette tests/spatialite.db
Usage: datasette serve [OPTIONS] [FILES]...
Try 'datasette serve --help' for help.

Error: It looks like you're trying to load a SpatiaLite database without first loading the SpatiaLite module.

Try adding the --load-extension=spatialite option.

Read more: https://docs.datasette.io/en/stable/spatialite.html

@simonw
Copy link
Owner Author

simonw commented Aug 15, 2024

One solution: turn this into a warning that gets dumped to the console but Datasette still starts up as usual.

Could even expand that a bit, so it knows a few other extensions (like sqlite-vec) and can output versions of that helpful message for those as well.

@simonw
Copy link
Owner Author

simonw commented Aug 15, 2024

Relevant code:

datasette/datasette/cli.py

Lines 791 to 825 in 05dfd34

async def check_databases(ds):
# Run check_connection against every connected database
# to confirm they are all usable
for database in list(ds.databases.values()):
try:
await database.execute_fn(check_connection)
except SpatialiteConnectionProblem:
suggestion = ""
try:
find_spatialite()
suggestion = "\n\nTry adding the --load-extension=spatialite option."
except SpatialiteNotFound:
pass
raise click.UsageError(
"It looks like you're trying to load a SpatiaLite"
+ " database without first loading the SpatiaLite module."
+ suggestion
+ "\n\nRead more: https://docs.datasette.io/en/stable/spatialite.html"
)
except ConnectionProblem as e:
raise click.UsageError(
f"Connection to {database.path} failed check: {str(e.args[0])}"
)
# If --crossdb and more than SQLITE_LIMIT_ATTACHED show warning
if (
ds.crossdb
and len([db for db in ds.databases.values() if not db.is_memory])
> SQLITE_LIMIT_ATTACHED
):
msg = (
"Warning: --crossdb only works with the first {} attached databases".format(
SQLITE_LIMIT_ATTACHED
)
)
click.echo(click.style(msg, bold=True, fg="yellow"), err=True)

Which runs check_connection():

def check_connection(conn):
tables = [
r[0]
for r in conn.execute(
"select name from sqlite_master where type='table'"
).fetchall()
]
for table in tables:
try:
conn.execute(
f"PRAGMA table_info({escape_sqlite(table)});",
)
except sqlite3.OperationalError as e:
if e.args[0] == "no such module: VirtualSpatialIndex":
raise SpatialiteConnectionProblem(e)
else:
raise ConnectionProblem(e)

@simonw
Copy link
Owner Author

simonw commented Aug 15, 2024

Design options:

  1. Any tables that cannot be read by PRAGMA table_info() are excluded by Datasette entirely - it's as if they do not exist
  2. Those tables are listed as "unreadable" but are still shown somewhere in the Datasette interface or the Datasette API
  3. Tables are invisible within Datasette but are listed in messages dumped to stderr

The second option is nicer, but requires more design decisions: where should that message go? How should we explain it to users?

So I think option 3 is the way to go for the moment.

@simonw
Copy link
Owner Author

simonw commented Aug 15, 2024

OK, this is actually a lot of work to implement. I tried this so far:

diff --git a/datasette/utils/__init__.py b/datasette/utils/__init__.py
index 073d6e86..5469252b 100644
--- a/datasette/utils/__init__.py
+++ b/datasette/utils/__init__.py
@@ -19,6 +19,7 @@ import time
 import types
 import secrets
 import shutil
+import sys
 from typing import Iterable, List, Tuple
 import urllib
 import yaml
@@ -978,7 +979,8 @@ def check_connection(conn):
             if e.args[0] == "no such module: VirtualSpatialIndex":
                 raise SpatialiteConnectionProblem(e)
             else:
-                raise ConnectionProblem(e)
+                print(str(e), file=sys.stderr)
+                # raise ConnectionProblem(e)
 
 
 class BadMetadataError(Exception):
diff --git a/datasette/utils/internal_db.py b/datasette/utils/internal_db.py
index 626dd137..f5ae68b2 100644
--- a/datasette/utils/internal_db.py
+++ b/datasette/utils/internal_db.py
@@ -1,5 +1,6 @@
+import sys
 import textwrap
-from datasette.utils import table_column_details
+from datasette.utils import table_column_details, sqlite3
 
 
 async def init_internal_db(db):
@@ -137,7 +138,11 @@ async def populate_schema_tables(internal_db, db):
             tables_to_insert.append(
                 (database_name, table_name, table["rootpage"], table["sql"])
             )
-            columns = table_column_details(conn, table_name)
+            try:
+                columns = table_column_details(conn, table_name)
+            except sqlite3.OperationalError as ex:
+                print(f"Error accessing table {table_name}: {str(ex)}", file=sys.stderr)
+                continue
             columns_to_insert.extend(
                 {
                     **{"database_name": database_name, "table_name": table_name},

And it's revealing that there are MANY places in Datasette that might attempt to read the columns from one of these tables and run into an error.

For example, running this above patch with datasette tests/vec.db --pdb I got this traceback on trying to load the homepage:

INFO:     Uvicorn running on http://127.0.0.1:8002 (Press CTRL+C to quit)
Error accessing table vec_vectors: no such module: vec0
ERROR: conn=<sqlite3.Connection object at 0x110467140>, sql = 'select count(*) from [vec_vectors]', params = None: no such module: vec0
> /Users/simon/Dropbox/Development/datasette/datasette/utils/__init__.py(640)table_column_details()
-> for r in conn.execute(
(Pdb) c
Traceback (most recent call last):
  File "/Users/simon/Dropbox/Development/datasette/datasette/app.py", line 1745, in route_path
    response = await view(request, send)
  File "/Users/simon/Dropbox/Development/datasette/datasette/views/base.py", line 184, in view
    return await self.dispatch_request(request)
  File "/Users/simon/Dropbox/Development/datasette/datasette/views/base.py", line 138, in dispatch_request
    response = await handler(request)
  File "/Users/simon/Dropbox/Development/datasette/datasette/views/index.py", line 68, in get
    table_columns = await db.table_columns(table)
  File "/Users/simon/Dropbox/Development/datasette/datasette/database.py", line 424, in table_columns
    return await self.execute_fn(lambda conn: table_columns(conn, table))
  File "/Users/simon/Dropbox/Development/datasette/datasette/database.py", line 283, in execute_fn
    return await asyncio.get_event_loop().run_in_executor(
  File "/Users/simon/.pyenv/versions/3.10.4/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/Users/simon/Dropbox/Development/datasette/datasette/database.py", line 281, in in_thread
    return fn(conn)
  File "/Users/simon/Dropbox/Development/datasette/datasette/database.py", line 424, in <lambda>
    return await self.execute_fn(lambda conn: table_columns(conn, table))
  File "/Users/simon/Dropbox/Development/datasette/datasette/utils/__init__.py", line 632, in table_columns
    return [column.name for column in table_column_details(conn, table)]
  File "/Users/simon/Dropbox/Development/datasette/datasette/utils/__init__.py", line 640, in table_column_details
    for r in conn.execute(
sqlite3.OperationalError: no such module: vec0

I don't think I can fix this for the 1.0a15 alpha:

@simonw
Copy link
Owner Author

simonw commented Aug 15, 2024

This would be a bit easier if I moved all code in Datasette that loops through a list of tables to consult the internal database catalog tables for those, then I could include these unreadable tables in that but have a "unreadable = 1" column which could be used to filter them out.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants