"\\" in row breaks PyDruid - JSONDecodeError('Unterminated string...') #242

Open
dklei opened this issue Nov 27, 2020 · 2 comments · May be fixed by #249

dklei commented Nov 27, 2020

Hi,

I'm using pydruid.db.connector to run a query that returns a row whose content ends in "...\\". This appears to break pydruid: it either silently drops rows from the data or fails with a JSONDecodeError.

e.g. "SELECT x FROM y" -> [{"x": "some row"},{"x": "...\\"},{"x": "another row"},{"x": "more rows"}]

2020-11-27 10:44:23: [CRITICAL] JSONDecodeError('Unterminated string starting at: line 1 column 85919 (char 85918)')
2020-11-27 10:44:23: [CRITICAL] Traceback (most recent call last):
  File "xxxxx", line 291, in main
    data_paths = pull_data(tracker.last_data_dt, tracker.next_data_dt)
  File "xxxxx", line 162, in pull_data
    data_path = collector.execute_and_save()
  File "xxxxx", line 226, in execute_and_save
    for i, row in enumerate(cursor):
  File "xxxxx", line 181, in _get_cursor
    raise err
  File "xxxxx", line 164, in _get_cursor
    raise err
  File "xxxxx", line 161, in _get_cursor
    r = next(cursor)
  File "/xxxx/venv/lib64/python3.8/site-packages/pydruid/db/api.py", line 62, in g
    return f(self, *args, **kwargs)
  File "/xxxx/venv/lib64/python3.8/site-packages/pydruid/db/api.py", line 320, in next
    return next(self._results)
  File "/xxxx/venv/lib64/python3.8/site-packages/pydruid/db/api.py", line 370, in _stream_query
    for row in rows_from_chunks(chunks):
  File "/xxxx/venv/lib64/python3.8/site-packages/pydruid/db/api.py", line 420, in rows_from_chunks
    for row in json.loads(
  File "/usr/lib64/python3.8/json/__init__.py", line 370, in loads
    return cls(**kw).decode(s)
  File "/usr/lib64/python3.8/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib64/python3.8/json/decoder.py", line 353, in raw_decode
    obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Unterminated string starting at: line 1 column 85919 (char 85918)

Any rows following the {"x": "...\\"} row either return no data or raise a JSONDecodeError. I'm guessing this is because pydruid.db.api.rows_from_chunks tries to parse the JSON itself and mishandles a "\\" at the end of a string?
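
To illustrate the guess above, here is a minimal sketch of how a hand-written scanner could misjudge the closing quote of a string that ends in "\\". It assumes the scanner decides whether a quote is escaped by looking only at the single preceding character; that is an assumption for illustration, not a confirmed description of rows_from_chunks. Both helper names are hypothetical.

```python
import json


def quote_is_escaped_naive(text, i):
    # Assumed buggy check: any quote preceded by a backslash counts as escaped.
    return i > 0 and text[i - 1] == "\\"


def quote_is_escaped(text, i):
    # A quote is escaped only if an odd number of backslashes precedes it.
    backslashes = 0
    j = i - 1
    while j >= 0 and text[j] == "\\":
        backslashes += 1
        j -= 1
    return backslashes % 2 == 1


# Raw JSON text of one row; the value is "..." followed by one escaped backslash.
row = r'{"x": "...\\"}'
closing = row.rindex('"')

print(json.loads(row))                       # the row itself is valid JSON
print(quote_is_escaped_naive(row, closing))  # naive check calls the quote escaped
print(quote_is_escaped(row, closing))        # parity check sees the string end
```

With the naive check the scanner would stay "inside" the string past its real end, which would be consistent with later rows being dropped or the truncated text failing to decode.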

I have attached a script and a dummy JSON file (scratch.zip) that show the function dropping rows. The script does not trigger the JSONDecodeError, though; that seems to happen only when I read this row and the surrounding rows from the database.

Many thanks in advance

@dklei dklei changed the title JSONDecodeError('Unterminated string starting at: line 1 column 85919 (char 85918)') "\\" in row breaks PyDruid - JSONDecodeError('Unterminated string starting at: line 1 column 85919 (char 85918)') Nov 27, 2020
@dklei dklei changed the title "\\" in row breaks PyDruid - JSONDecodeError('Unterminated string starting at: line 1 column 85919 (char 85918)') "\\" in row breaks PyDruid - JSONDecodeError('Unterminated string...') Nov 27, 2020
@ahiijny ahiijny linked a pull request Feb 12, 2021 that will close this issue
@tvamsisai

Hey, I'm hitting this issue. Is it possible to fix this soon?
This is quite a severe issue as it fails silently.

@gianm @mistercrunch

ahiijny commented Dec 12, 2023

This is still an issue in v0.6.6:

>>> from importlib.metadata import version
>>> version('pydruid')
'0.6.6'

To replicate:

from pydruid.db.api import rows_from_chunks

bad_json = """[
    {
        "id": 1,
        "value": "hi"
    },
    {
        "id": 2,
        "value": "C:\\\\"
    },
    {
        "id": 3,
        "value": "this row is missing..."
    }
]"""

for row in rows_from_chunks([bad_json]):
    print(f"row from bad json: {row}")

print("that's all!")

This prints:

row from bad json: OrderedDict([('id', 1), ('value', 'hi')])
that's all!

There are rows missing!

The suggested change in #262 seems to fix this problem. If I paste in the updated function definition from that PR and then rerun the above script, it prints the expected result:

row from bad json: OrderedDict([('id', 1), ('value', 'hi')])
row from bad json: OrderedDict([('id', 2), ('value', 'C:\\')])
row from bad json: OrderedDict([('id', 3), ('value', 'this row is missing...')])
that's all!
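
For anyone who needs a stopgap, below is a rough sketch of an escape-aware row splitter. It is not the code from the PR and not pydruid's API: iter_rows is a hypothetical helper, and it assumes the response is a JSON array of objects delivered as string chunks. It tracks an explicit "escaped" flag instead of peeking at the previous character, so the state also survives a chunk boundary that falls between two backslashes.

```python
import json
from collections import OrderedDict


def iter_rows(chunks):
    """Yield each top-level JSON object from a chunked JSON array (hypothetical helper)."""
    buffer = ""        # text of the object currently being collected
    depth = 0          # nesting depth of {...}, counted outside strings only
    in_string = False
    escaped = False    # True when the previous character was an unconsumed backslash
    for chunk in chunks:
        for char in chunk:
            if depth > 0:
                buffer += char
            if in_string:
                if escaped:
                    escaped = False      # this character is consumed by the escape
                elif char == "\\":
                    escaped = True
                elif char == '"':
                    in_string = False
                continue
            if char == '"':
                in_string = True
            elif char == "{":
                if depth == 0:
                    buffer = char        # start collecting a new row
                depth += 1
            elif char == "}":
                depth -= 1
                if depth == 0:
                    yield json.loads(buffer, object_pairs_hook=OrderedDict)
                    buffer = ""


# Same data as the reproduction above, written as raw JSON text.
bad_json = (
    r'[{"id": 1, "value": "hi"}, {"id": 2, "value": "C:\\"}, '
    r'{"id": 3, "value": "this row is missing..."}]'
)

# Split between the two backslashes to exercise state carried across chunks.
split = bad_json.index("\\") + 1
rows = list(iter_rows([bad_json[:split], bad_json[split:]]))

for row in rows:
    print(f"row: {row}")
```

The character-by-character string concatenation keeps the sketch short; a real implementation would likely buffer more efficiently.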


3 participants