Sync with Cloudera Impyla #2

Draft
wants to merge 27 commits into
base: master

thyarles
Member

No description provided.

jasonmfehr and others added 16 commits June 7, 2023 21:31
…515)

that are applicable to username/password auth and JWT auth are not
mixed together on the same call to the connect method.

These additional checks prevent confusion about which authentication
method is actually used for the connection.

New tests were added to cover the new checks.
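
A minimal sketch of this kind of argument check, not Impyla's actual code. The function name `check_auth_args` and the parameters `user`, `password`, and `jwt` are assumptions made for illustration only.

```python
def check_auth_args(user=None, password=None, jwt=None):
    """Reject calls that mix username/password arguments with a JWT.

    Hypothetical helper: the names are illustrative, not Impyla's API.
    Returns the name of the authentication method the arguments select.
    """
    has_basic = user is not None or password is not None
    has_jwt = jwt is not None
    if has_basic and has_jwt:
        # Failing early avoids silently picking one method over the other.
        raise ValueError(
            "user/password and jwt cannot be passed in the same call")
    if has_jwt:
        return "jwt"
    if has_basic:
        return "basic"
    return "none"
```

Raising on the mixed case, rather than preferring one method, is what removes the ambiguity about which authentication method is in effect.
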
Impyla gets cookies from an HTTPMessage object formed from a
response to an HTTP message. The format of cookies in the message
differs across Python versions. In Python 2 the HTTPMessage is a
mimetools.Message object, and the Set-Cookie values all appear in a
single header, separated by newlines. In Python 3 the HTTPMessage is an
email.message.Message, and the Set-Cookie values appear as duplicate
headers.

Add platform dependent code to get_all_matching_cookies() that loads
cookies from all the Set-Cookie headers.

TESTING:
    Changed test_get_all_matching_cookies() to build the HTTPMessage
    using a new utility method that creates Set-Cookie headers in
    the appropriate format for the platform.

    I hand-tested with a proxy that inserted 3 cookies into HTTP
    responses. I added the 3 cookie names to the list of default
    cookies. I ran TestHttpConnect.test_simple_connect() connecting
    to Impala through the proxy and verified with the debugger that
    the cookies were returned correctly from
    get_all_matching_cookies() in both Python 2 and Python 3.
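
A rough sketch of the Python 3 side of this, where every Set-Cookie value is its own header. It is not the body of get_all_matching_cookies(); the helper name `collect_cookies` and its arguments are assumptions, and the Python 2 newline-separated form is not covered here.

```python
from email.message import Message
from http.cookies import SimpleCookie


def collect_cookies(headers, names):
    """Return {name: value} for the wanted cookies found in Set-Cookie headers.

    Hypothetical helper. `headers` is an email.message.Message, so each
    Set-Cookie value is a separate header and get_all() returns a list.
    """
    found = {}
    for raw in headers.get_all("Set-Cookie") or []:
        jar = SimpleCookie()
        jar.load(raw)  # parses "name=value; attr=..." into morsels
        for name, morsel in jar.items():
            if name in names:
                found[name] = morsel.value
    return found


# Build a response-like message with duplicate Set-Cookie headers.
msg = Message()
msg["Set-Cookie"] = "a=1; Path=/"
msg["Set-Cookie"] = "b=2; HttpOnly"
msg["Set-Cookie"] = "c=3"
cookies = collect_cookies(msg, {"a", "c"})
```

Reading only the first Set-Cookie header, for example with `headers.get("Set-Cookie")`, would miss `c` in this message, which is the failure mode the commit describes.
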
The current Usage section works well for Impala users but fails for Hive users because of the default value of `auth_mechanism`.
This adds a comment aimed at Hive users so they can get started quickly too.
ImpalaService.thrift is updated to contain CloseImpalaOperation,
which can be used to get the number of modified rows in DMLs.
This is not just a copy: some parts of ImpalaService.thrift are
not included to avoid pulling in more Thrift files as dependencies.

Also updated process_thrift.sh to work with current Impala env vars.
sqlalchemy 2 (now default on pip in Python 3) removed some
functions used in tests. Updated these to work both with
sqlalchemy 2.* and 1.* (>=1.2).
* Support Cursor.rowcount and close finished queries

With the current Impala server, rowcount support needs DMLs to be
closed with CloseImpalaOperation(), as there is no simpler way
to get the number of modified rows.
See https://issues.apache.org/jira/browse/IMPALA-12647 for
alternatives.

This change adds option close_finished_queries for cursors
with default True. Setting it to False brings back the old
behavior.

Once a query is closed, calling the get_log RPC is no longer
possible. If close_finished_queries is true, the logs are
fetched and stored before closing the query, so that get_log can
return the saved results. Generally get_log shouldn't be too
expensive an RPC.

Another potential side-effect is that get_profile may fail as
Impala can discard the runtime profile after the query is
closed (see Impala flag query_log_size).

Despite the above side effects, closing the queries seems a better
default behavior, as it helps avoid queries hanging in the
"waiting to be closed" state and provides a reliable rowcount. This
is also consistent with the way impala-shell works.

Testing:
- rowcount already had good coverage in DBAPI2 compliance tests
  (e.g. test_mixedfetch)
- new tests were added for some missing rowcount cases and for
  getting warning/error log for closed queries

* Fix review comments
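
The fetch-then-close ordering described above can be sketched as follows. This is an illustration under stated assumptions, not Impyla's cursor: `FakeOperation` and `SketchCursor` are hypothetical stand-ins, and only the option name `close_finished_queries` comes from the commit message.

```python
class FakeOperation:
    """Stand-in for a server-side query handle (hypothetical)."""

    def __init__(self, log_text):
        self._log_text = log_text
        self.closed = False

    def get_log(self):
        if self.closed:
            raise RuntimeError("operation is closed")
        return self._log_text

    def close(self):
        self.closed = True


class SketchCursor:
    """Caches the log before closing, so get_log() still works afterwards."""

    def __init__(self, close_finished_queries=True):
        self.close_finished_queries = close_finished_queries
        self._op = None
        self._saved_log = None

    def finish(self, op):
        """Called once the query has finished."""
        self._op = op
        self._saved_log = None
        if self.close_finished_queries:
            # Fetch first: the log RPC is not available after close.
            self._saved_log = op.get_log()
            op.close()

    def get_log(self):
        if self._saved_log is not None:
            return self._saved_log
        return self._op.get_log()
```

With `close_finished_queries=False` the sketch leaves the operation open and reads the log on demand, which corresponds to the old behavior.
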
The old version used deprecated functions that were
removed in Python 3.12.

The change only contains code generated by:
versioneer install
Python 3.12 removed the deprecated cert_file and key_file
arguments from http_client.HTTPSConnection. These should
always be empty in Impyla, as the server is never verified
in HTTPS connections (see #362).
* Add tox.ini to help testing with multiple python versions

* Revert "Update versioneer to 0.29 (needed for Python 3.12) (#532)"

This reverts commit b98ffef.
Member Author

@thyarles thyarles left a comment

LGTM.

csringhofer and others added 3 commits April 15, 2024 17:41
This function is called for every query during normal execution, making
this info level too verbose.
* Add text() wrapper for metadata queries.

Remove tablename from retrieve columnname results.

* Update sqlalchemy.py

remove tablename from get_columns result.

* replace 'r' in re.sub argument
@thyarles thyarles marked this pull request as draft April 16, 2024 16:10
@thyarles thyarles added the wontfix This will not be worked on label Apr 16, 2024
WWakker and others added 7 commits August 7, 2024 19:50
Adds wildcard ('*') support to the `http_cookie_names` connect property
to preserve all cookies returned by the server. Preserves prior behavior
for any other value of `http_cookie_names`.
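
One way the wildcard rule could look, as a sketch only. The helper `select_cookies` is hypothetical, and treating `http_cookie_names` as a comma-separated string is an assumption made for this example.

```python
def select_cookies(cookies, http_cookie_names):
    """Keep the cookies named in http_cookie_names, or all of them for '*'.

    Hypothetical helper: `cookies` is a plain {name: value} dict and
    `http_cookie_names` is '*', a comma-separated string of names, or None.
    """
    if http_cookie_names is None:
        return {}
    if http_cookie_names.strip() == "*":
        # Wildcard: preserve every cookie the server returned.
        return dict(cookies)
    wanted = {n.strip() for n in http_cookie_names.split(",") if n.strip()}
    return {k: v for k, v in cookies.items() if k in wanted}
```
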
* Avoid retrying non-idempotent RPCs in binary connections (#549)

See #549 for the detailed analysis of the issue.

The fix works similarly to the existing solution for http connections:
- each RPC knows whether it is idempotent
- if the error comes from establishing the connection, then retry
- if the error comes from executing the RPC, only retry if the RPC
  is idempotent
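
The three rules above can be sketched as a small retry loop. The exception classes, the `call_with_retry` name, and the `max_tries` parameter are assumptions for illustration and do not mirror Impyla's internals.

```python
class ConnectError(Exception):
    """Raised when the connection could not be established (hypothetical)."""


class ExecError(Exception):
    """Raised when the RPC failed after being sent (hypothetical)."""


def call_with_retry(open_conn, rpc, idempotent, max_tries=3):
    """Run rpc(conn), retrying only when it is safe to do so."""
    last_error = None
    for _ in range(max_tries):
        try:
            conn = open_conn()
        except ConnectError as e:
            last_error = e  # nothing was sent yet, always safe to retry
            continue
        try:
            return rpc(conn)
        except ExecError as e:
            if not idempotent:
                raise  # the server may have run it already, do not resend
            last_error = e
    raise last_error
```

The point of splitting the two `try` blocks is that a failure while connecting cannot have side effects on the server, while a failure during execution might have.
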

A test is added that relies on slow metadata handling in the
Impala cluster to trigger timeouts. It would be nice to add wider
and more reliable tests in the future, similar to the http tests
in test_hs2_fault_injection.py.

* Fix review comments

* Fix review comment
The goal is to support "long poll" (IMPALA-13294). When query option
long_polling_time_ms is set, the Impala server will wait in
GetOperationStatus for this time (or until the query status changes).
This allows detecting earlier that a query has finished without making
GetOperationStatus RPCs more frequent.

If long_polling_time_ms is not used, then the effect should be minor:
GetOperationStatus is a quick RPC, so the time it takes should mainly
come from network delay.

_get_sleep_interval() is not changed (min 0.01s, max 1s) to avoid
regression in existing use cases. It could be useful to override
this in a later patch based on the value of long_polling_time_ms.
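
A sketch of a status-polling loop with a sleep interval kept within the bounds mentioned above (min 0.01s, max 1s). The thresholds inside `sleep_interval` and both function names are illustrative assumptions, not the values or code of `_get_sleep_interval()`.

```python
import time


def sleep_interval(elapsed):
    """Step function that stays within the stated bounds (0.01s to 1s).

    The thresholds here are illustrative assumptions, not Impyla's values.
    """
    if elapsed < 0.05:
        return 0.01
    if elapsed < 1.0:
        return 0.05
    if elapsed < 10.0:
        return 0.1
    if elapsed < 60.0:
        return 0.5
    return 1.0


def wait_until_finished(get_status, clock=time.monotonic, sleep=time.sleep):
    """Poll get_status() until it stops returning 'RUNNING'.

    With long polling enabled on the server, each get_status() call may
    itself block until the state changes, so fewer round trips are needed.
    """
    start = clock()
    while True:
        status = get_status()
        if status != "RUNNING":
            return status
        sleep(sleep_interval(clock() - start))
```

Passing `clock` and `sleep` as parameters is only a convenience for testing the loop without real delays.
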
Supported Python versions are also updated in setup.py.
The issue was introduced in #542.
Caught by Impala's LdapImpylaHttpTest.
Labels
wontfix This will not be worked on