Dataflow workers use private IP addresses #630

Open
samanvp opened this issue Jun 30, 2020 · 3 comments

@samanvp
Member

samanvp commented Jun 30, 2020

Our docs currently ask users to increase their quota for several resources, one of which is In-use IP addresses. Because Dataflow can use private IP addresses (see use_public_ips), we might be able to remove this requirement. In that case, we could use private IP addresses by default.

The one thing we need to verify is whether "Private Google Access" has to be enabled explicitly on the GCP network or whether it is enabled by default.
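One way to check this for a given subnet is sketched below. It only builds the gcloud command lines and interprets the printed value; it does not call gcloud itself. The helper names and the example subnet and region are hypothetical, and the sketch assumes the subnet resource exposes the setting as `privateIpGoogleAccess`.

```python
import shlex


def describe_subnet_command(subnet, region, project=None):
    """Build the gcloud command that prints a subnet's Private Google Access flag."""
    cmd = [
        "gcloud", "compute", "networks", "subnets", "describe", subnet,
        "--region", region,
        # Assumption: the subnet resource exposes the setting under this field name.
        "--format", "value(privateIpGoogleAccess)",
    ]
    if project:
        cmd += ["--project", project]
    return cmd


def enable_private_google_access_command(subnet, region):
    """Build the gcloud command that turns Private Google Access on for a subnet."""
    return [
        "gcloud", "compute", "networks", "subnets", "update", subnet,
        "--region", region,
        "--enable-private-ip-google-access",
    ]


def private_google_access_enabled(gcloud_output):
    """Interpret the text printed by the describe command (expected 'True' or 'False')."""
    return gcloud_output.strip().lower() == "true"


if __name__ == "__main__":
    # Hypothetical subnet and region; print the commands instead of running them.
    for cmd in (
        describe_subnet_command("default", "us-central1"),
        enable_private_google_access_command("default", "us-central1"),
    ):
        print(" ".join(shlex.quote(arg) for arg in cmd))
```

Printing the commands rather than executing them keeps the sketch safe to run anywhere; the output can be pasted into a shell once the subnet and region are filled in.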

This was suggested by @kemp-google during the review of #624.

@samanvp samanvp self-assigned this Jun 30, 2020
@samanvp
Member Author

samanvp commented Jul 17, 2020

From this page:

  • With Private Google Access, VMs that have only internal IP addresses can access select public IPs for Google Cloud and services.
  • Jobs that access APIs and services outside of Google Cloud require internet access. For example, Python SDK jobs need access to the Python Package Index (PyPI).
  • Read Managing Python Pipeline Dependencies on the Apache Beam website for more details.

It seems people have resolved this issue by utilizing a NAT gateway for their Dataflow workers.
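A rough sketch of what that NAT setup could look like is below. It builds the two gcloud commands (a Cloud Router, then a NAT config on it) without running them. The router and NAT names, network, and region are hypothetical placeholders, and nothing here has been verified against gcp-variant-transforms.

```python
import shlex


def cloud_nat_commands(network, region, router="dataflow-router", nat="dataflow-nat"):
    """Build gcloud commands for a Cloud Router plus a NAT config attached to it.

    The default router and NAT names are placeholders chosen for this sketch.
    """
    create_router = [
        "gcloud", "compute", "routers", "create", router,
        "--network", network,
        "--region", region,
    ]
    create_nat = [
        "gcloud", "compute", "routers", "nats", "create", nat,
        "--router", router,
        "--region", region,
        # Let Cloud NAT allocate its own external IPs and cover every subnet range.
        "--auto-allocate-nat-external-ips",
        "--nat-all-subnet-ip-ranges",
    ]
    # Order matters: the router must exist before the NAT config that references it.
    return [create_router, create_nat]


if __name__ == "__main__":
    # Hypothetical network and region; print the commands instead of running them.
    for cmd in cloud_nat_commands("default", "us-central1"):
        print(" ".join(shlex.quote(arg) for arg in cmd))
```

With a NAT like this in the workers' region, VMs that have only internal addresses should be able to reach external hosts such as PyPI, while the job itself still runs with public IPs disabled.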

@mbookman
Contributor

Has the use of a NAT gateway with gcp-variant-transforms been tested?
I tried using --use_public_ips false from the pipeline runner (after enabling Private Google Access) and ran into the problem @samanvp noted above. PyPI access failed:

WARNING: Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection
broken by 'ConnectTimeoutError(<pip._vendor.urllib3.connection.VerifiedHTTPSConnection object at 
0x7fa88eb5f6d0>, 'Connection to pypi.org timed out. (connect timeout=15)')': /simple/nose/

We are submitting a request for more in-use public IP address quota.

However, our current use is for the annotations pipeline. I didn't see usePrivateAddress being set anywhere, so it looks like the use_public_ips setting is not propagated there.

@moschetti
Member

@mbookman I tried adding the router and NAT and was able to run vcf_to_bq successfully with no public IPs.

You're right, though, that the annotation pipeline doesn't get that flag, so this won't resolve that use case.
