Releases: apache/beam
Beam 2.51.0 release
We are happy to present the new 2.51.0 release of Beam.
This release includes both improvements and new functionality.
See the download page for this release.
For more information on changes in 2.51.0, check out the detailed release notes.
New Features / Improvements
- In Python, RunInference now supports loading many models in the same transform using a KeyedModelHandler (#27628).
- In Python, the VertexAIModelHandlerJSON now supports passing in inference_args. These will be passed through to the Vertex endpoint as parameters.
- Added support to run mypy on user pipelines (#27906)
Breaking Changes
- Removed fastjson library dependency for Beam SQL. Table property is changed to be based on jackson ObjectNode (Java) (#24154).
- Removed TensorFlow from Beam Python container images. If you have been negatively affected by this change, please comment on #20605.
- Removed the parameter t reflect.Type from parquetio.Write. The element type is derived from the input PCollection (Go) (#28490)
- Refactor BeamSqlSeekableTable.setUp adding a parameter joinSubsetType. #28283
Bugfixes
- Fixed exception chaining issue in GCS connector (Python) (#26769).
- Fixed streaming inserts exception handling, GoogleAPICallErrors are now retried according to retry strategy and routed to failed rows where appropriate rather than causing a pipeline error (Python) (#21080).
- Fixed a bug in Python SDK's cross-language Bigtable sink that mishandled records that don't have an explicit timestamp set: #28632.
Security Fixes
- Python containers updated, fixing CVE-2021-30474, CVE-2021-30475, CVE-2021-30473, CVE-2020-36133, CVE-2020-36131, CVE-2020-36130, and CVE-2020-36135
- Used go 1.21.1 to build, fixing CVE-2023-39320
Known Issues
- Python pipelines using BigQuery Storage Read API must pin fastavro dependency to 1.8.3 or earlier: #28811
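One way to act on the fastavro known issue is to pin the dependency (for example with the pip specifier fastavro<=1.8.3) and check the installed version before submitting a pipeline. The following is a minimal sketch of such a check, not something Beam ships; the helper names are hypothetical, and the version parsing is deliberately simple.

```python
from importlib import metadata

# Upper bound taken from the known-issue note above.
MAX_FASTAVRO = (1, 8, 3)


def parse_version(text):
    """Turn '1.8.3' into (1, 8, 3); any non-numeric suffix on a part is ignored."""
    parts = []
    for piece in text.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits or 0))
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def fastavro_pin_ok(version_text):
    """True when the given fastavro version is 1.8.3 or earlier."""
    return parse_version(version_text) <= MAX_FASTAVRO


def installed_fastavro_ok():
    """Hypothetical pre-flight helper; returns None if fastavro is not installed."""
    try:
        return fastavro_pin_ok(metadata.version("fastavro"))
    except metadata.PackageNotFoundError:
        return None
```

Calling installed_fastavro_ok() before pipeline submission gives an early signal; a False result means the environment still needs the pin.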
List of Contributors
According to git shortlog, the following people contributed to the 2.51.0 release. Thank you to all contributors!
Adam Whitmore
Ahmed Abualsaud
Ahmet Altay
Aleksandr Dudko
Alexey Romanenko
Anand Inguva
Andrey Devyatkin
Arvind Ram
Arwin Tio
BjornPrime
Bruno Volpato
Bulat
Celeste Zeng
Chamikara Jayalath
Clay Johnson
Damon
Danny McCormick
David Cavazos
Dip Patel
Hai Joey Tran
Hao Xu
Haruka Abe
Jack Dingilian
Jack McCluskey
Jeff Kinard
Jeffrey Kinard
Joey Tran
Johanna Öjeling
Julien Tournay
Kenneth Knowles
Kerry Donny-Clark
Mattie Fu
Melissa Pashniak
Michel Davit
Moritz Mack
Pranav Bhandari
Rebecca Szper
Reeba Qureshi
Reuven Lax
Ritesh Ghorse
Robert Bradshaw
Robert Burke
Ruwann
Ryan Tam
Sam Rohde
Sereana Seim
Svetak Sundhar
Tim Grein
Udi Meiri
Valentyn Tymofieiev
Vitaly Terentyev
Vlado Djerek
Xinyu Liu
Yi Hu
Zbynek Konecny
Zechen Jiang
bzablocki
caneff
dependabot[bot]
gDuperran
gabry.wu
johnjcasey
kberezin-nshl
kennknowles
liferoad
lostluck
magicgoody
martin trieu
mosche
olalamichelle
tvalentyn
xqhu
Łukasz Spyra
Beam 2.50.0 release
We are happy to present the new 2.50.0 release of Beam.
This release includes both improvements and new functionality.
See the download page for this release.
For more information on changes in 2.50.0, check out the detailed release notes.
Highlights
- Spark 3.2.2 is used as default version for Spark runner (#23804).
- The Go SDK has a new default local runner, called Prism (#24789).
- All Beam released container images are now multi-arch images that support both x86 and ARM CPU architectures.
I/Os
- Java KafkaIO now supports picking up topics via topicPattern (#26948)
- Support for read from Cosmos DB Core SQL API (#23604)
- Upgraded to HBase 2.5.5 for HBaseIO. (Java) (#27711)
- Added support for GoogleAdsIO source (Java) (#27681).
New Features / Improvements
- The Go SDK now requires Go 1.20 to build. (#27558)
- The Go SDK has a new default local runner, Prism. (#24789).
- Prism is a portable runner that executes each transform independently, ensuring coders.
- At this point it supersedes the Go direct runner in functionality. The Go direct runner is now deprecated.
- See https://github.com/apache/beam/blob/master/sdks/go/pkg/beam/runners/prism/README.md for the goals and features of Prism.
- Hugging Face Model Handler for RunInference added to Python SDK. (#26632)
- Hugging Face Pipelines support for RunInference added to Python SDK. (#27399)
- Vertex AI Model Handler for RunInference now supports private endpoints (#27696)
- MLTransform transform added with support for common ML pre/postprocessing operations (#26795)
- Upgraded the Kryo extension for the Java SDK to Kryo 5.5.0. This brings in bug fixes, performance improvements, and serialization of Java 14 records. (#27635)
- All Beam released container images are now multi-arch images that support both x86 and ARM CPU architectures. (#27674). The multi-arch container images include:
- All versions of Go, Python, Java and Typescript SDK containers.
- All versions of Flink job server containers.
- Java and Python expansion service containers.
- Transform service controller container.
- Spark3 job server container.
- Added support for batched writes to AWS SQS for improved throughput (Java, AWS 2). (#21429)
Breaking Changes
- Python SDK: Legacy runner support removed from Dataflow, all pipelines must use runner v2.
- Python SDK: Dataflow Runner will no longer stage Beam SDK from PyPI in the --staging_location at pipeline submission. Custom container images that are not based on Beam's default image must include Apache Beam installation. (#26996)
Deprecations
- The Go Direct Runner is now Deprecated. It remains available to reduce migration churn.
- Tests can be set back to the direct runner by overriding TestMain:
func TestMain(m *testing.M) { ptest.MainWithDefault(m, "direct") }
- It's recommended to fix issues seen in tests using Prism, as they can also happen on any portable runner.
- Use the generic register package for your pipeline DoFns to ensure pipelines function on portable runners, like prism.
- Do not rely on closures or using package globals for DoFn configuration. They don't function on portable runners.
Bugfixes
- Fixed DirectRunner bug in Python SDK where GroupByKey gets empty PCollection and fails when pipeline option direct_num_workers!=1. (#27373)
- Fixed BigQuery I/O bug when estimating size on queries that utilize row-level security (#27474)
List of Contributors
According to git shortlog, the following people contributed to the 2.50.0 release. Thank you to all contributors!
Abacn
acejune
AdalbertMemSQL
ahmedabu98
Ahmed Abualsaud
al97
Aleksandr Dudko
Alexey Romanenko
Anand Inguva
Andrey Devyatkin
Anton Shalkovich
ArjunGHUB
Bjorn Pedersen
BjornPrime
Brett Morgan
Bruno Volpato
Buqian Zheng
Burke Davison
Byron Ellis
bzablocki
case-k
Celeste Zeng
Chamikara Jayalath
Clay Johnson
Connor Brett
Damon
Damon Douglas
Dan Hansen
Danny McCormick
Darkhan Nausharipov
Dip Patel
Dmytro Sadovnychyi
Florent Biville
Gabriel Lacroix
Hai Joey Tran
Hong Liang Teoh
Jack McCluskey
James Fricker
Jeff Kinard
Jeff Zhang
Jing
johnjcasey
jon esperanza
Josef Šimánek
Kenneth Knowles
Laksh
Liam Miller-Cushon
liferoad
magicgoody
Mahmud Ridwan
Manav Garg
Marco Vela
martin trieu
Mattie Fu
Michel Davit
Moritz Mack
mosche
Peter Sobot
Pranav Bhandari
Reeba Qureshi
Reuven Lax
Ritesh Ghorse
Robert Bradshaw
Robert Burke
RyuSA
Saba Sathya
Sam Whittle
Steven Niemitz
Steven van Rossum
Svetak Sundhar
Tony Tang
Valentyn Tymofieiev
Vitaly Terentyev
Vlado Djerek
Yichi Zhang
Yi Hu
Zechen Jiang
Beam 2.49.0 release
We are happy to present the new 2.49.0 release of Beam.
This release includes both improvements and new functionality.
See the download page for this release.
For more information on changes in 2.49.0, check out the detailed release notes.
I/Os
- Support for Bigtable Change Streams added in Java BigtableIO.ReadChangeStream (#27183).
- Added Bigtable Read and Write cross-language transforms to Python SDK ((#26593), (#27146)).
New Features / Improvements
- Allow prebuilding large images when using --prebuild_sdk_container_engine=cloud_build, like images depending on tensorflow or torch (#27023).
- Disabled pip cache when installing packages on the workers. This reduces the size of prebuilt Python container images (#27035).
- Select dedicated avro datum reader and writer (Java) (#18874).
- Timer API for the Go SDK (Go) (#22737).
Deprecations
- Remove Python 3.7 support. (#26447)
Bugfixes
- Fixed KinesisIO NullPointerException when a progress check is made before the reader is started (IO) (#23868)
List of Contributors
According to git shortlog, the following people contributed to the 2.49.0 release. Thank you to all contributors!
Abzal Tuganbay
AdalbertMemSQL
Ahmed Abualsaud
Ahmet Altay
Alan Zhang
Alexey Romanenko
Anand Inguva
Andrei Gurau
Arwin Tio
Bartosz Zablocki
Bruno Volpato
Burke Davison
Byron Ellis
Chamikara Jayalath
Charles Rothrock
Chris Gavin
Claire McGinty
Clay Johnson
Damon
Daniel Dopierała
Danny McCormick
Darkhan Nausharipov
David Cavazos
Dip Patel
Dmitry Repin
Gavin McDonald
Jack Dingilian
Jack McCluskey
James Fricker
Jan Lukavský
Jasper Van den Bossche
John Casey
John Gill
Joseph Crowley
Kanishk Karanawat
Katie Liu
Kenneth Knowles
Kyle Galloway
Liam Miller-Cushon
MakarkinSAkvelon
Masato Nakamura
Mattie Fu
Michel Davit
Naireen Hussain
Nathaniel Young
Nelson Osacky
Nick Li
Oleh Borysevych
Pablo Estrada
Reeba Qureshi
Reuven Lax
Ritesh Ghorse
Robert Bradshaw
Robert Burke
Rouslan
Saadat Su
Sam Rohde
Sam Whittle
Sanil Jain
Shunping Huang
Smeet nagda
Svetak Sundhar
Timur Sultanov
Udi Meiri
Valentyn Tymofieiev
Vlado Djerek
WuA
XQ Hu
Xianhua Liu
Xinyu Liu
Yi Hu
Zachary Houfek
alexeyinkin
bigduu
bullet03
bzablocki
jonathan-lemos
jubebo
magicgoody
ruslan-ikhsan
sultanalieva-s
vitaly.terentyev
Beam 2.48.0 release
We are happy to present the new 2.48.0 release of Beam.
This release includes both improvements and new functionality.
See the download page for this release.
For more information on changes in 2.48.0, check out the detailed release notes.
Note: The release tag for Go SDK for this release is sdks/v2.48.2 instead of sdks/v2.48.0 because of incorrect commit attached to the release tag sdks/v2.48.0.
Highlights
- "Experimental" annotation cleanup: the annotation and concept have been removed from Beam to avoid the misperception of code as "not ready". Any proposed breaking changes will be subject to case-by-case pro/con decision making (and generally avoided) rather than using the "Experimental" annotation to allow them.
I/Os
- Added rename for GCS and copy for local filesystem (Go) (#25779).
- Added support for enhanced fan-out in KinesisIO.Read (Java) (#19967).
- This change is not compatible with Flink savepoints created by Beam 2.46.0 applications which had KinesisIO sources.
- Added textio.ReadWithFilename transform (Go) (#25812).
- Added fileio.MatchContinuously transform (Go) (#26186).
New Features / Improvements
- Allow passing service name for google-cloud-profiler (Python) (#26280).
- Dead letter queue support added to RunInference in Python (#24209).
- Support added for defining pre/postprocessing operations on the RunInference transform (#26308)
- Adds a Docker Compose based transform service that can be used to discover and use portable Beam transforms (#26023).
Breaking Changes
- Passing a tag into MultiProcessShared is now required in the Python SDK (#26168).
- CloudDebuggerOptions is removed (deprecated in Beam v2.47.0) for Dataflow runner as the Google Cloud Debugger service is shutting down. (Java) (#25959).
- AWS 2 client providers (deprecated in Beam v2.38.0) are finally removed (#26681).
- AWS 2 SnsIO.writeAsync (deprecated in Beam v2.37.0 due to risk of data loss) was finally removed (#26710).
- AWS 2 coders (deprecated in Beam v2.43.0 when adding Schema support for AWS Sdk Pojos) are finally removed (#23315).
Bugfixes
- Fixed Java bootloader failing with Too Long Args due to long classpaths, with a pathing jar. (Java) (#25582).
List of Contributors
According to git shortlog, the following people contributed to the 2.48.0 release. Thank you to all contributors!
Abzal Tuganbay
Ahmed Abualsaud
Alexey Romanenko
Anand Inguva
Andrei Gurau
Andrey Devyatkin
Balázs Németh
Bazyli Polednia
Bruno Volpato
Chamikara Jayalath
Clay Johnson
Damon
Daniel Arn
Danny McCormick
Darkhan Nausharipov
Dip Patel
Dmitry Repin
George Novitskiy
Israel Herraiz
Jack Dingilian
Jack McCluskey
Jan Lukavský
Jasper Van den Bossche
Jeff Zhang
Jeremy Edwards
Johanna Öjeling
John Casey
Katie Liu
Kenneth Knowles
Kerry Donny-Clark
Kuba Rauch
Liam Miller-Cushon
MakarkinSAkvelon
Mattie Fu
Michel Davit
Moritz Mack
Nick Li
Oleh Borysevych
Pablo Estrada
Pranav Bhandari
Pranjal Joshi
Rebecca Szper
Reuven Lax
Ritesh Ghorse
Robert Bradshaw
Robert Burke
Rouslan
RuiLong J
RyujiTamaki
Sam Whittle
Sanil Jain
Svetak Sundhar
Timur Sultanov
Tony Tang
Udi Meiri
Valentyn Tymofieiev
Vishal Bhise
Vitaly Terentyev
Xinyu Liu
Yi Hu
bullet03
darshan-sj
kellen
liferoad
mokamoka03210120
psolomin
Beam 2.47.0 release
We are happy to present the new 2.47.0 release of Beam.
This release includes both improvements and new functionality.
See the download page for this release.
For more information on changes in 2.47.0, check out the detailed release notes.
Highlights
- Apache Beam adds Python 3.11 support (#23848).
I/Os
- BigQuery Storage Write API is now available in Python SDK via cross-language (#21961).
- Added HbaseIO support for writing RowMutations (ordered by rowkey) to Hbase (Java) (#25830).
- Added fileio transforms MatchFiles, MatchAll and ReadMatches (Go) (#25779).
- Add integration test for JmsIO + fix issue with multiple connections (Java) (#25887).
New Features / Improvements
- The Flink runner now supports Flink 1.16.x (#25046).
- Schema'd PTransforms can now be directly applied to Beam dataframes just like PCollections. (Note that when doing multiple operations, it may be more efficient to explicitly chain the operations like df | (Transform1 | Transform2 | ...) to avoid excessive conversions.)
- The Go SDK adds new transforms periodic.Impulse and periodic.Sequence that extend support for slowly updating side input patterns. (#23106)
- Python SDK now supports protobuf <4.23.0 (#24599)
- Several Google client libraries in Python SDK dependency chain were updated to latest available major versions. (#24599)
Breaking Changes
- If a main session fails to load, the pipeline will now fail at worker startup. (#25401).
- Python pipeline options will now ignore unparsed command line flags prefixed with a single dash. (#25943).
- The SmallestPerKey combiner now requires keyword-only arguments for specifying optional parameters, such as key and reverse. (#25888).
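To show what "keyword-only" means for callers of SmallestPerKey, the sketch below uses a plain-Python stand-in with the same parameter names. It is not Beam's implementation: the function name smallest_per_key and the exact behavior of reverse are assumptions made for illustration.

```python
import heapq
from collections import defaultdict


def smallest_per_key(pairs, n, *, key=None, reverse=False):
    """Stand-in for illustration only; not Beam's implementation.

    Returns the n smallest values for each key. With reverse=True the
    ordering is flipped, so the n largest values are returned instead.
    The bare * makes key and reverse keyword-only.
    """
    grouped = defaultdict(list)
    for k, v in pairs:
        grouped[k].append(v)
    pick = heapq.nlargest if reverse else heapq.nsmallest
    return {k: pick(n, vs, key=key) for k, vs in grouped.items()}


pairs = [("a", 3), ("a", 1), ("a", 2), ("b", -5), ("b", 4)]
by_value = smallest_per_key(pairs, 2)                # optional arguments omitted
by_magnitude = smallest_per_key(pairs, 1, key=abs)   # optional argument passed by name
# smallest_per_key(pairs, 2, None, True) would raise TypeError:
# key and reverse can no longer be passed positionally.
```

Call sites that passed key or reverse positionally need to switch to the named form when upgrading.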
Deprecations
- Cloud Debugger support and its pipeline options are deprecated and will be removed in the next Beam version,
in response to the Google Cloud Debugger service turning down.
(Java) (#25959).
Bugfixes
- BigQuery sink in STORAGE_WRITE_API mode in batch pipelines might result in data consistency issues during the handling of other unrelated transient errors for Beam SDKs 2.35.0 - 2.46.0 (inclusive). For more details see: #26521
List of Contributors
According to git shortlog, the following people contributed to the 2.47.0 release. Thank you to all contributors!
Ahmed Abualsaud
Ahmet Altay
Alexey Romanenko
Amir Fayazi
Amrane Ait Zeouay
Anand Inguva
Andrew Pilloud
Andrey Kot
Bjorn Pedersen
Bruno Volpato
Buqian Zheng
Chamikara Jayalath
ChangyuLi28
Damon
Danny McCormick
Dmitry Repin
George Ma
Jack Dingilian
Jack McCluskey
Jasper Van den Bossche
Jeremy Edwards
Jiangjie (Becket) Qin
Johanna Öjeling
Juta Staes
Kenneth Knowles
Kyle Weaver
Mattie Fu
Moritz Mack
Nick Li
Oleh Borysevych
Pablo Estrada
Rebecca Szper
Reuven Lax
Reza Rokni
Ritesh Ghorse
Robert Bradshaw
Robert Burke
Saadat Su
Saifuddin53
Sam Rohde
Shubham Krishna
Svetak Sundhar
Theodore Ni
Thomas Gaddy
Timur Sultanov
Udi Meiri
Valentyn Tymofieiev
Xinyu Liu
Yanan Hao
Yi Hu
Yuvi Panda
andres-vv
bochap
dannikay
darshan-sj
dependabot[bot]
harrisonlimh
hnnsgstfssn
jrmccluskey
liferoad
tvalentyn
xianhualiu
zhangskz
Beam 2.46.0 release
We are happy to present the new 2.46.0 release of Beam.
This release includes both improvements and new functionality.
See the download page for this release.
For more information on changes in 2.46.0, check out the detailed release notes.
Highlights
- Java SDK containers migrated to Eclipse Temurin as a base. This change migrates away from the deprecated OpenJDK container. Eclipse Temurin is currently based upon Ubuntu 22.04 while the OpenJDK container was based upon Debian 11.
- RunInference PTransform will accept model paths as SideInputs in Python SDK. (#24042)
- RunInference supports ONNX runtime in Python SDK (#22972)
- Tensorflow Model Handler for RunInference in Python SDK (#25366)
- Java SDK modules migrated to use :sdks:java:extensions:avro (#24748)
I/Os
- Added in JmsIO a retry policy for failed publications (Java) (#24971).
- Support for LZMA compression/decompression of text files added to the Python SDK (#25316)
- Added ReadFrom/WriteTo Csv/Json as top-level transforms to the Python SDK.
New Features / Improvements
- Add UDF metrics support for Samza portable mode.
- Option for SparkRunner to avoid the need of SDF output to fit in memory (#23852). This helps e.g. with ParquetIO reads. Turn the feature on by adding experiment use_bounded_concurrent_output_for_sdf.
- Add WatchFilePattern transform, which can be used as a side input to the RunInference PTransform to watch for model updates using a file pattern. (#24042)
- Add support for loading TorchScript models with PytorchModelHandler. The TorchScript model path can be passed to PytorchModelHandler using torch_script_model_path=<path_to_model>. (#25321)
- The Go SDK now requires Go 1.19 to build. (#25545)
- The Go SDK now has an initial native Go implementation of a portable Beam Runner called Prism. (#24789)
- For more details and current state see https://github.com/apache/beam/tree/master/sdks/go/pkg/beam/runners/prism.
Breaking Changes
- The deprecated SparkRunner for Spark 2 (see 2.41.0) was removed (#25263).
- Python's BatchElements performs more aggressive batching in some cases,
capping at 10 second rather than 1 second batches by default and excluding
fixed cost in this computation to better handle cases where the fixed cost
is larger than a single second. To get the old behavior, one can pass
target_batch_duration_secs_including_fixed_cost=1
to BatchElements.
Deprecations
- Avro related classes are deprecated in module beam-sdks-java-core and will be eventually removed. Please migrate to the new module beam-sdks-java-extensions-avro instead by importing the classes from the org.apache.beam.sdk.extensions.avro package. For the sake of migration simplicity, the relative package path and the whole class hierarchy of Avro related classes in the new module are preserved the same as before. For example, import the org.apache.beam.sdk.extensions.avro.coders.AvroCoder class instead of org.apache.beam.sdk.coders.AvroCoder. (#24749).
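Because the relative package path is preserved, the Avro import migration amounts to a prefix swap. The sketch below is a hypothetical helper script, not a Beam tool, that rewrites Java import lines for a caller-supplied list of classes. Only AvroCoder is named in these notes; any other class names passed in are the caller's responsibility.

```python
import re

OLD_ROOT = "org.apache.beam.sdk."
NEW_ROOT = "org.apache.beam.sdk.extensions.avro."


def migrate_avro_imports(java_source, avro_classes):
    """Rewrite imports of the listed classes to the Avro extension package.

    avro_classes holds package-relative names such as "coders.AvroCoder".
    Imports of classes that are not listed are left untouched.
    """
    for rel_name in avro_classes:
        pattern = r"\bimport\s+" + re.escape(OLD_ROOT + rel_name) + r"\s*;"
        replacement = "import " + NEW_ROOT + rel_name + ";"
        java_source = re.sub(pattern, replacement, java_source)
    return java_source


source = (
    "import org.apache.beam.sdk.coders.AvroCoder;\n"
    "import org.apache.beam.sdk.coders.StringUtf8Coder;\n"
)
migrated = migrate_avro_imports(source, ["coders.AvroCoder"])
```

Running the helper a second time on migrated leaves it unchanged, since the old import path no longer appears. The build must also declare the beam-sdks-java-extensions-avro module for the new imports to resolve.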
List of Contributors
According to git shortlog, the following people contributed to the 2.46.0 release. Thank you to all contributors!
Ahmet Altay
Alan Zhang
Alexey Romanenko
Amrane Ait Zeouay
Anand Inguva
Andrew Pilloud
Brian Hulette
Bruno Volpato
Byron Ellis
Chamikara Jayalath
Damon
Danny McCormick
Darkhan Nausharipov
David Katz
Dmitry Repin
Doug Judd
Egbert van der Wal
Elizaveta Lomteva
Evan Galpin
Herman Mak
Jack McCluskey
Jan Lukavský
Johanna Öjeling
John Casey
Jozef Vilcek
Junhao Liu
Juta Staes
Katie Liu
Kiley Sok
Liam Miller-Cushon
Luke Cwik
Moritz Mack
Ning Kang
Oleh Borysevych
Pablo E
Pablo Estrada
Reuven Lax
Ritesh Ghorse
Robert Bradshaw
Robert Burke
Ruslan Altynnikov
Ryan Zhang
Sam Rohde
Sam Whittle
Sam sam
Sergei Lilichenko
Shivam
Shubham Krishna
Theodore Ni
Timur Sultanov
Tony Tang
Vachan
Veronica Wasson
Vincent Devillers
Vitaly Terentyev
William Ross Morrow
Xinyu Liu
Yi Hu
ZhengLin Li
Ziqi Ma
ahmedabu98
alexeyinkin
aliftadvantage
bullet03
dannikay
darshan-sj
dependabot[bot]
johnjcasey
kamrankoupayi
kileys
liferoad
nancyxu123
nickuncaged1201
pablo rodriguez defino
tvalentyn
xqhu
Beam 2.45.0 release
We are happy to present the new 2.45.0 release of Beam.
This release includes both improvements and new functionality.
See the download page for this release.
For more information on changes in 2.45.0, check out the detailed release notes.
I/Os
- MongoDB IO connector added (Go) (#24575).
New Features / Improvements
- RunInference Wrapper with Sklearn Model Handler support added in Go SDK (#24497).
- Adding override of allowed TLS algorithms (Java), now maintaining the disabled/legacy algorithms present in 2.43.0 (up to 1.8.0_342, 11.0.16, 17.0.2 for respective Java versions). This is accompanied by an explicit re-enabling of TLSv1 and TLSv1.1 for Java 8 and Java 11.
- Add UDF metrics support for Samza portable mode.
Breaking Changes
- Portable Java pipelines, Go pipelines, Python streaming pipelines, and portable Python batch pipelines on Dataflow are required to use Runner V2. The disable_runner_v2, disable_runner_v2_until_2023, disable_prime_runner_v2 experiments will raise an error during pipeline construction. You can no longer specify the Dataflow worker jar override. Note that non-portable Java jobs and non-portable Python batch jobs are not impacted. (#24515).
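Launch scripts that still pass one of the three rejected experiments will now fail at pipeline construction. A minimal sketch of a pre-flight check over command-line arguments follows. The function name is hypothetical, and it assumes experiments are passed in the --experiments=a,b or --experiment=a form.

```python
REJECTED_EXPERIMENTS = {
    "disable_runner_v2",
    "disable_runner_v2_until_2023",
    "disable_prime_runner_v2",
}


def find_rejected_experiments(argv):
    """Return the rejected experiment names present in the given arguments.

    Only the flag=value spelling is inspected; space-separated values are
    not handled in this sketch.
    """
    found = []
    for arg in argv:
        flag, sep, value = arg.partition("=")
        if sep and flag in ("--experiments", "--experiment"):
            for name in value.split(","):
                name = name.strip()
                if name in REJECTED_EXPERIMENTS:
                    found.append(name)
    return found


args = ["--runner=DataflowRunner", "--experiments=disable_runner_v2,shuffle_mode=service"]
offending = find_rejected_experiments(args)
```

A non-empty result points at flags to delete from the launch command before upgrading.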
Bugfixes
- Avoids Cassandra syntax error when user-defined query has no where clause in it (Java) (#24829).
- Fixed JDBC connection failures (Java) during handshake due to deprecated TLSv1(.1) protocol for the JDK. (#24623)
- Fixed an issue where Python BigQuery Batch Load write may truncate valid data when the write disposition is set to WRITE_TRUNCATE and incoming data is large (Python) (#24623).
- Fixed Kafka watermark issue with sparse data on many partitions (#24205)
List of Contributors
According to git shortlog, the following people contributed to the 2.45.0 release. Thank you to all contributors!
AdalbertMemSQL
Ahmed Abualsaud
Ahmet Altay
Alexey Romanenko
Anand Inguva
Andrea Nardelli
Andrei Gurau
Andrew Pilloud
Benjamin Gonzalez
BjornPrime
Brian Hulette
Bulat
Byron Ellis
Chamikara Jayalath
Charles Rothrock
Damon
Daniela Martín
Danny McCormick
Darkhan Nausharipov
Dejan Spasic
Diego Gomez
Dmitry Repin
Doug Judd
Elias Segundo Antonio
Evan Galpin
Evgeny Antyshev
Fernando Morales
Jack McCluskey
Johanna Öjeling
John Casey
Junhao Liu
Kanishk Karanawat
Kenneth Knowles
Kiley Sok
Liam Miller-Cushon
Lucas Marques
Luke Cwik
MakarkinSAkvelon
Marco Robles
Mark Zitnik
Melanie
Moritz Mack
Ning Kang
Oleh Borysevych
Pablo Estrada
Philippe Moussalli
Piyush Sagar
Rebecca Szper
Reuven Lax
Rick Viscomi
Ritesh Ghorse
Robert Bradshaw
Robert Burke
Sam Whittle
Sergei Lilichenko
Seung Jin An
Shane Hansen
Sho Nakatani
Shunya Ueta
Siddharth Agrawal
Timur Sultanov
Veronica Wasson
Vitaly Terentyev
Xinbin Huang
Xinyu Liu
Xinyue Zhang
Yi Hu
ZhengLin Li
alexeyinkin
andoni-guzman
andthezhang
bullet03
camphillips22
gabihodoroaga
harrisonlimh
pablo rodriguez defino
ruslan-ikhsan
tvalentyn
yyy1000
zhengbuqian
v2.44.0
We are happy to present the new 2.44.0 release of Beam.
This release includes both improvements and new functionality.
See the download page for this release.
For more information on changes in 2.44.0, check out the detailed release notes.
I/Os
- Support for Bigtable sink (Write and WriteBatch) added (Go) (#23324).
- S3 implementation of the Beam filesystem (Go) (#23991).
- Support for SingleStoreDB source and sink added (Java) (#22617).
- Added support for DefaultAzureCredential authentication in Azure Filesystem (Python) (#24210).
- Added new CdapIO for CDAP Batch and Streaming Source/Sinks (Java) (#24961).
- Added new SparkReceiverIO for Spark Receivers 2.4.* (Java) (#24960).
New Features / Improvements
- Beam now provides a portable "runner" that can render pipeline graphs with graphviz. See python -m apache_beam.runners.render --help for more details.
- Local packages can now be used as dependencies in the requirements.txt file, rather than requiring them to be passed separately via the --extra_package option (Python) (#23684).
- Pipeline Resource Hints now supported via --resource_hints flag (Go) (#23990).
- Make Python SDK containers reusable on portable runners by installing dependencies to temporary venvs (BEAM-12792).
- RunInference model handlers now support the specification of a custom inference function in Python (#22572)
- Support for map_windows urn added to Go SDK (#24307).
Breaking Changes
- ParquetIO.withSplit was removed since splittable reading has been the default behavior since 2.35.0. The effect of this change is to drop support for non-splittable reading (Java) (#23832).
- beam-sdks-java-extensions-google-cloud-platform-core is no longer a dependency of the Java SDK Harness. Some users of a portable runner (such as Dataflow Runner v2) may have an undeclared dependency on this package (for example using GCS with TextIO) and will now need to declare the dependency.
- beam-sdks-java-core is no longer a dependency of the Java SDK Harness. Users of a portable runner (such as Dataflow Runner v2) will need to provide this package and its dependencies.
- Slices now use the Beam Iterable Coder. This enables cross language use, but breaks pipeline updates if a Slice type is used as a PCollection element or State API element. (Go) #24339
Bugfixes
- Fixed JmsIO acknowledgment issue (Java) (#20814)
- Fixed Beam SQL CalciteUtils (Java) and cross-language JdbcIO (Python) not supporting JDBC CHAR/VARCHAR, BINARY/VARBINARY logical types (#23747, #23526).
- Ensure iterated and emitted types used with the generic register package are registered with the type and schema registries. (Go) (#23889)
List of Contributors
According to git shortlog, the following people contributed to the 2.44.0 release. Thank you to all contributors!
Ahmed Abualsaud
Ahmet Altay
Alex Merose
Alexey Inkin
Alexey Romanenko
Anand Inguva
Andrei Gurau
Andrej Galad
Andrew Pilloud
Ayush Sharma
Benjamin Gonzalez
Bjorn Pedersen
Brian Hulette
Bruno Volpato
Bulat Safiullin
Chamikara Jayalath
Chris Gavin
Damon Douglas
Danielle Syse
Danny McCormick
Darkhan Nausharipov
David Cavazos
Dmitry Repin
Doug Judd
Elias Segundo Antonio
Evan Galpin
Evgeny Antyshev
Heejong Lee
Henrik Heggelund-Berg
Israel Herraiz
Jack McCluskey
Jan Lukavský
Janek Bevendorff
Johanna Öjeling
John J. Casey
Jozef Vilcek
Kanishk Karanawat
Kenneth Knowles
Kiley Sok
Laksh
Liam Miller-Cushon
Luke Cwik
MakarkinSAkvelon
Minbo Bae
Moritz Mack
Nancy Xu
Ning Kang
Nivaldo Tokuda
Oleh Borysevych
Pablo Estrada
Philippe Moussalli
Pranav Bhandari
Rebecca Szper
Reuven Lax
Rick Smit
Ritesh Ghorse
Robert Bradshaw
Robert Burke
Ryan Thompson
Sam Whittle
Sanil Jain
Scott Strong
Shubham Krishna
Steven van Rossum
Svetak Sundhar
Thiago Nunes
Tianyang Hu
Trevor Gevers
Valentyn Tymofieiev
Vitaly Terentyev
Vladislav Chunikhin
Xinyu Liu
Yi Hu
Yichi Zhang
AdalbertMemSQL
agvdndor
andremissaglia
arne-alex
bullet03
camphillips22
capthiron
creste
fab-jul
illoise
kn1kn1
nancyxu123
peridotml
shinannegans
smeet07
Beam 2.43.0 release
We are happy to present the new 2.43.0 release of Beam.
This release includes both improvements and new functionality.
See the download page for this release.
For more information on changes in 2.43.0, check out the detailed release notes.
Highlights
- Python 3.10 support in Apache Beam (#21458).
- An initial implementation of a runner that allows us to run Beam pipelines on Dask. Try it out and give us feedback! (Python) (#18962).
I/Os
- Decreased TextSource CPU utilization by 2.3x (Java) (#23193).
- Fixed bug when using SpannerIO with RuntimeValueProvider options (Java) (#22146).
- Fixed issue for unicode rendering on WriteToBigQuery (#10785)
- Remove obsolete variants of BigQuery Read and Write, always using Beam-native variant (#23564 and #23559).
- Bumped google-cloud-spanner dependency version to 3.x for Python SDK (#21198).
New Features / Improvements
- Dataframe wrapper added in Go SDK via Cross-Language (with automatic expansion service). (Go) (#23384).
- Name all Java threads to aid in debugging (#23049).
- An initial implementation of a runner that allows us to run Beam pipelines on Dask. (Python) (#18962).
- Allow configuring GCP OAuth scopes via pipeline options. This unblocks usages of Beam IOs that require additional scopes. For example, this feature makes it possible to access Google Drive backed tables in BigQuery (#23290).
- An example for using Python RunInference from Java (#23290).
Breaking Changes
- CoGroupByKey transform in Python SDK has changed the output typehint. The typehint component representing grouped values changed from List to Iterable, which more accurately reflects the nature of the arbitrarily large output collection. #21556 Beam users may see an error on transforms downstream from CoGroupByKey. Users must change methods expecting a List to expect an Iterable going forward. See document for information and fixes.
- The PortableRunner for Spark assumes Spark 3 as default Spark major version unless configured otherwise using --spark_version. Spark 2 support is deprecated and will be removed soon (#23728).
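For the CoGroupByKey typehint change, downstream functions should rely only on iteration over the grouped values. The sketch below shows a plain-Python function written against Iterable rather than List. The function name and the "orders" and "refunds" tags are hypothetical, and the element shape (key, dict of tag to values) is assumed for illustration.

```python
from typing import Dict, Iterable, Tuple


def summarize_join(element: Tuple[str, Dict[str, Iterable[int]]]) -> Tuple[str, int, int]:
    """Hypothetical downstream function for one CoGroupByKey-style element."""
    key, grouped = element
    # Avoid len(), indexing and slicing: an Iterable only guarantees iteration.
    order_count = sum(1 for _ in grouped["orders"])
    refund_total = sum(grouped["refunds"])
    return key, order_count, refund_total


# Works with lazily produced values as well as with lists.
lazy = ("k", {"orders": iter([1, 2, 3]), "refunds": (x for x in [4, 5])})
eager = ("k", {"orders": [1, 2, 3], "refunds": [4, 5]})
lazy_result = summarize_join(lazy)
eager_result = summarize_join(eager)
```

Where list behavior is genuinely needed, wrapping the values in list(...) restores it, at the cost of holding the whole group in memory.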
Bugfixes
- Fixed an issue where the Python cross-language JDBC IO Connector could not read or write rows containing Numeric/Decimal type values (#19817).
List of Contributors
According to git shortlog, the following people contributed to the 2.43.0 release. Thank you to all contributors!
Ahmed Abualsaud
AlexZMLyu
Alexey Romanenko
Anand Inguva
Andrew Pilloud
Andy Ye
Arnout Engelen
Benjamin Gonzalez
Bharath Kumarasubramanian
BjornPrime
Brian Hulette
Bruno Volpato
Chamikara Jayalath
Colin Versteeg
Damon
Daniel Smilkov
Daniela Martín
Danny McCormick
Darkhan Nausharipov
David Huntsperger
Denis Pyshev
Dmitry Repin
Evan Galpin
Evgeny Antyshev
Fernando Morales
Geddy05
Harshit Mehrotra
Iñigo San Jose Visiers
Ismaël Mejía
Israel Herraiz
Jan Lukavský
Juta Staes
Kanishk Karanawat
Kenneth Knowles
KevinGG
Kiley Sok
Liam Miller-Cushon
Luke Cwik
Mc
Melissa Pashniak
Moritz Mack
Ning Kang
Pablo Estrada
Philippe Moussalli
Pranav Bhandari
Rebecca Szper
Reuven Lax
Ritesh Ghorse
Robert Bradshaw
Robert Burke
Ryan Thompson
Ryohei Nagao
Sam Rohde
Sam Whittle
Sanil Jain
Seunghwan Hong
Shane Hansen
Shubham Krishna
Shunsuke Otani
Steve Niemitz
Steven van Rossum
Svetak Sundhar
Thiago Nunes
Toran Sahu
Veronica Wasson
Vitaly Terentyev
Vladislav Chunikhin
Xinyu Liu
Yi Hu
Yixiao Shen
alexeyinkin
arne-alex
azhurkevich
bulat safiullin
bullet03
coldWater
dpcollins-google
egalpin
johnjcasey
liferoad
rvballada
shaojwu
tvalentyn
What's Changed
- Use cloudpickle for Java Python transforms. by @robertwb in #23073
- clean up comments and register functional DoFn in wordcount.go by @pcoet in #23057
- [Tour Of Beam][backend] integration tests and GA workflow by @eantyshev in #23032
- [Test] Decrease derby.locks.waitTimeout in jdbc unit test by @Abacn in #23019
- Auto-cancel old unit test Actions Runs by @Abacn in #23095
- Cross-langauge tests in github actions. by @robertwb in #23092
- Update CHANGES.md for 2.42.0 cut, and add 2.43.0 section by @lostluck in #23108
- remove "io/ioutil" package by @zaneli in #23001
- Add one NER example to use a spaCy model with RunInference by @liferoad in #23035
- Bump google.golang.org/api from 0.94.0 to 0.95.0 in /sdks by @dependabot in #23062
- Implement JsonUtils by @damondouglas in #22771
- Support models returning a dictionary of outputs by @damccorm in #23087
- [TPC-DS] Store metrics into BigQuery and InfluxDB by @aromanenko-dev in #22545
- [Website] Update from-spark page table content overflow by @bullet03 in #22915
- [Website] update homepage mobile styles by @bullet03 in #22810
- Use a ClassLoadingStrategy that is compatible with Java 17+ by @cushon in #23055
- [Website] Update case-studies logo images by @bullet03 in #22793
- [Website] Update ctas button container on homepage by @bullet03 in #22498
- [Website] fix code tags content overflow by @bullet03 in #22427
- Clean up Kafka Cluster and pubsub topic in rc validation script by @Abacn in #23021
- Fix assertions in the Spanner IO IT tests by @BjornPrime in #23098
- [Website] update shortcode languages by @bullet03 in #22275
- Use existing pickle_library flag in expansion service. by @robertwb in #23111
- Assert pipeline results in performance tests by @Abacn in #23027
- Consolidate Samza TranslationContext and PortableTranslationContext by @mynameborat in #23072
- Improvements to SchemaTransform implementations for BQ and Kafka by @pabloem in #23045
- [TPC-DS] Use common queries argument for Jenkins jobs by @aromanenko-dev in #23139
- pubsublite: Reduce commit logspam by @dpcollins-google in #22762
- [GitHub Actions] - Added documentation in ACTIONS.md by @dannymartinm in #23159
- Bump dataflow java fnapi container version to beam-master-20220830 by @Abacn in #23183
- [Issue#23071] Fix AfterProcessingTime for Python to behave like Java by @InigoSJ in #23100
- Don't depend on java 11 docker container for go test by @kileys in #23197
- Properly close Spark (streaming) context if Pipeline translation fails by @mosche in #23204
- Annotate stateful VR test in TestStreamTest with UsesStatefulParDo (related to #22472) by @mosche in #23202
- [Playground] [Backend] Datastore queries and mappers to get precompiled objects by @vchunikhin in #22868
- Allow and test pyarrow 8.x and 9.x by @TheNeuralBit in #22997
- (BQ Python) Pass project field from options or parameter when writing with dynamic destinations by @ahmedabu98 in #23011
- Update python-machine-learning.md by @AnandInguva in #23209
- Pin the version of cloudpickle to 2.1.x by @tvalentyn in #23120
- Add streaming test for Write API sink by @AlexZMLyu in #21903
- [Go SDK] Proto changes for timer param by @riteshghorse in #23216
- Bump github.com/testcontainers/testcontainers-go from 0.13.0 to 0.14.0 in /sdks by @dependabot in #23201
- Update to objsize to 0.5.2 which is under BSD-3 license (fixes #23096) by @lukecwik in #23211
- Exclude insignificant whitespace from cloud object by @csteegz in #23217
- Trying out property-based tests for Beam python coders by @pabloem in #22233
- Publish results of JMH benchmark runs (Java SDK) to InfluxDB (#22238). by @mosche in #23041
- Exclude protobuf 3.20.2 by @Abacn in https://github.com/apache/beam/pull/...
Beam 2.42.0 release
We are happy to present the new 2.42.0 release of Beam.
This release includes both improvements and new functionality.
See the download page for this release.
For more information on changes in 2.42.0, check out the detailed release notes.
Highlights
- Added support for stateful DoFns to the Go SDK.
New Features / Improvements
- Added support for Zstd compression to the Python SDK.
- Added support for Google Cloud Profiler to the Go SDK.
- Added support for stateful DoFns to the Go SDK.
Breaking Changes
- The Go SDK's Row Coder now uses a different single-precision float encoding for float32 types to match Java's behavior (#22629).
Bugfixes
- Fixed an issue where the Python cross-language JDBC IO Connector could not read or write rows containing Timestamp type values (#19817).
Known Issues
- Go SDK doesn't yet support Slowly Changing Side Input pattern (#23106)
- See a full list of open issues that affect this version.
What's Changed
- Remove stripping of step name and replace with substring search by @AnandInguva in #22415
- [Website] Remove beam-summit 2022 by @bullet03 in #22444
- Add read/write PubSub integration example fhirio pipeline by @lnogueir in #22306
- [Go SDK]: Remove deprecated Session runner by @jrmccluskey in #22505
- Add Go test status to the PR template by @jrmccluskey in #22508
- Fix typo in Datastore V1ReadIT test by @yixiaoshen in #22484
- Remove unnecessary reference to use_runner_v2 experiment in x-lang examples and documentation by @chamikaramj in #22376
- Relax the google-api-core dependency. by @tvalentyn in #22513
- Bump google.golang.org/protobuf from 1.28.0 to 1.28.1 in /sdks by @dependabot in #22517
- Bump google.golang.org/api from 0.89.0 to 0.90.0 in /sdks by @dependabot in #22518
- Change _build import from setuptools to distutils by @AnandInguva in #22503
- Remove stringx package by @damccorm in #22534
- Improve concrete error message by @damccorm in #22536
- Exclude grpcio==1.48.0 by @tvalentyn in #22539
- Fix JDBCIOIT by @Abacn in #22304
- Update pytest to support Python 3.10 by @AnandInguva in #22055
- Update the imprecise link. by @tvalentyn in #22549
- Remove normalization in Pytorch Image Segmentation example by @yeandy in #22371
- Downgrade less informative logs during write to files by @Abacn in #22273
- Add zstd compression/decompression support by @grufino in #22419
- Beam ml notebooks by @AnandInguva in #22510
- [Go SDK]: Add clearer error message for xlang transforms on the Go Direct Runner by @jrmccluskey in #22562
- [CdapIO] Add integration tests for CdapIO (Batch) by @Amar3tto in #22313
- Bugfix: Fix broken assertion in PipelineTest by @mosche in #22485
- Mention Java RunInference support in the Website by @chamikaramj in #22557
- Update run_inference_basic.ipynb by @AnandInguva in #22567
- Update CHANGE.md after 2.41.0 cut by @Abacn in #22577
- Convert to BeamSchema type from ReadfromBQ by @svetakvsundhar in #17159
- Fix deleteTimer in InMemoryTimerInternals and enable VR tests for GroupIntoBatches. by @mosche in #22525
- Update Dataflow container version by @yeandy in #22580
- [22188]Set allowed timestamp skew by @reuvenlax in #22347
- Added experimental annotation to fixes #22564 by @ryanthompson591 in #22565
- [BEAM-14117, #21519] Delete vendored bytebuddy gradle build by @lukecwik in #22594
- Add Import transform to Go FhirIO by @lnogueir in #22460
- Moving misplaced CHANGES from template to 2.41.0 by @Abacn in #22581
- Allow unsafe triggers for python nexmark benchmarks by @y1chi in #22596
- pubsublite: Fix max offset for computing backlog by @dpcollins-google in #22585
- Add support when writing to locked buckets by handling retentionPolicyNotMet error by @ahmedabu98 in #22138
- [BEAM-14118, #21639] Vendor gRPC 1.48.1 by @lukecwik in #22607
- [21894] Validates inference_args early by @ryanthompson591 in #22282
- Return type for _ExpandIntoRanges DoFn should be Iterable. by @jonathanasdf in #22548
- Add PyDoc buttons to the top and bottom of the Machine Learning page by @rszper in #22458
- [Playground]: Modified WithKeys Playground Example by @VladMatyunin in #22326
- [Playground][Backend][Bug]: Moving the initialization of properties file by @vchunikhin in #22310
- [Playground] Remove Beam Summit banner from Playground by @miamihotline in #22410
- Bump cloud.google.com/go/bigquery from 1.36.0 to 1.37.0 in /sdks by @dependabot in #22598
- Minor: Clean up an assertion in schemas_test by @TheNeuralBit in #22613
- Exclude testWithShardedKeyInGlobalWindow on streaming runner v1 by @TheNeuralBit in #22593
- Add an example for Distinct PTransform by @shhivam in #22417
- Pub/Sub Schema Transform Read Provider by @damondouglas in #22145
- Update BigQuery URI validation to allow more valid URIs through by @TheMichaelHu in #22452
- Fix bug in StructUtils of SpannerIO by @manitgupta in #22429
- Add units tests for SpannerIO by @manitgupta in #22428
- Bump google.golang.org/api from 0.90.0 to 0.91.0 in /sdks by @dependabot in #22568
- Fix for #22631 KafkaIO considers readCommitted() as it would commit back the offsets, which it doesn't by @nbali in #22633
- [CdapIO] Add CdapIO dashboard in Grafana by @Amar3tto in #22641
- Fix retaining unsaved pipeline options (#22075) by @alexeyinkin in #22098
- Add information on how to take/close issues in the contribution guide. by @damccorm in #22640
- Removed VladMatyunin from beam collaborators by @olehborysevych in #22634
- Skip dataflow_exercise_metrics_pipeline_test.ExerciseMetricsPipelineTest.test_metrics_it by @yeandy in #22623
- Add stdlib distutils env variable while building the wheels by @AnandInguva in #22635
- Persist ghprbPullId parameter in seed job by @TheNeuralBit in #22579
- Adhoc: Fix logging in Spark runner to avoid unnecessary creation of strings by @mosche in #22638
- Improve exception when requested error tag does not exist (#22401) by @bvolpato in #22405
- Reimplement Pub/Sub Lite's I/O using UnboundedSource. by @dpcollins-google in #22612
- [Website] update contribution content collapse by @bullet03 in #22468
- Clean up checkstyle suppressions.xml by @Abacn in #22649
- [Playground] [Infrastructure] Uniform code style for python scripts by @vchunikhin in #22291
- Minor: Add helpful names for parameterized tests in dataframe.schemas_test by @TheNeuralBit in #22630
- [BEAM-14118, fixes #21639] Use vendored gRPC 1.48.1 by @lukecwik in #22628
- Change Python PostCommits timeout by @yeandy in #22655
- Revert "Persist ghprbPullId parameter in seed job" by @damccorm in #22656
- Bump actions/setup-java from 2 to 3 by @dependabot in #22666
- Bump actions/labeler from 3 to 4 by @dependabot in #22670
- Bump actions/setup-node from 2 to 3 by @dependabot in #22671
- Bump actions/setup-go from 2 to 3 by @dependabot in #22669
- Bump actions/setup-python from 2 to 4 by @dependabot in #22668
- Bump actions/checkout from 2 to 3 by @dependabot in #22667
- Fix broken link to Retry Policy blog by @nikhilnadig28 in https://gith...