
feat(python): Add DataFrame.write_iceberg #15018

Open: kevinjqliu wants to merge 8 commits into main

Conversation

@kevinjqliu commented Mar 12, 2024

Resolves #14610

This PR introduces the write_iceberg function to the DataFrame API. write_iceberg requires a pyiceberg.table.Table object and a mode (either append or overwrite).

Note: partitioned writes are currently not supported (blocked on iceberg-python #208).

Constructing the Iceberg catalog and Table object is left outside of this function. The test (test_write_iceberg) provides an example of creating an in-memory catalog and an Iceberg Table object.

Example

import polars as pl
df = pl.DataFrame(
    {
        "foo": [1, 2, 3, 4, 5],
        "bar": [6, 7, 8, 9, 10],
        "ham": ["a", "b", "c", "d", "e"],
    }
)

table_path = "/tmp/path/to/iceberg-table/"

# create catalog and table
from pyiceberg.catalog.sql import SqlCatalog
catalog = SqlCatalog(
    "default", uri="sqlite:///:memory:", warehouse=f"file://{table_path}"
)
catalog.create_namespace("foo")
table = catalog.create_table(
    "foo.bar",
    schema=df.to_arrow().schema,
)

df.write_iceberg(table, mode="overwrite")
print(table.location())

# read back the iceberg table 
pl.scan_iceberg(table).collect()

Files created

(.venv) ➜  py-polars git:(kevinjqliu/iceberg-write) tree /tmp/path/to/iceberg-table/                  
/tmp/path/to/iceberg-table/
└── default.db
    └── table
        ├── data
        │   └── 00000-0-3a2c5080-0e03-4dc5-a21a-73b8423ae181.parquet
        └── metadata
            ├── 00000-4ed15e3e-cafb-436c-b0ec-3bd08e6079ab.metadata.json
            ├── 00001-47503a3a-c2d9-4024-8605-28e92b6448c0.metadata.json
            ├── 3cd50fc2-83c6-4e5d-a68a-c34b56418631-m0.avro
            └── snap-3013071076850462756-0-3cd50fc2-83c6-4e5d-a68a-c34b56418631.avro

4 directories, 5 files


codecov bot commented Mar 12, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 79.89%. Comparing base (dddf0b7) to head (d2f64d2).
Report is 1 commit behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main   #15018   +/-   ##
=======================================
  Coverage   79.88%   79.89%           
=======================================
  Files        1513     1513           
  Lines      203466   203471    +5     
  Branches     2892     2893    +1     
=======================================
+ Hits       162546   162555    +9     
+ Misses      40372    40369    -3     
+ Partials      548      547    -1     


@kevinliu-stripe

Waiting for the pyiceberg 0.6.1 release, which should be out soon and will fix the schema issue described above.

@mkarbo commented May 3, 2024

@kevinliu-stripe FYI, 0.6.1 has been released now :)

https://github.com/apache/iceberg-python/releases/tag/pyiceberg-0.6.1

@kevinjqliu marked this pull request as ready for review on July 2, 2024 19:04
@kevinjqliu changed the title from "[WIP] Write support for Iceberg" to "Write support for Iceberg" on Jul 2, 2024
@stinodego (Member) left a comment

Thanks for the PR. A few comments.

@@ -3834,6 +3835,32 @@ def unpack_table_name(name: str) -> tuple[str | None, str | None, str]:
        msg = f"unrecognised connection type {connection!r}"
        raise TypeError(msg)

    def write_iceberg(
        self,
        table: pyiceberg.table.Table,
Member

Why does this method take a Table and not a Pathlike, like other write functions?

Author

Writing to an Iceberg table requires creating the pyiceberg.table.Table object, which in turn requires an Iceberg catalog.
I think it's best to delegate the creation of the catalog and table to code outside of this function and just pass in the table object.

The original implementation took a str/PathLike representing the "warehouse location" for the Iceberg table. In the Iceberg world, writing to a table usually requires an update to the catalog.
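
To make that workflow concrete, here is a minimal sketch of how a caller might obtain the Table from an existing catalog before handing it to write_iceberg (assuming a catalog named "default" is already configured for pyiceberg and that the table "foo.bar" exists; both names are placeholders):

from pyiceberg.catalog import load_catalog

# resolve a configured catalog and load the existing table (placeholder names)
catalog = load_catalog("default")
table = catalog.load_table("foo.bar")

# the write then commits a new snapshot through that catalog
df.write_iceberg(table, mode="append")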

Comment on lines +3857 to +4080
data = self.to_arrow()

if mode == "append":
    table.append(data)
else:
    table.overwrite(data)
Member

This logic is so simple that this does not warrant its own method, in my opinion.

Author

The main goal is to implement the top-level write_iceberg function for the DataFrame API. The actual implementation code can be moved elsewhere.

Member

The point is that, if users already have an Iceberg table object, they can just write table.append(df.to_arrow()). It's even shorter than df.write_iceberg(table, mode="append"). So there is not much added value to a write_iceberg method.

If there is some complex logic required to set up the iceberg table, or if to_arrow is not sufficient to handle all data types correctly, we can consider adding a write_iceberg to save all users some implementation hassle. But right now it doesn't seem warranted.
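
For reference, these are the two spellings being compared (using the df and table objects from the example in the PR description):

# plain pyiceberg: convert to Arrow and append directly
table.append(df.to_arrow())

# proposed wrapper, same effect
df.write_iceberg(table, mode="append")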

@kevinjqliu (Author) commented Jul 4, 2024

That is true. PyIceberg is well integrated with Arrow. With Arrow data and a PyIceberg table object, one can just invoke the PyIceberg write functions .append/.overwrite.

Given the above, is there still value in implementing a simple write_iceberg method?

Looking at the write_delta method, aside from the merge functionality, the function just passes the data as Arrow into the write_deltalake function.
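
Roughly, that parallel looks like this (a simplified sketch; the real write_delta also handles storage options and delta-specific arguments, and the path below is a placeholder):

from deltalake import write_deltalake

# what DataFrame.write_delta reduces to for a plain write
write_deltalake("/tmp/delta-table", df.to_arrow(), mode="append")

# the polars wrapper
df.write_delta("/tmp/delta-table", mode="append")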

Author

If there is some complex logic required to set up the iceberg table

There's a scenario where a user might want to write an Iceberg table to a location in blob storage. In that case, the write_iceberg function could take care of creating an in-memory catalog and Iceberg table object before writing.

That was the initial version of this PR (commit 0801012).
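
As a sketch of that path-based variant (hypothetical; write_iceberg_to_path and the "ns.table" identifier are illustrative names, not part of this PR):

import polars as pl
from pyiceberg.catalog.sql import SqlCatalog

def write_iceberg_to_path(df: pl.DataFrame, location: str, mode: str = "append") -> None:
    # stand up a throwaway in-memory catalog pointed at the target location
    catalog = SqlCatalog("default", uri="sqlite:///:memory:", warehouse=location)
    catalog.create_namespace("ns")
    table = catalog.create_table("ns.table", schema=df.to_arrow().schema)
    # then reuse the table-based write from this PR
    df.write_iceberg(table, mode=mode)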


Seems like a write_iceberg would be worth it just by virtue of having full support from polars / parity with delta.

@stinodego changed the title from "Write support for Iceberg" to "feat(python): Add DataFrame.write_iceberg" on Jul 4, 2024
github-actions bot added labels: enhancement (New feature or an improvement of an existing feature), python (Related to Python Polars) on Jul 4, 2024
Comment on lines 174 to 180
df = pl.DataFrame(
    {
        "foo": [1, 2, 3, 4, 5],
        "bar": [6, 7, 8, 9, 10],
        "ham": ["a", "b", "c", "d", "e"],
    }
)
Member

This needs much more extensive testing. Try using the df fixture (it has a mix of many data types), or define your own dataframe with lots of different data types.
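
For illustration, a broader round-trip test could look roughly like this (a sketch, not the actual df fixture from the polars test suite; the table argument is assumed to be an empty pyiceberg Table with a matching schema):

from datetime import date

import polars as pl
from polars.testing import assert_frame_equal

def test_write_iceberg_roundtrip(table) -> None:
    df = pl.DataFrame(
        {
            "int": [1, 2, 3],
            "float": [1.5, 2.5, 3.5],
            "str": ["a", "b", "c"],
            "bool": [True, False, True],
            "date": [date(2024, 1, 1), date(2024, 1, 2), date(2024, 1, 3)],
        }
    )
    df.write_iceberg(table, mode="overwrite")
    assert_frame_equal(pl.scan_iceberg(table).collect(), df)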

Author

Thanks! I'll take a look at this fixture

@stinodego (Member) left a comment

See comments. Needs testing on more different data types before this can be merged.

@glesperance

This is great. Excited to try this out

@mkarbo commented Aug 20, 2024

Excited to try this one out!

@kevinjqliu (Author)

Hey @stinodego, thanks for the previous review. Could you take another look at this PR?

I've changed the test to use the df test fixture.
Currently, pyiceberg does not support time64[ns] or returning the dtype as Enum or Categorical. Hopefully, we can add support for those incrementally.
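
Until those types round-trip, one possible caller-side workaround (an assumption, not something this PR adds) is to cast the unsupported dtypes before writing, e.g. Categorical to String:

import polars.selectors as cs

# cast Categorical columns to plain strings before handing the data to pyiceberg
df_compat = df.with_columns(cs.categorical().cast(pl.String))
df_compat.write_iceberg(table, mode="overwrite")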

Labels: enhancement (New feature or an improvement of an existing feature), python (Related to Python Polars)
Projects: None yet

Successfully merging this pull request may close these issues: Write support for Apache Iceberg

5 participants