Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement Spark-compatible CAST from String to Decimal #325

Open
andygrove opened this issue Apr 25, 2024 · 5 comments · May be fixed by #615
Open

Implement Spark-compatible CAST from String to Decimal #325

andygrove opened this issue Apr 25, 2024 · 5 comments · May be fixed by #615
Labels
enhancement New feature or request good first issue Good for newcomers help wanted Extra attention is needed

Comments

@andygrove
Copy link
Member

What is the problem the feature request solves?

What is the problem the feature request solves?

We currently delegate to DataFusion when casting from string to decimal and there are some differences in behavior compared to Spark.

  • An input of 4e7 produces 40000000.00 in Spark, and null in DataFusion
  • Inputs of ., -, + and empty string produce null in Spark, and 0.0 in DataFusion
  • Input of 0 produces 0 in Spark, and null in DataFusion
  • Arrow-rs does not support negative scale (Cannot cast string to decimal with negative scale). We could choose to fallback to Spark for this use case (or if SQLConf.LEGACY_ALLOW_NEGATIVE_SCALE_OF_DECIMAL_ENABLED is enabled)

Describe the potential solution

No response

Additional context

I used the following test in CometCastSuite to explore this.

  test("cast string to decimal") {
    val values = generateStrings(numericPattern, 5).toDF("a")
    castTest(values, DataTypes.createDecimalType(10, 2))
    castTest(values, DataTypes.createDecimalType(10, 0))
    withSQLConf((SQLConf.LEGACY_ALLOW_NEGATIVE_SCALE_OF_DECIMAL_ENABLED.key, "true")) {
      castTest(values, DataTypes.createDecimalType(10, -2))
    }
  }

Describe the potential solution

No response

Additional context

No response

@andygrove andygrove added the enhancement New feature or request label Apr 25, 2024
@andygrove andygrove added good first issue Good for newcomers help wanted Extra attention is needed labels Apr 25, 2024
@kevinmingtarja
Copy link

Hi, I'd like to contribute to this!

@viirya
Copy link
Member

viirya commented Apr 26, 2024

Thanks @kevinmingtarja.

You can take a look at @andygrove's PR as a reference #307

@kevinmingtarja
Copy link

Current state for reference:

scala> import org.apache.spark.sql.types._
import org.apache.spark.sql.types._

scala> val inputs = Seq("", "0", "1", "+1.0", ".34", "-10.0", "4e7").toDF("n")
inputs: org.apache.spark.sql.DataFrame = [n: string]

scala> inputs.write.mode("overwrite").parquet("test.parquet")
24/04/28 11:46:51 INFO src/lib.rs: Comet native library initialized
                                                                                
scala> val df = spark.read.parquet("test.parquet")
df: org.apache.spark.sql.DataFrame = [n: string]

scala> val df2 = df.withColumn("converted", col("n").cast(DataTypes.createDecimalType(10, 2)))
df2: org.apache.spark.sql.DataFrame = [n: string, converted: decimal(10,2)]

scala> df2.show
+-----+---------+
|    n|converted|
+-----+---------+
|-10.0|   -10.00|
| +1.0|     1.00|
|  .34|     0.34|
|  4e7|     null|
|    1|     1.00|
|    0|     0.00|
|     |     0.00|
+-----+---------+


scala> spark.conf.set("spark.comet.enabled", false)

scala> df2.show
+-----+-----------+
|    n|  converted|
+-----+-----------+
|-10.0|     -10.00|
| +1.0|       1.00|
|  .34|       0.34|
|  4e7|40000000.00|
|    1|       1.00|
|    0|       0.00|
|     |       null|
+-----+-----------+

Note: I encountered a java.lang.AssertionError: assertion failed: Decimal$DecimalIsFractional on the second df2.show, but it seems like a known issue and can safely be ignored: https://kb.databricks.com/scala/decimal-is-fractional-error

@sujithjay
Copy link
Contributor

Hi @kevinmingtarja, are you working on this issue? If not, I would like to work on it. Thank you.

@kevinmingtarja
Copy link

Hi @kevinmingtarja, are you working on this issue? If not, I would like to work on it. Thank you.

Hey, i don't think i have the bandwidth rn to complete this, so please feel free to work on it. I have made some progress here on a branch in my fork, so feel free to take inspirations from there as well if needed!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request good first issue Good for newcomers help wanted Extra attention is needed
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants