Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding LocalDateRange #190

Open
PeterAttardo opened this issue Mar 9, 2022 · 17 comments
Open

Adding LocalDateRange #190

PeterAttardo opened this issue Mar 9, 2022 · 17 comments
Milestone

Comments

@PeterAttardo
Copy link

Per request I am opening an issue to explain #189

My use case comes from my recent exploration of the new Kotlin Spark API. In data engineering, a lot of batch jobs are parameterized with startDate and endDate. Processing the latest date of data in a daily task runner (such as Airflow) is done by feeding the same date as both startDate and endDate; processing a backfill is done by feeding an appropriate range to (re)process.

For some jobs it is necessary to iterate over every date in the range. This could be required if the relevant data files are organized by date in the S3/HDFS/etc path (s3://some-bucket/data_type/{date}/[00..n].parquet), or if the job calls a process that is parameterized by a single date, or another reason.

As mentioned, there is a similar request for Instant Interval #34. I contemplated adding something similar in my pull request (specifically I considered doing for LocalDateTime what I did for LocalDate), but decided against it for the following reasons:

  • The pattern used by the builtin Kotlin progressions requires a default step size. Integers, Longs, and Chars all have a natural value for that step, because they are discrete. LocalDate is likewise discrete, and therefore a progression with a step size of one day is a natural fit. LocalDateTime is not discrete, and therefore there is no natural step size. Should (dateTime1..dateTime2).forEach { ... } run for every second between the two LocalDateTimes? Every hour? Every nanosecond?
  • Without Kotlin's Progression concept being a good fit for LocalDateTime, there is still a valid use case for the general range semantics. However, most of that use case is already covered by the generic ClosedRange<T : [Comparable]<T>> that Kotlin gives you out of the box for any Comparable. (dateTime1..dateTime2).contains(dateTime3) is already provided free of charge.
  • Arithmetic on LocalDateTime is not provided. The reasons for this are likely the same reasons that a LocalDateTime progression is a bad idea.

The third reason does not apply to Instant, but the first two reasons do. If others disagree that they are sufficient to preclude a use case for an InstantRange that applies Kotlin's progression semantics, I can add it to my pull request. But for now I omitted it.

@dkhalanskyjb
Copy link
Collaborator

Is it correct to say that the use case is that the API you're working with requires querying data date by date with no option to request the data for a wider range?

@PeterAttardo
Copy link
Author

That would be correct.

@leviathan-n
Copy link

I'm doing this by adding some extensions for ClosedRange<LocalDate> currently
Hope this be accepted

@dkhalanskyjb
Copy link
Collaborator

dkhalanskyjb commented Apr 3, 2024

I've browsed through GitHub to see how people are using kotlinx-datetime, and it's clear that this is an important use case (I've seen several implementations of date progressions), and we should implement this. There's a question that we need to address first, though: what constitutes a step between two dates?

  • (The approach taken in Local date range #189) A step is an arbitrary DatePeriod, where months and days can have opposite signs. This leads to some issues already solved in the PR, but also some new ones:
    • When I write date1.rangeTo(date2).step(step) and date2 is later than date1, I don't expect to receive dates earlier than date1, but this can happen;
    • DateTimePeriod(months = 1, days = -30) is considered positive, so (LocalDate(2024, 1, 31)..LocalDate(2024, 1, 30)).step(DatePeriod(months = 1, days = -30)) is empty, even though LocalDate(2024, 1, 31) + DatePeriod(months = 1, days = -30) == LocalDate(2024, 1, 30); that is, actually attempting to iterate over the range produces the same result.
  • A step is an arbitrary DatePeriod, but months and days are required to have the same sign.
    • This limits some of the use cases, but I haven't seen anyone anywhere needing them. Question to the audience: does anyone need date progressions whose steps is some number of months minus some number of days?
    • This is slightly non-idiomatic. DatePeriod(months = 12, days = -1) does seem useless, but it's unclear why passing this particular DatePeriod is forbidden when passing other DatePeriod values is fine.
    • With this, we can easily ensure that iterating never fails. (1..2).step(Int.MAX_VALUE).toList() == listOf(1), not a crash and not some value corrupted by numeric overflows; if we know that a given DatePeriod always acts in the same direction, we can ensure the same behavior for dates.
  • A multiple of DateTimeUnit.DateBased: (date1..date2).step(6, DateTimeUnit.MONTH)
    • This limits some use cases even further. Question to the audience: does anyone need date progressions where the period between dates needs to have both non-zero months and non-zero days, for example, DatePeriod(months = 1, days = 2)? My immediate assumption, given the use cases I've actually seen mentioned in this repo and implemented in the wild, is that no, no one actually needs this.
    • We can ensure that constructing ranges and progressions will never fail with an exception. This is a huge win.
    • How do we represent the step in the progression object?
      • As a DatePeriod that's guaranteed to be some multiple of one DateTimeUnit.DateBased.
        • This can actually result in numeric overflow: (date1..date2).step(Int.MAX_VALUE, DateTimeUnit.MONTH * 12), for example. However, instead of failing, we can just consider the step to be a DatePeriod(months = Int.MAX_VALUE). It's 12 times less than the value that was actually passed in, but the resulting range should be the same in all cases: either empty if date2 < date1 or listOf(date1) otherwise.
        • It's a bit strange to take out what you didn't put. That said, accessing the step is probably a very advanced use case, so this shouldn't cause much confusion in practice.
      • As two fields, like stepScalar: Int and stepUnit: DateTimeUnit.DateBased.
      • Both? stepAsPeriod, stepScalar, and stepUnit?

@PeterAttardo
Copy link
Author

Related to the choice of arbitrary DatePeriod:

I'm not hugely concerned about the first case. A range that goes backward temporarily, but forward ultimately, isn't wrong so much as it is perhaps unexpected. Given that LocalDate arithmetic, with anything other than days, is already a bit of a "here be dragons" proposition, I would expect anyone using complex or compound DatePeriods to be aware of the pitfalls already. For example the implicit rounding down of dates that don't occur in a month means that LocalDate(2001, 1, 31) + DatePeriod(months=1) + DatePeriod(months=-1) yields 2001-01-28; and the fact that arithmetic is done on months prior to days means that LocalDate(2001, 1, 31) + DatePeriod(months=-1, days=31) yields 2001-01-31 whereas if the arithmetic order were flipped it would yield 2001-02-03. Users have to be aware of the underlying assumptions that give rise to those results. Given that the temporarily-backward range is both an uncommon usecase, and one for which the library already assumes a level of user sophistication, I don't think it will present a meaningful problem.

The second case is more interesting. It did occur to me that there could be DatePeriods that behaved differently over an arbitrarily long period than they did over a shorter period, and that treating them as the former may prohibit their use in cases of the latter. It's a bit like a highway that in general trends South to North, but for short sections trends North to South. In those sections are you still heading "North" on the highway, since that's the ultimate direction? Or are you heading South, because that's the instantaneous direction? In terms of date iteration, is DatePeriod(months = 1, days = -30) positive because if you sum it enough times, it will increase? Or is it positive or negative based on the specific months and/or days it is being added to? A related question is how a comparison operation for DatePeriods would work. Does it even make mathematical sense to compare DatePeriods? As a practical matter, I don't see a way to predict these cases without actually conducting the iteration. Empty ranges could only be determined through an exhaustive iteration, which would be exceedingly computationally expensive.

I think ultimately restricting DatePeriods to having the same sign is the cleanest way to guarantee these issues are avoided, but if someone presents a usecase for having mixed signs, that these issues aren't severe enough to restrict that valid usecase.

As a matter of syntax, for step to be an infix function, it can only take a single parameter. So having the step be a DatePeriod is nicer than constructing a scalar/unit pair, or abandoning the infix notation altogether.

@PeterAttardo
Copy link
Author

Thinking about this more, there is an issue that persists in any choice that allows for a step in months (as opposed to just days, or just years*):

If one were to create the following date progression: LocalDate(2001, 1, 31)..LocalDate(2002, 1, 31) step DatePeriod(months=1), would we expect it to yield the 31st of each month, excepting months with 30 days, where it would yield the 30th, and February, where it would yield the 28th? Or would we expect it to yield the 31st of January, and then the 28th of every month thereafter (as once it rounds down to the 28th of February, it stays at the 28th of each month when it progresses from one month to the next)?

The latter is the current behavior with simple DatePeriod arithmetic, as it arises naturally from how month addition works in the library. It feels off, but likely only because I'm (mis)interpreting the progression as being equivalent to "the 31st of each month" when it is not. It may be the case that certain "of each month" progressions are best achieved by creating a stable progression (first of the month, step size of one month, for example), and then modifying each resulting date within the scope function. I don't know if that's a natural first assumption for most users though.

The more I think about it, the more I believe that any progression that allows months inherently comes with caveats. Even if we restrict the steps to only positive months, we would still have caveats like the one above. It may make sense to trend toward maximum freedom, and just put strong warnings in the docs about known quirks.

*technically this would also occur if one tried to iterate, by year, starting on the 29th of February, as one encountered non-leap years, but that's an edge case of an edge case

@dkhalanskyjb
Copy link
Collaborator

As a matter of syntax, for step to be an infix function, it can only take a single parameter. So having the step be a DatePeriod is nicer than constructing a scalar/unit pair, or abandoning the infix notation altogether.

It is okay to abandon the infix notation: it's consistent with things like LocalDate(2023, 1, 30).plus(5, DateTimeUnit.DAY), where we can't use the infix notation for plus for the same reason.

LocalDate(2001, 1, 31)..LocalDate(2002, 1, 31) step DatePeriod(months=1), would we expect it to yield the 31st of each month, excepting months with 30 days, where it would yield the 30th, and February, where it would yield the 28th? Or would we expect it to yield the 31st of January, and then the 28th of every month thereafter (as once it rounds down to the 28th of February, it stays at the 28th of each month when it progresses from one month to the next)?

That's a great point. Actually, we can implement this in a way that the last day of the month is always returned (treat this as pseudocode, I neither compiled it nor checked it for edge cases):

sequence {
  var i = 0
  while (true) {
    val nextDate = startDate.plus(i, stepUnit)
    if (nextDate > endDate) break
    yield(nextDate)
    ++i
  }
}

We only have to think carefully about what's less unintuitive / more likely to be useful.

@PeterAttardo
Copy link
Author

I think we can distill this problem down to the following question:

Which of the following accurately describes the next date in a progression?:

  1. The prior date in the progression + the step
  2. The first date in the progression + (the step * the "index" of the date)

The end-of-month edge case is one we've already uncovered where the distinction is meaningful. Are there any others?

@dkhalanskyjb
Copy link
Collaborator

  • DateTimeUnit.MONTH: this irregularity only occurs at the end of the month. If the target month contains the day-of-the-month of the source date, then the day-of-the-month is preserved.
  • DateTimeUnit.DAY: there's no distinction at all: date.plus(n, DateTimeUnit.DAY).plus(m, DateTimeUnit.DAY) == date.plus(n + m, DateTimeUnit.DAY).
  • DatePeriod: adding together two DatePeriod values seems semantically meaningless, we can't go down this path at all: addition of DatePeriod is defined to be "first add months, then, add days", so first adding DatePeriod(months = m1, days = d1) and then DatePeriod(months = m2, days = d2) yields date.plus(m1, DateTimeUnit.MONTH).plus(d1, DateTimeUnit.DAY).plus(m2, DateTimeUnit.MONTH).plus(d2, DateTimeUnit.DAY). Additions of days and months can't be swapped, so this is not the same at all as adding DatePeriod(months = m1 + m2, days = d1 + d2). Defining addition on date periods would be a stretch on its own, and including it only in date ranges would mean that one can't use the knowledge of the rest of the library to predict how they behave.

@PeterAttardo
Copy link
Author

Addition on DatePeriods is already defined, is it not? https://github.com/Kotlin/kotlinx-datetime/blob/master/core/common/src/DateTimePeriod.kt#L445

Multiplication could be defined largely the same way:

public operator fun DatePeriod.times(other: Int): DatePeriod = DatePeriod(
    safeMultiply(totalMonths, other),
    safeMultiply(days, other)
)

I think this is ultimately a question of user expectation. Which of the following is more intuitive to a user and less likely to cause confusion:

range[N] = range[N-1] + step
range[N] = range[0] + step * N

@dkhalanskyjb dkhalanskyjb added this to the 0.9.0 milestone Apr 12, 2024
@dkhalanskyjb
Copy link
Collaborator

Addition on DatePeriods is already defined, is it not?

Oh, good point, filed an issue: #381

@dkhalanskyjb
Copy link
Collaborator

I've made some time to research this a bit more. It looks like people don't actually want iteration steps other than day-based, so it's a waste of time to get stuck on how months behave. Here's the document detailing the findings and the proposed API: https://gist.github.com/dkhalanskyjb/3758db55f37cfdedc7901151f8838384 During an internal discussion with the team, everyone agreed to sticking to DateTimeUnit.DayBased for now. It's not clear at this point whether to expose the step and how to do that in a forward-compatible way (we don't want to cut off the option to support DatePeriod as the step later down the line), but this may become clearer after we implement what we know.

@PeterAttardo, are you interested in adapting your pull request to match this API, or would you like us to implement it?

@dkhalanskyjb dkhalanskyjb modified the milestones: 0.9.0, 0.7.0 Sep 2, 2024
@PeterAttardo
Copy link
Author

I'll take a crack at it. I should have time starting tomorrow, so I'll take a deeper look at it then.

@PeterAttardo
Copy link
Author

After reading through the linked gist:

I think being able to control the size of the step is a hard requirement, and as long as the steps are in days, it should be fairly straightforward to accomplish. I can think of a few ways to accomplish it:

  1. Use DatePeriod and throw a not-yet-implemented error if the month component is non-zero: date1..date2 step DatePeriod(days = 7)

    pros:

    • Allows step to be an infix function
    • Maintains future compatibility with steps of arbitrary DatePeriods

    cons:

    • Fails at runtime instead of compile time
  2. Use a raw Int/Long and assume it represents days: date1..date2 step 7

    pros:

    • Allows step to be an infix function
    • Tersest possible syntax

    cons:

    • Relies on an assumed unit instead of an explicit one
    • May be even more ambiguous and/or clash if DatePeriod or months are supported down the road
  3. Use a multi-parameter step function with an Int/Long and an instance of DayBased: (date1..date2).step(7, DateTimeUnit.DAY)

    pros:

    • Explicit in its units

    cons:

    • Cannot be used as an infix function
    • Quite verbose given that the second parameter will almost always be DateTimeUnit.DAY
  4. Use a more descriptive function name for step: date1..date2 stepInDays 7

    pros:

    • All the pros of the raw Int/Long approach, but without the ambiguity

    cons:

    • Breaks the progression convention of having the infix function be named step
  5. Introduce a DayPeriod class: date1..date2 step 7.days

    pros:

    • Most of the pros of the DatePeriod method, but with an explicit days-only guarantee
    • Extension function that converts directly from Int/Long (as seen above) can be written

    cons:

    • Creates a new class that somewhat duplicates the purpose of DatePeriod

Do we have a preference for which of the options I should go with?

@PeterAttardo
Copy link
Author

Second question:

public fun LocalDateProgression.first(): LocalDate
public fun LocalDateProgression.firstOrNull(): LocalDate?
public fun LocalDateProgression.last(): LocalDate
public fun LocalDateProgression.lastOrNull(): LocalDate?

public fun LocalDateProgression.reversed(): LocalDateProgression

These five functions from the spec are given to us for free* because LocalDateProgression is an Iterable. However, because they are generic functions on iterables, they are unaware that progressions pre-specify their end values, and therefore iterate through the entire progression in order to accomplish the latter three. Are we ok with that inefficiency, or do we want to re-implement these methods to take advantage of the fact that LocalDateProgression knows the destination before iterating?

* reversed() technically returns a List rather than a LocalDateProgression, but the method is there.

@dkhalanskyjb
Copy link
Collaborator

Do we have a preference for which of the options I should go with?

Yes, please see https://gist.github.com/dkhalanskyjb/3758db55f37cfdedc7901151f8838384#api-shape :

// in any case, we'll need these overloads
public infix fun LocalDateProgression.step(
    value: Long,
    unit: DateTimeUnit.DayBased,
): LocalDateProgression

public infix fun LocalDateProgression.step(
    value: Int,
    unit: DateTimeUnit.DayBased,
): LocalDateProgression

Consistency with plus requires that we add these functions. Maybe we could change both plus and step at a later time to improve their usage syntactically, but it's out of this issue's scope.

Are we ok with that inefficiency, or do we want to re-implement these methods to take advantage of the fact that LocalDateProgression knows the destination before iterating?

We want to re-implement them, as is done for the other ranges: see https://kotlinlang.org/api/latest/jvm/stdlib/kotlin.ranges/ for a list. By clicking (source) on the specific functions' pages (like https://kotlinlang.org/api/latest/jvm/stdlib/kotlin.ranges/first-or-null.html), you can discover how this is implemented for the other ranges. Example: https://github.com/JetBrains/kotlin/blob/eb6e6b7b526f06ff72b56de3c7949d91f84d50e6/libraries/stdlib/common/src/generated/_Ranges.kt#L75

@PeterAttardo
Copy link
Author

I have updated the pull request. Two things I figured were of note:

  1. LocalDateProgression does not expose step. Internally it tracks steps in days, but in terms of the public API I didn't want to expose that assumption while the final definition of a step remains unsettled. Other progressions expose step however, so I wanted to make note of its absence.
  2. The random() functions that take lambdas as parameters probably want to be inline, but would require the delegated intProgression to be made public in order to do so. I looked at several approaches for the random functionality, and while all were viable, they all came with some tradeoff, whether it be exposing intProgression publicly, exposing step publicly, or an un-inlined lambda parameter. In the end I prioritized cleanliness of the public API, but this might be worth revisiting once the definition of step settles.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants
@PeterAttardo @leviathan-n @dkhalanskyjb and others