[FEATURE] Contribution: DateTime Periodicity Encoder #415
Comments
Having something that can pick up a datetime object sure sounds practical to me. A lot of folks mentioned they were at times confused by the RadialBasis trick. Just so I understand the |
Clear. But then I have one other question: is there a reason to create both the sine and cosine columns? Why both? Also, got an exact use case for this? One concern I have with this, compared to the RepeatingBasisFunctions, is that it might only be able to describe one specific seasonality shape. |
Each component captures 50% of the information. I attempted to illustrate it in the diagram below. I don't completely understand what you mean by being able to only describe one specific seasonality? This can be applied to the different aspects I named in my description. Or is that not what you mean? |
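A minimal sketch of why both columns are needed, assuming the usual unit-circle encoding with a 24-hour cycle (the helper name `encode_hour` is made up for illustration): the sine on its own cannot tell 01:00 from 11:00, while the pair of values identifies each hour uniquely.

```python
import math

def encode_hour(hour, period=24):
    """Map an hour to a (sine, cosine) pair on the unit circle."""
    angle = 2 * math.pi * hour / period
    return math.sin(angle), math.cos(angle)

# Sine alone is ambiguous: 01:00 and 11:00 share the same sine value.
sin_1, cos_1 = encode_hour(1)
sin_11, cos_11 = encode_hour(11)
print(math.isclose(sin_1, sin_11))  # do the sines match?
print(math.isclose(cos_1, cos_11))  # do the cosines match?

# With both columns, every hour of the day maps to a distinct point.
pairs = {tuple(round(v, 9) for v in encode_hour(h)) for h in range(24)}
print(len(pairs))
```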
Let us consider the example from the docs. When you use your tool to generate features for such a dataset. Does it really help a model? I'm wondering if your features can help in situations where the shape you're trying to fit is not a "perfect sine". That's what I mean with "a specific seasonality". Most seasonal patterns that I've seen don't really fit a sine wave. It's usually something like "high in summer, zero in winter" or something else that's smoothly repeating ... but not a sine wave. |
But is this sine wave you show above not showing some other variable that moves in time? This encoding I suggest is about encoding time itself to capture information about when in a particular cycle something happened. It is a more appropriate alternative to simply extracting the "hour of day" or "day of the week" as an integer from a timestamp. Therefore, my use of this would be more for flat datasets as opposed to time series. I think indeed for time series the application is limited. A more concrete use case: a security system logs events with a particular type and we might want our model to learn which types of events occur late at night or in the wee hours of the night. 1-24 would not suffice here. |
So the thing that @tbezemer has here is an order 1 Fourier series, which is indeed not super expressive in and of itself. That said, an order 2 Fourier series can easily replicate @koaning's pattern above: Maybe it makes sense to include the order as a hyperparameter. That way it is also more in line with the RBF encoder that we already have? I would consider renaming it to |
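To make the point about the order concrete, here is a rough sketch in plain Python. The seasonal shape below is a hypothetical one ("a bump for half the cycle, zero for the rest"), not the pattern from the plot in this thread, and the grid size is arbitrary. It fits truncated Fourier series of order 1 and order 2 and compares the errors.

```python
import math

N = 360  # grid points over one full cycle (arbitrary resolution)
t = [2 * math.pi * i / N for i in range(N)]
# Hypothetical seasonal shape: a sine bump for half the cycle, zero otherwise.
y = [max(0.0, math.sin(ti)) for ti in t]

def fourier_fit(y, t, order):
    """Least-squares fit of a truncated Fourier series on a uniform full-cycle grid.

    On such a grid the sine/cosine columns are orthogonal, so each
    coefficient can be computed by a simple projection.
    """
    n = len(y)
    approx = [sum(y) / n] * n  # constant term
    for k in range(1, order + 1):
        a_k = 2 / n * sum(yi * math.cos(k * ti) for yi, ti in zip(y, t))
        b_k = 2 / n * sum(yi * math.sin(k * ti) for yi, ti in zip(y, t))
        approx = [ap + a_k * math.cos(k * ti) + b_k * math.sin(k * ti)
                  for ap, ti in zip(approx, t)]
    return approx

def rmse(y, approx):
    return math.sqrt(sum((yi - ai) ** 2 for yi, ai in zip(y, approx)) / len(y))

rmse_order_1 = rmse(y, fourier_fit(y, t, order=1))
rmse_order_2 = rmse(y, fourier_fit(y, t, order=2))
print(rmse_order_1, rmse_order_2)
```

For this particular shape the order 2 fit has a clearly smaller error than the order 1 fit, which is the argument for exposing the order as a hyperparameter.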
@MBrouns that was indeed the direction I was thinking of. Doing that seems very sensible and I'd certainly welcome a PR with that feature. |
There seems to be a bit of radio silence. @tbezemer are you still interested in implementing this? |
Yes, definitely! I chatted with Matthijs about this a week ago. It's high up on my to do list. To be continued. |
I have extended the transformer to accept a non-zero, positive integer parameter So for the aspect 'hour', for n_periods = 2 of a datetime with 'hour' component '1am', the respective calculations will be as follows: Is this what you meant? |
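A sketch of what that calculation could look like for the 1am example, assuming each period `p` shortens the 24-hour cycle to `24 / p` (the column names are made up for illustration):

```python
import math

hour = 1          # the '1am' example
base_period = 24  # hours in a day
n_periods = 2

features = {}
for p in range(1, n_periods + 1):
    period = base_period / p  # 24 for p=1, 12 for p=2
    angle = 2 * math.pi * hour / period
    features[f"hour_sin_{p}"] = math.sin(angle)
    features[f"hour_cos_{p}"] = math.cos(angle)

for name, value in features.items():
    print(f"{name}: {value:.4f}")
```

Under that assumption, 1am gives roughly (0.2588, 0.9659) for p = 1 and (0.5000, 0.8660) for p = 2.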
Both of you since you seemed to be on the same page about extending the transformer in this way, but @MBrouns and I specifically discussed this part, so perhaps it is easier for him to weigh in on this? |
I think I'm cool with having a transformer for periodicity but before you start the PR we can save a lot of review time if we can discuss the signatures of the transformer here first. Mainly on my end; @tbezemer could you describe the full input of the object? Maybe list a few examples that demonstrate the main use cases? Once we're agreed on that, the implementation should be very straightforward. |
Certainly! This transformer allows a user to decompose timestamps into their sine and cosine components for each
def __init__(self, aspects=None, n_periods=1):
    """- aspects can be one of: ["second", "minute", "hour", "weekday", "day", "month"].
    If None specified, the whole list is used.
    - n_periods is a non-zero and positive integer. For each period 'p' from 1 to n_periods,
    a new set of sine/cosine transformations is produced for each of the passed aspects,
    having periodicity aspect_periodicity / p"""
def fit(self, X, y=None):
    """Fit function only sets trailing underscore variables and saves shape of X
    (e.g. self.n_periods_ and self.aspects_)"""
def transform(self, X, y=None):
    """- Where X is an np.array. Otherwise assumes a pd.DataFrame and tries to extract X.values
    - X is then checked for conformity to the expected datatype: datetime64.
    - Applies an extractPeriodicFeatures(...) function to each column in the array.
    - returns transformed X where X consists of pairs of sine/cosine transformations
    for each aspect, for each period from 1 to self.n_periods_"""
Use case: In my opinion, the added value of this transformer is:
Let me know if you need more information! |
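For discussion purposes, the signature above could behave roughly like the following stand-in. This is only a sketch, not the actual implementation: it uses plain `datetime` objects and lists instead of NumPy/scikit-learn, the class name `PeriodicitySketch` is hypothetical, and the cycle lengths (including treating every month as 31 days) are assumptions.

```python
import math
from datetime import datetime

# Assumed cycle lengths and zero-based extractors per aspect (not from the thread).
ASPECTS = {
    "second": (60, lambda d: d.second),
    "minute": (60, lambda d: d.minute),
    "hour": (24, lambda d: d.hour),
    "weekday": (7, lambda d: d.weekday()),
    "day": (31, lambda d: d.day - 1),  # simplification: every month has 31 days
    "month": (12, lambda d: d.month - 1),
}

class PeriodicitySketch:
    """Hypothetical stand-in mirroring the proposed signature, without scikit-learn."""

    def __init__(self, aspects=None, n_periods=1):
        self.aspects = aspects
        self.n_periods = n_periods

    def fit(self, X, y=None):
        # Resolve defaults at fit time; nothing is learned from the data.
        self.aspects_ = list(ASPECTS) if self.aspects is None else list(self.aspects)
        self.n_columns_ = len(X[0])
        return self

    def transform(self, X, y=None):
        out = []
        for row in X:
            encoded = []
            for dt in row:
                for aspect in self.aspects_:
                    length, extract = ASPECTS[aspect]
                    for p in range(1, self.n_periods + 1):
                        # Period p has cycle length `length / p`.
                        angle = 2 * math.pi * extract(dt) * p / length
                        encoded += [math.sin(angle), math.cos(angle)]
            out.append(encoded)
        return out

X = [[datetime(2020, 1, 1, 1, 0)], [datetime(2020, 1, 1, 23, 30)]]
encoder = PeriodicitySketch(aspects=["hour", "minute"], n_periods=2).fit(X)
Xt = encoder.transform(X)
print(len(Xt), len(Xt[0]))  # rows, then aspects * periods * 2 features per row
```

Each input column expands into `len(aspects) * n_periods * 2` output columns, which is worth keeping in mind when passing the full aspect list with a larger `n_periods`.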
A few points on my end.
|
Ad 1. Yes, indeed! You can pass a list as well. Ad 2. We can definitely change Ad 3. Yeah, that is definitely another way of doing it. I thought that the sklearn convention was to only use trailing underscore variables in the transform step, to ensure that attributes have not changed since first fit, but I see how copying them into a differently named attribute seems a bit redundant. I can change that as per your suggestion! Ad 4. I hope this image makes it clearer. Forgive my poor handwriting :) . @MBrouns suggested this to me to also allow for higher frequency effects within the total period of each aspect. That, or maybe I horribly misunderstood what he meant, haha. |
I've implemented a DateTimePeriodicityEncoder. It is a scikit-learn encoder for datetime features that uses sine and cosine transformations to capture periodicity in datetimes. This type of transformation ensures that an algorithm can learn that 23 hours is close to 00 hours, minute 59 is close to minute 0, etc. It can be used to capture different "aspects" of a datetime (e.g. minute-in-hour, hour-of-day, day-of-week, day-of-month) as such:
For each of the aspects, it returns two new columns containing the respective sine and cosine transformations.
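As a small illustration of the "23 hours is close to 00 hours" claim, here is a sketch in plain Python (the helper `encode` is made up and is not the encoder's API) that compares distances on the raw integers with distances between the sine/cosine pairs:

```python
import math

def encode(value, period):
    """Return the (sine, cosine) pair for a position within a cycle."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)

def distance(a, b):
    return math.dist(a, b)  # Euclidean distance (Python 3.8+)

late, midnight, noon = encode(23, 24), encode(0, 24), encode(12, 24)
print(abs(23 - 0))               # raw integers: 23 and 0 look far apart
print(distance(late, midnight))  # encoded: 23:00 sits next to 00:00
print(distance(noon, midnight))  # encoded: noon is the farthest point from midnight
```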
I have written unit tests and it passes the scikit-learn check_estimator (with some tags). @MBrouns asked me to create an issue and tag you, @koaning, to see if this could be a useful contribution for scikit-lego. If so, I can submit a pull request.