Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add float list features in ItemSequence #38

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

napsternxg
Copy link

@napsternxg napsternxg commented Jun 10, 2016

  • New class FloatFeatures defined in pycrfsuite._float_features
  • Useful for adding word embedding features
  • Wrap the word embedding list of float values with FloatFeatures class
  • The api supports the existing extraction of features from nested
    dicts.
  • Added test case in tests/test_itemsequence.py

Example usage:

import pycrfsuite
from pycrfsuite._float_features import FloatFeatures as FF
seq = pycrfsuite.ItemSequence([
  {"w2v": FF([1., 2., 3.])},
  {"w2v": FF([-1., 5, 4.])}
])

Fixes #37
Closes #39

- New class FloatFeatures defined in `pycrfsuite._float_features`
- Useful for adding word embedding features
- Wrap the word embedding list of float values with FloatFeatures class
- The api supports the existing extraction of features from nested
  dicts.
- Added test case in `tests/test_itemsequence.py`

Example usage:
```
import pycrfsuite
from pycrfsuite._float_features import FloatFeatures as FF
seq = pycrfsuite.ItemSequence([
  {"w2v": FF([1., 2., 3.])},
  {"w2v": FF([-1., 5, 4.])}
])
```
@kmike
Copy link
Member

kmike commented Jun 10, 2016

Thanks for taking look at it!

I think it'd be nice to support numpy arrays and/or stdlib arrays instead of creating a custom wrapper, because this is a format you usually get embeddings in in a first place. If user has a list of floats it is easy to convert to array.array if needed.

numpy support is a bit more problematic because it may require a numpy dependency, which I'd like to avoid in python-crfsuite, but maybe it is possible to implement it somehow using memoryviews.

Also, could you please add generated cpp files in a separate commit, or just remove them? Travis tests should pass even without generated cpp files because they are generated on travis before each test run.

@napsternxg
Copy link
Author

@kmike I was also thinking of using numpy array but decided against it once I realized numpy was not a dependency in crfsuite. I don't think using numpy array will be beneficial as everything eventually gets mapped to the ItemSequence object using the crfsuite_api.

My rationale behind using the custom wrapper was to allow for the only other possible feature which CRFSuite can take and that would be list of floats. If there is any other type of list then it should always be incompatible. This explicit mapping will allow for word vectors and other vector embedding (say thought vectors from encoder-decoder) to be used as a native type in CRFSuite. I might be wrong here but this is what I could understand from my brief overview of python-CRFSuite code.

Array.array sounds like a good idea (I have never used these), but I am not sure if that would work as the user will have to eventually wrap their arrays into array.array.

I didn't know about not generating CPP files. I think if that was the case then the CPP files should be in .gitignore, what do you think ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants