Handling unigram and bigram features at the same time in word2features #137

Open
AbhishekBose opened this issue Dec 24, 2021 · 0 comments

Hello,
I am trying to perform an NER experiment on a custom dataset containing a lot of food items.
I have labels for certain unigrams and bigrams for my training data.

My label corpus contains "green chilli" = "vegetable", but it has no entry for "chilli" on its own.
I am using this label list to annotate sentences for NER.

For example:

A sentence might contain a bigram such as "green chilli", with its associated label = "vegetable".

Currently while generating the features, I am marking both "green" and "chilli" as "vegetable".
My annotation pipeline is as follows:

  • Split the sentence into unigrams
  • Check whether the unigram exists in the label list; if it does, mark the unigram with that label
  • Build a bigram from token + sentence[idx+1] or token + sentence[idx-1]
  • Check whether the bigram exists in the label list; if it does, mark both the token and sentence[idx+1] (or sentence[idx-1]) with that label

As a result of the fourth step, both "green" and "chilli" get marked as "vegetable".
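To make the behaviour concrete, here is a minimal sketch of the annotation steps described above. The function name `annotate`, the dictionary `label_corpus`, and the default label "O" for unlabelled tokens are assumptions made for illustration; they are not taken from my actual code or from the library.

```python
def annotate(tokens, label_corpus, default="O"):
    """Label tokens by unigram lookup, then by bigram lookup.

    `label_corpus` maps a unigram or a space-joined bigram to a label.
    `default` is a hypothetical placeholder for tokens without a label.
    """
    labels = [default] * len(tokens)

    # Pass 1: label every unigram that appears in the label list.
    for idx, token in enumerate(tokens):
        if token in label_corpus:
            labels[idx] = label_corpus[token]

    # Pass 2: label both tokens of every bigram that appears in the label list.
    # Pairing each token with its right neighbour also covers the
    # "token + sentence[idx-1]" case, since every adjacent pair is visited once.
    for idx in range(len(tokens) - 1):
        bigram = f"{tokens[idx]} {tokens[idx + 1]}"
        if bigram in label_corpus:
            labels[idx] = labels[idx + 1] = label_corpus[bigram]

    return labels


label_corpus = {"green chilli": "vegetable"}
tokens = "add one green chilli".split()
labels = annotate(tokens, label_corpus)
print(list(zip(tokens, labels)))
```

With this label list, "green" and "chilli" each receive "vegetable" as a separate token-level label, which is the duplication I am asking about.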

So when I train my model and run inference on a test sentence containing "green chilli", I get "vegetable" twice, once for each token, rather than a single label for the whole bigram.

What would be the best way to annotate this using word2features?
