Proiel parser exhibits odd behaviour with respect to punctuation #1311
Comments
Certainly this sucks, but the problem here is with the training data, and I'm not sure how we can fix it. The PROIEL dataset has zero (!) instances of either commas or periods. One thing I just found is that the Perseus dataset has commas and a period analog which appears to be halfway up the line of text compared to a US period. For example, the first sentence looks like
It would appear the XPOS tags are not remotely similar, but perhaps you could take a look to see if the general annotation quality is similar. Are the tokenization, lemmatization, dependency standards the same... we could probably mix the two if they are, or maybe you'd just get better results from switching to Perseus |
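For anyone who wants to check the punctuation counts directly, a quick pass over the CoNLL-U training files is enough. This is only a rough sketch; the file paths are whatever your local copies of the UD treebanks are called:

```python
from collections import Counter

def punct_counts(conllu_path):
    """Count the surface forms of tokens tagged PUNCT in a CoNLL-U file."""
    counts = Counter()
    with open(conllu_path, encoding="utf-8") as f:
        for line in f:
            if not line.strip() or line.startswith("#"):
                continue
            cols = line.rstrip("\n").split("\t")
            if len(cols) == 10 and cols[3] == "PUNCT":  # column 4 is UPOS
                counts[cols[1]] += 1                    # column 2 is the form
    return counts

# Hypothetical local paths to the two treebanks' training files
print(punct_counts("UD_Ancient_Greek-PROIEL/grc_proiel-ud-train.conllu"))
print(punct_counts("UD_Ancient_Greek-Perseus/grc_perseus-ud-train.conllu"))
```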
I wondered if this had been the case, indeed, which is odd given that PROIEL includes biblical edition text. I haven't looked into how feasible it would be to interconvert the treebanks and train on a mixture of both sources, or to use one of the sources as a pre-training task but not a fine-tuning task, assuming that the stanza models behave like other language models in this regard. So far I've used them as black-box algorithms. |
I am indeed now using Perseus — but especially since PROIEL is the default package in stanza for Ancient Greek, I thought this was worth noting. |
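For reference, switching is just the `package` argument when building the pipeline; a minimal sketch, with the processors and example sentence chosen purely for illustration:

```python
import stanza

# Download and load the Perseus models instead of the default PROIEL ones
stanza.download("grc", package="perseus")
nlp = stanza.Pipeline("grc", package="perseus",
                      processors="tokenize,pos,lemma,depparse")

doc = nlp("Ἐν ἀρχῇ ἦν ὁ λόγος, καὶ ὁ λόγος ἦν πρὸς τὸν θεόν.")
print([(w.text, w.upos) for s in doc.sentences for w in s.words])
```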
@AngledLuffa Reading the docs at https://stanfordnlp.github.io/stanza/new_language.html it looks like unlabelled text is only good for improving NER/Sentiment/Constituency parsing and not for any of the tasks I'm using (tokenize, lemma, POS, depparse). Is that actually the case? |
I would say that if the other annotations follow similar formalisms, they would wind up benefitting the model by giving it more words it knows about and/or examples of unusual phenomena. The small things I need to do in a short amount of time are kinda adding up, but long term I do think switching the default to Perseus and then exploring using data from both to make a "combined" model is probably the best approach here. |
I feel like in the long run it would be nice to be able to put a standard-architecture language model in there and have the stanza training script do the fine-tuning on that. I'm thinking especially of things like dbamman/latin-bert here (Latin is also a language that I need to support). |
We actually do exactly that for some languages, with https://huggingface.co/pranaydeeps/Ancient-Greek-BERT (feel free to suggest other options). If you want, I can give that a try with Ancient Greek, but again, I'm up to my ears in small things that need doing and can't really commit to doing it for a few weeks. |
Well, I could give it a whirl if you can point me at docs on how to do the fine-tuning and plumbing it into the system; this is stuff I need for work so I feel like I should at least try to contribute! I'm aware of the microbert models; they're nice and fast to train (and they're what I'm working on using for Coptic), so if they work, this would be generally applicable to most of the stanza languages. |
Basically you just need to go through the retraining instructions with the flags. I actually found in some limited experiments that finetuning the transformer itself for POS didn't help, given the complexity of the inference head we use. We've had some recent success anyway finetuning for constituency parsing or coref, either with LoRA or with careful experimentation on the finetuning method. However, the calendar for expanding that to other models is "after I get out from under this crushing TODO list" or "after I can scam an undergrad @Jemoka into doing it" |
Hello, I am that undergrad and I'd love to look into it this weekend. @pseudomonas, @AngledLuffa do you think I can be more helpful starting with—
As @AngledLuffa said, Bert support is pretty good, but I don't think it has been done for this area yet. Though, if one of the two packages works, perhaps it will be more interesting to look into training/LoRAing a Transformer on the task instead of getting a better model simply by combining the two sets. |
@Jemoka I was thinking refactoring the usage of Peft and giving it a try on the POS or depparse would both be interesting and useful, especially once we wrap up the Coref usage of Peft. Certainly, as a baseline, switching to Perseus and experimenting with a few of the above models to see which works best would give a better model for short-term usage |
Combining the treebanks seems like, if it can be done, it will provide benefits; and a BERT can presumably be added on top of that at a later point. But I don't know how compatible the annotation guidelines of the two projects are. |
@Jemoka I think in terms of improving performance longer-term across Stanza, being able to leverage BERT-integration would be good. I'm probably going to try @AngledLuffa's suggestion #1311 (comment) in any case. I'm not sure how this corresponds (either in terms of performance or in terms of mechanism) to fine-tuning a BERT to perform the task directly. |
Sounds good. @pseudomonas Feel free to start with the Bert work there, and I can start on the PEFT-a-large-model end that @AngledLuffa mentioned and do Greek POS first as a test case. And hopefully you can end up with a good model in the short term and we can release an adapter that performs even better in the long term. LMK if you run into anything with Bert tuning. |
@AngledLuffa if I'm training a model and the training is interrupted, what are the command-line flags for "resume training starting with this saved checkpoint"? |
If it's giving you the message that the model already exists, you can overwrite the existing model with |
I'll have results later this morning for the Perseus POS trained on a few different Ancient Greek transformers. I can also do the same thing for depparse, and there's even time to include those models in the upcoming 1.7.0 release. I don't have time over this weekend to build a pretrained charlm (probably from something like https://figshare.com/articles/dataset/The_Diorisis_Ancient_Greek_Corpus/6187256), but that can be an action item for later. |
@AngledLuffa I will start over the weekend on PEFT for POS and depparse, taking a hopefully good pretrained Bert as a starting point. Once you explore some Ancient Greek transformers, don't hesitate to lmk what you would recommend; I will also dig into this a little later on my own. |
it took my little computer over a day to reproduce the benchmark, so I might try running the BERT one on my work's cluster with GPUs… |
Yes, running on GPUs would make this process a lot faster; also, the upcoming work on PEFT (in theory, results/benchmarks TBD) should make inference a smidge faster because it's multiplying fewer parameters. |
So far, I would say the |
As listed above, there are a few Ancient Greek transformers available on HF. Here are the dev scores on the POS & depparse tasks
I could not use https://huggingface.co/altsoph/bert-base-ancientgreek-uncased because of this error: https://huggingface.co/altsoph/bert-base-ancientgreek-uncased/discussions/2. So based on those scores, I made the |
You will probably want to use a GPU for the
My takeaway from the rest of this thread is that there are a few separate directions for improvement still:
At any rate, I don't think any of these are immediate TODOs, so hopefully we've improved the situation enough for now and we can leave the issue open in anticipation of future improvements. |
Your baseline scores (Model==None) are rather higher than those on https://stanfordnlp.github.io/stanza/performance.html assuming that POS is XPOS rather than UPOS; that page has UPOS = 92.41, XPOS = 85.13, LAS = 73.97.
Those are test scores, these are dev scores. Didn't seem fair to pick a model based on how well it does on the test set. The POS score is a weighted combination of upos, xpos, and feats |
@AngledLuffa I wonder if, were we to train the model with our newfangled EoS punct augmentation, it would also do better even on the Perseus dataset |
I believe that is the default now
I've found a different but related issue with both the perseus and proiel parsers, which is that they perform incredibly badly with accents stripped out (they do things like processing definite articles and the most common adverbs as nouns). Is there a way of using the data augmentation that makes them tolerant of line-final punctuation to also make them tolerant of the absence of accents? My use-case for the parsers is processing manuscripts that lack accents. The code I'm using is just

```python
import unicodedata

def strip_accents(s):
    # NFD-decompose, then drop all combining marks (category Mn)
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn')
```

though this might want some refinement so that iotas subscript are randomly either removed or replaced by a normal iota. I'm also wondering whether Unicode-decomposing the data before training would help it generalise. |
I can see how that would be a problem. However, how correct will we be able to make it if we use a pretrained embedding or even a transformer? The tokens / tokenizer will have the accents as well, I would think. What about cases where multiple different texts with accents map to the same text without accents? Nevertheless, if you think it will help, I don't see any reason we can't provide a model like that using the augmentation mechanism. |
Good points! I should first try it out with one of the transformer models and see if that provides enough experience of unaccented texts to cause it to generalise. |
I can also train the entire thing with your conversion, then see how its scores are doing. If there isn't a big dropoff, then I guess no reason not to do a model with that conversion for GRC |
If you've got the time and computational resources to do that, it would certainly be appreciated! |
Minor point to be aware of: the conversion sometimes makes words completely empty in the Perseus training set. If I use just the word vectors, the model trained on no accents gets the following dev score:
Its performance on the accented version is suitably horrible:
The original does better than this:
and the original has a similar huge dropoff in quality when used on the non-accented data:
I can try training a model on a straight mix of both accented and unaccented, then see where that gets us |
Is this due to the fact that the accents have no morpheme-level representation learned by BPE (yet)? As in: the model basically treats accented versions as individual characters, and so we see catastrophic forgetting of the original embedding |
I would hazard a guess that running a Unicode decomposition before training would help it learn the relationship between accented and unaccented letters. |
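One quick way to check what the subword vocabulary actually does with accented, decomposed, and stripped forms, assuming the pranaydeeps/Ancient-Greek-BERT tokenizer mentioned earlier; this is a sketch of the check, not a claim about what it will show:

```python
import unicodedata
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("pranaydeeps/Ancient-Greek-BERT")

word = "λόγος"
for variant in (word,                                   # composed, accented
                unicodedata.normalize("NFD", word),     # decomposed
                "λογος"):                               # accents stripped
    print(repr(variant), tok.tokenize(variant))
```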
That was just using word vectors, not the various transformers |
If I train on both, there's actually a noticeable dropoff in the dev score vs. the original dataset:
I guess one thing that might help would be to use the word vectors for the words with diacritics in place of the word vectors for the words without, when the without-diacritics form doesn't have a vector of its own (see the sketch below). Would you explain a bit more why this is necessary? Under what circumstances is this relevant to the processing? Always, or is it just that some domains have this problem? Also, should we be experimenting with this for all of the annotators, not just POS? One possibility would be to provide versions of the Perseus parser w/ and w/o the unaccented words |
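A minimal sketch of that fallback, assuming the pretrained vectors are available as a plain dict from word to vector; the function and variable names here are hypothetical, and `strip_accents` is the helper from earlier in the thread:

```python
def add_unaccented_fallbacks(vectors, strip_accents):
    """For every accented word, alias its vector under the accent-stripped
    form, but only when the stripped form has no vector of its own."""
    extra = {}
    for word, vec in vectors.items():
        bare = strip_accents(word)
        if bare and bare != word and bare not in vectors:
            extra.setdefault(bare, vec)  # first accented form wins
    vectors.update(extra)
    return vectors
```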
I'm processing transcriptions of manuscripts that lack diacritics information (in most cases the manuscript lacks; in some cases the manuscript has but it has not been transcribed). There's a wrinkle in that iotas subscript are in some manuscripts omitted (so
My use case is certainly to use features beyond POS, including dependency relationships. I'm looking into whether the |
The accuracy takes enough of a hit that I wouldn't want to make this the default for general usage, but I can see making it available as an optional package. I'll run some tests on the transformer models as well. I wonder if diacritic restoration would be a worthwhile project. |
If I train (the transformer POS model) on no-diacritics, I get this:
Trained on both, I get
Trained on just the dataset with accents:
So, no matter what, there's some kind of hit in quality. Maybe the existing |
Describe the bug
If there is a comma in the parsed sentence, the PROIEL model:
a) does not tokenize the comma; it just bundles it with the preceding word. The lemma is affected similarly.
b) if the comma is space-delimited, it does unpredictable (to me!) things up to and including tagging it as a verb with a lemma of ὁράω.
The fullstop/period is correctly tokenized, but is still never identified as punctuation. There does not seem to be any POS tag corresponding to punctuation emitted by the PROIEL model; the full list of tags on parsing a corpus is
ADJ ADP ADV AUX CCONJ DET INTJ NOUN NUM PRON PROPN SCONJ VERB.

To Reproduce
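A minimal sketch of the kind of pipeline call involved (the sentence here is just illustrative; the original run was over the whole NT text):

```python
import stanza

# PROIEL is the default grc package; named explicitly here for clarity
nlp = stanza.Pipeline("grc", package="proiel",
                      processors="tokenize,pos,lemma")
doc = nlp("Ἐν ἀρχῇ ἦν ὁ λόγος, καὶ ὁ λόγος ἦν πρὸς τὸν θεόν.")
for sent in doc.sentences:
    for word in sent.words:
        print(word.text, word.upos, word.lemma)
# Per the bug report: the comma never appears as a separate PUNCT token.
```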
Results from those commas that were somehow parsed individually, from parsing the text of the Nestlé 1904 edition of the New Testament. They have been passed through `sort -u` to deduplicate them.

Expected behavior
As Perseus: commas should be tokenised separately from the preceding word; both commas and fullstops should be annotated as punctuation
Environment (please complete the following information):
1.6.1
Additional context
Running under Jupyter within PyCharm