{"payload":{"feedbackUrl":"https://github.com/orgs/community/discussions/53140","repo":{"id":162150877,"defaultBranch":"main","name":"robotoff","ownerLogin":"openfoodfacts","currentUserCanPush":false,"isFork":false,"isEmpty":false,"createdAt":"2018-12-17T15:25:27.000Z","ownerAvatar":"https://avatars.githubusercontent.com/u/1937790?v=4","public":true,"private":false,"isOrgOwned":true},"refInfo":{"name":"","listCacheKey":"v0:1726647028.0","currentOid":""},"activityList":{"items":[{"before":"043a5dbee4cb7bfc69bf9c89a176bbcb4f83f035","after":"eb580767ac6919249e081ffbc34a213d787acc40","ref":"refs/heads/gh-pages","pushedAt":"2024-09-18T08:11:04.000Z","pushType":"push","commitsCount":1,"pusher":{"login":"github-actions[bot]","name":null,"path":"/apps/github-actions","primaryAvatarUrl":"https://avatars.githubusercontent.com/in/15368?s=80&v=4"},"commit":{"message":"Deploying to gh-pages from @ openfoodfacts/robotoff@64d83f4ccae110c38d39bf2dc2398d8bd72fc027 🚀","shortMessageHtmlLink":"Deploying to gh-pages from @ <a class=\"commit-link\" data-hovercard-type=\"commit\" data-hovercard-url=\"https://github.com/openfoodfacts/robotoff/commit/64d83f4ccae110c38d39bf2dc2398d8bd72fc027/hovercard\" href=\"https://github.com/openfoodfacts/robotoff/commit/64d83f4ccae110c38d39bf2dc2398d8bd72fc027\"><tt>64d83f4</tt></a> 🚀"}},{"before":"055ce19b50cdaa00810b9edfa469c017505c45a3","after":null,"ref":"refs/heads/release-please--branches--main--components--robotoff","pushedAt":"2024-09-18T08:10:00.000Z","pushType":"branch_deletion","commitsCount":0,"pusher":{"login":"raphael0202","name":"Raphaël Bournhonesque","path":"/raphael0202","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/9609923?s=80&v=4"}},{"before":"2748324b40aab6f5c11d31e03d24ac31de93e1af","after":"64d83f4ccae110c38d39bf2dc2398d8bd72fc027","ref":"refs/heads/main","pushedAt":"2024-09-18T08:09:59.000Z","pushType":"pr_merge","commitsCount":1,"pusher":{"login":"raphael0202","name":"Raphaël Bournhonesque","path":"/raphael0202","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/9609923?s=80&v=4"},"commit":{"message":"chore(main): release 1.52.1 (#1416)","shortMessageHtmlLink":"chore(main): release 1.52.1 (<a class=\"issue-link js-issue-link\" data-error-text=\"Failed to load title\" data-id=\"2532990987\" data-permission-text=\"Title is private\" data-url=\"https://github.com/openfoodfacts/robotoff/issues/1416\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/openfoodfacts/robotoff/pull/1416/hovercard\" href=\"https://github.com/openfoodfacts/robotoff/pull/1416\">#1416</a>)"}},{"before":"9ac0306f7f12708b967bc85bc1b73227f1376ee0","after":"055ce19b50cdaa00810b9edfa469c017505c45a3","ref":"refs/heads/release-please--branches--main--components--robotoff","pushedAt":"2024-09-18T07:58:03.000Z","pushType":"force_push","commitsCount":0,"pusher":{"login":"openfoodfacts-bot","name":"Open Food Facts Bot","path":"/openfoodfacts-bot","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/119524357?s=80&v=4"},"commit":{"message":"chore(main): release 1.52.1","shortMessageHtmlLink":"chore(main): release 1.52.1"}},{"before":"9a09729406c92947633cf3d0157904839533dc46","after":"2748324b40aab6f5c11d31e03d24ac31de93e1af","ref":"refs/heads/main","pushedAt":"2024-09-18T07:57:32.000Z","pushType":"pr_merge","commitsCount":1,"pusher":{"login":"raphael0202","name":"Raphaël Bournhonesque","path":"/raphael0202","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/9609923?s=80&v=4"},"commit":{"message":"docs: :memo: Batch Job - Spellcheck documentation (#1408)\n\n* docs: :memo: Google Batch Job in Robotoff doc\r\n\r\n* fix: :art: Typos\r\n\r\n* docs: :memo: Spellcheck documentation in Robotoff (WIP)\r\n\r\n* docs: :memo: Ingredients Spellcheck Robotoff\r\n\r\n* docs: :memo: Spellcheck Robotoff\r\n\r\n* docs: :memo: Spellcheck Robotoff\r\n\r\n* docs: :sparkles: Spellcheck Robotoff","shortMessageHtmlLink":"docs: 📝 Batch Job - Spellcheck documentation (<a class=\"issue-link js-issue-link\" data-error-text=\"Failed to load title\" data-id=\"2508402771\" data-permission-text=\"Title is private\" data-url=\"https://github.com/openfoodfacts/robotoff/issues/1408\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/openfoodfacts/robotoff/pull/1408/hovercard\" href=\"https://github.com/openfoodfacts/robotoff/pull/1408\">#1408</a>)"}},{"before":"cc5e272c369c1c0871d999e96349993f2a0269e0","after":"043a5dbee4cb7bfc69bf9c89a176bbcb4f83f035","ref":"refs/heads/gh-pages","pushedAt":"2024-09-18T07:50:44.000Z","pushType":"push","commitsCount":1,"pusher":{"login":"github-actions[bot]","name":null,"path":"/apps/github-actions","primaryAvatarUrl":"https://avatars.githubusercontent.com/in/15368?s=80&v=4"},"commit":{"message":"Deploying to gh-pages from @ openfoodfacts/robotoff@9a09729406c92947633cf3d0157904839533dc46 🚀","shortMessageHtmlLink":"Deploying to gh-pages from @ <a class=\"commit-link\" data-hovercard-type=\"commit\" data-hovercard-url=\"https://github.com/openfoodfacts/robotoff/commit/9a09729406c92947633cf3d0157904839533dc46/hovercard\" href=\"https://github.com/openfoodfacts/robotoff/commit/9a09729406c92947633cf3d0157904839533dc46\"><tt>9a09729</tt></a> 🚀"}},{"before":"9a09729406c92947633cf3d0157904839533dc46","after":"9ac0306f7f12708b967bc85bc1b73227f1376ee0","ref":"refs/heads/release-please--branches--main--components--robotoff","pushedAt":"2024-09-18T07:50:25.000Z","pushType":"push","commitsCount":1,"pusher":{"login":"openfoodfacts-bot","name":"Open Food Facts Bot","path":"/openfoodfacts-bot","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/119524357?s=80&v=4"},"commit":{"message":"chore(main): release 1.52.1","shortMessageHtmlLink":"chore(main): release 1.52.1"}},{"before":null,"after":"9a09729406c92947633cf3d0157904839533dc46","ref":"refs/heads/release-please--branches--main--components--robotoff","pushedAt":"2024-09-18T07:50:24.000Z","pushType":"branch_creation","commitsCount":0,"pusher":{"login":"openfoodfacts-bot","name":"Open Food Facts Bot","path":"/openfoodfacts-bot","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/119524357?s=80&v=4"},"commit":{"message":"chore(deps): bump vllm from 0.5.4 to 0.5.5 in /batch/spellcheck\n\nBumps [vllm](https://github.com/vllm-project/vllm) from 0.5.4 to 0.5.5.\n- [Release notes](https://github.com/vllm-project/vllm/releases)\n- [Commits](https://github.com/vllm-project/vllm/compare/v0.5.4...v0.5.5)\n\n---\nupdated-dependencies:\n- dependency-name: vllm\n  dependency-type: direct:production\n...\n\nSigned-off-by: dependabot[bot] <support@github.com>","shortMessageHtmlLink":"chore(deps): bump vllm from 0.5.4 to 0.5.5 in /batch/spellcheck"}},{"before":"7cefeef6a4beba301c763b1e178227999f9dc695","after":null,"ref":"refs/heads/dependabot/pip/batch/spellcheck/vllm-0.5.5","pushedAt":"2024-09-18T07:49:56.000Z","pushType":"branch_deletion","commitsCount":0,"pusher":{"login":"raphael0202","name":"Raphaël Bournhonesque","path":"/raphael0202","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/9609923?s=80&v=4"}},{"before":"69e8f59d4b69647c460e5984445baf424c467ed2","after":"9a09729406c92947633cf3d0157904839533dc46","ref":"refs/heads/main","pushedAt":"2024-09-18T07:49:55.000Z","pushType":"pr_merge","commitsCount":1,"pusher":{"login":"raphael0202","name":"Raphaël Bournhonesque","path":"/raphael0202","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/9609923?s=80&v=4"},"commit":{"message":"chore(deps): bump vllm from 0.5.4 to 0.5.5 in /batch/spellcheck\n\nBumps [vllm](https://github.com/vllm-project/vllm) from 0.5.4 to 0.5.5.\n- [Release notes](https://github.com/vllm-project/vllm/releases)\n- [Commits](https://github.com/vllm-project/vllm/compare/v0.5.4...v0.5.5)\n\n---\nupdated-dependencies:\n- dependency-name: vllm\n  dependency-type: direct:production\n...\n\nSigned-off-by: dependabot[bot] <support@github.com>","shortMessageHtmlLink":"chore(deps): bump vllm from 0.5.4 to 0.5.5 in /batch/spellcheck"}},{"before":null,"after":"7cefeef6a4beba301c763b1e178227999f9dc695","ref":"refs/heads/dependabot/pip/batch/spellcheck/vllm-0.5.5","pushedAt":"2024-09-17T21:35:08.000Z","pushType":"branch_creation","commitsCount":0,"pusher":{"login":"dependabot[bot]","name":null,"path":"/apps/dependabot","primaryAvatarUrl":"https://avatars.githubusercontent.com/in/29110?s=80&v=4"},"commit":{"message":"chore(deps): bump vllm from 0.5.4 to 0.5.5 in /batch/spellcheck\n\nBumps [vllm](https://github.com/vllm-project/vllm) from 0.5.4 to 0.5.5.\n- [Release notes](https://github.com/vllm-project/vllm/releases)\n- [Commits](https://github.com/vllm-project/vllm/compare/v0.5.4...v0.5.5)\n\n---\nupdated-dependencies:\n- dependency-name: vllm\n  dependency-type: direct:production\n...\n\nSigned-off-by: dependabot[bot] <support@github.com>","shortMessageHtmlLink":"chore(deps): bump vllm from 0.5.4 to 0.5.5 in /batch/spellcheck"}},{"before":"18b36c8e0c672f199d9f68acffb62a6e61102ef3","after":"69e8f59d4b69647c460e5984445baf424c467ed2","ref":"refs/heads/main","pushedAt":"2024-09-17T15:11:18.000Z","pushType":"push","commitsCount":1,"pusher":{"login":"raphael0202","name":"Raphaël Bournhonesque","path":"/raphael0202","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/9609923?s=80&v=4"},"commit":{"message":"deps: bump transformers to version 4.44.2","shortMessageHtmlLink":"deps: bump transformers to version 4.44.2"}},{"before":"b37b692ea81ca3b2645b8b10faf58ae1eb7bb3a7","after":null,"ref":"refs/heads/release-please--branches--main--components--robotoff","pushedAt":"2024-09-17T14:19:56.000Z","pushType":"branch_deletion","commitsCount":0,"pusher":{"login":"raphael0202","name":"Raphaël Bournhonesque","path":"/raphael0202","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/9609923?s=80&v=4"}},{"before":"e3664cadbb27d105996054d1a8dd9adacd75d563","after":"18b36c8e0c672f199d9f68acffb62a6e61102ef3","ref":"refs/heads/main","pushedAt":"2024-09-17T14:19:53.000Z","pushType":"pr_merge","commitsCount":1,"pusher":{"login":"raphael0202","name":"Raphaël Bournhonesque","path":"/raphael0202","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/9609923?s=80&v=4"},"commit":{"message":"chore(main): release 1.52.0","shortMessageHtmlLink":"chore(main): release 1.52.0"}},{"before":"4d1cb565487ef473f6bdfb173ef7e93c4e584529","after":"cc5e272c369c1c0871d999e96349993f2a0269e0","ref":"refs/heads/gh-pages","pushedAt":"2024-09-17T14:16:34.000Z","pushType":"push","commitsCount":1,"pusher":{"login":"github-actions[bot]","name":null,"path":"/apps/github-actions","primaryAvatarUrl":"https://avatars.githubusercontent.com/in/15368?s=80&v=4"},"commit":{"message":"Deploying to gh-pages from @ openfoodfacts/robotoff@e3664cadbb27d105996054d1a8dd9adacd75d563 🚀","shortMessageHtmlLink":"Deploying to gh-pages from @ <a class=\"commit-link\" data-hovercard-type=\"commit\" data-hovercard-url=\"https://github.com/openfoodfacts/robotoff/commit/e3664cadbb27d105996054d1a8dd9adacd75d563/hovercard\" href=\"https://github.com/openfoodfacts/robotoff/commit/e3664cadbb27d105996054d1a8dd9adacd75d563\"><tt>e3664ca</tt></a> 🚀"}},{"before":"dfa2614f8a81cd2326c83dc332054453a4417e4e","after":"e3664cadbb27d105996054d1a8dd9adacd75d563","ref":"refs/heads/main","pushedAt":"2024-09-17T14:06:24.000Z","pushType":"push","commitsCount":1,"pusher":{"login":"raphael0202","name":"Raphaël Bournhonesque","path":"/raphael0202","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/9609923?s=80&v=4"},"commit":{"message":"tests: fix integration tests","shortMessageHtmlLink":"tests: fix integration tests"}},{"before":"73fa8b1ed59c495b719dc1e2358bf6fb07928660","after":"b37b692ea81ca3b2645b8b10faf58ae1eb7bb3a7","ref":"refs/heads/release-please--branches--main--components--robotoff","pushedAt":"2024-09-17T13:55:36.000Z","pushType":"force_push","commitsCount":0,"pusher":{"login":"openfoodfacts-bot","name":"Open Food Facts Bot","path":"/openfoodfacts-bot","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/119524357?s=80&v=4"},"commit":{"message":"chore(main): release 1.52.0","shortMessageHtmlLink":"chore(main): release 1.52.0"}},{"before":"459f6f4689b5e0d5aa0d29a191d8b3fc92af013d","after":"dfa2614f8a81cd2326c83dc332054453a4417e4e","ref":"refs/heads/main","pushedAt":"2024-09-17T13:55:05.000Z","pushType":"push","commitsCount":2,"pusher":{"login":"raphael0202","name":"Raphaël Bournhonesque","path":"/raphael0202","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/9609923?s=80&v=4"},"commit":{"message":"fix: bump ingredient detection model version","shortMessageHtmlLink":"fix: bump ingredient detection model version"}},{"before":"bd85443bebea9c20406894585cc747efdee0a97c","after":"4d1cb565487ef473f6bdfb173ef7e93c4e584529","ref":"refs/heads/gh-pages","pushedAt":"2024-09-16T16:03:27.000Z","pushType":"push","commitsCount":1,"pusher":{"login":"github-actions[bot]","name":null,"path":"/apps/github-actions","primaryAvatarUrl":"https://avatars.githubusercontent.com/in/15368?s=80&v=4"},"commit":{"message":"Deploying to gh-pages from @ openfoodfacts/robotoff@459f6f4689b5e0d5aa0d29a191d8b3fc92af013d 🚀","shortMessageHtmlLink":"Deploying to gh-pages from @ <a class=\"commit-link\" data-hovercard-type=\"commit\" data-hovercard-url=\"https://github.com/openfoodfacts/robotoff/commit/459f6f4689b5e0d5aa0d29a191d8b3fc92af013d/hovercard\" href=\"https://github.com/openfoodfacts/robotoff/commit/459f6f4689b5e0d5aa0d29a191d8b3fc92af013d\"><tt>459f6f4</tt></a> 🚀"}},{"before":"5d83ee27f3b171328003e2f1aa65e0375f2311df","after":"73fa8b1ed59c495b719dc1e2358bf6fb07928660","ref":"refs/heads/release-please--branches--main--components--robotoff","pushedAt":"2024-09-16T16:03:08.000Z","pushType":"force_push","commitsCount":0,"pusher":{"login":"openfoodfacts-bot","name":"Open Food Facts Bot","path":"/openfoodfacts-bot","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/119524357?s=80&v=4"},"commit":{"message":"chore(main): release 1.52.0","shortMessageHtmlLink":"chore(main): release 1.52.0"}},{"before":"6532593fe263a2488485e63cc71121b58f3e6044","after":null,"ref":"refs/heads/update-ingredient-detection-model","pushedAt":"2024-09-16T16:02:39.000Z","pushType":"branch_deletion","commitsCount":0,"pusher":{"login":"raphael0202","name":"Raphaël Bournhonesque","path":"/raphael0202","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/9609923?s=80&v=4"}},{"before":"5f2b94ce67ecc4f697cb194ff2bad702980b9102","after":"459f6f4689b5e0d5aa0d29a191d8b3fc92af013d","ref":"refs/heads/main","pushedAt":"2024-09-16T16:02:37.000Z","pushType":"pr_merge","commitsCount":1,"pusher":{"login":"raphael0202","name":"Raphaël Bournhonesque","path":"/raphael0202","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/9609923?s=80&v=4"},"commit":{"message":"feat: update ingredient detection model\n\nUse v1.1 model:\nhttps://huggingface.co/openfoodfacts/ingredient-detection","shortMessageHtmlLink":"feat: update ingredient detection model"}},{"before":null,"after":"6532593fe263a2488485e63cc71121b58f3e6044","ref":"refs/heads/update-ingredient-detection-model","pushedAt":"2024-09-16T15:59:59.000Z","pushType":"branch_creation","commitsCount":0,"pusher":{"login":"raphael0202","name":"Raphaël Bournhonesque","path":"/raphael0202","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/9609923?s=80&v=4"},"commit":{"message":"feat: update ingredient detection model\n\nUse v1.1 model:\nhttps://huggingface.co/openfoodfacts/ingredient-detection","shortMessageHtmlLink":"feat: update ingredient detection model"}},{"before":"c611e53f98a91b15ac9847235f9ebf1689ac422d","after":"bd85443bebea9c20406894585cc747efdee0a97c","ref":"refs/heads/gh-pages","pushedAt":"2024-09-13T08:21:03.000Z","pushType":"push","commitsCount":1,"pusher":{"login":"github-actions[bot]","name":null,"path":"/apps/github-actions","primaryAvatarUrl":"https://avatars.githubusercontent.com/in/15368?s=80&v=4"},"commit":{"message":"Deploying to gh-pages from @ openfoodfacts/robotoff@5f2b94ce67ecc4f697cb194ff2bad702980b9102 🚀","shortMessageHtmlLink":"Deploying to gh-pages from @ <a class=\"commit-link\" data-hovercard-type=\"commit\" data-hovercard-url=\"https://github.com/openfoodfacts/robotoff/commit/5f2b94ce67ecc4f697cb194ff2bad702980b9102/hovercard\" href=\"https://github.com/openfoodfacts/robotoff/commit/5f2b94ce67ecc4f697cb194ff2bad702980b9102\"><tt>5f2b94c</tt></a> 🚀"}},{"before":"f86ee07d627bd5ae88a85ba6f4fd896d42db8f6c","after":"5d83ee27f3b171328003e2f1aa65e0375f2311df","ref":"refs/heads/release-please--branches--main--components--robotoff","pushedAt":"2024-09-13T08:20:29.000Z","pushType":"force_push","commitsCount":0,"pusher":{"login":"openfoodfacts-bot","name":"Open Food Facts Bot","path":"/openfoodfacts-bot","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/119524357?s=80&v=4"},"commit":{"message":"chore(main): release 1.51.1","shortMessageHtmlLink":"chore(main): release 1.51.1"}},{"before":"af4ebc1a7b3c055d02115dda4ef915a400772aa5","after":null,"ref":"refs/heads/fix-entities-agg-ner","pushedAt":"2024-09-13T08:20:02.000Z","pushType":"branch_deletion","commitsCount":0,"pusher":{"login":"raphael0202","name":"Raphaël Bournhonesque","path":"/raphael0202","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/9609923?s=80&v=4"}},{"before":"6eae9d591f06eacaa9ea9ca11ceaf6fdc20bc665","after":"5f2b94ce67ecc4f697cb194ff2bad702980b9102","ref":"refs/heads/main","pushedAt":"2024-09-13T08:20:01.000Z","pushType":"pr_merge","commitsCount":1,"pusher":{"login":"raphael0202","name":"Raphaël Bournhonesque","path":"/raphael0202","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/9609923?s=80&v=4"},"commit":{"message":"fix: fix entity aggregation bug for NER detection\n\nIt looks like it’s because we’re using the “FIRST” aggregation strategy,\nwith a tokenizer that is not word-aware: we’re falling back to some\nheuristics (the presence of spaces before/after the word), that somehow\nfails here.\nIndeed, XLM-RoBERTa model does not use the same tokenizer as RoBERTa,\nand uses an Unigram model (instead of BPE), which is not word-aware.\n\nAnother issue of the “FIRST” aggregation strategy is that the ending\ndot after the ingredient list is predicted as part of the ingredient\nlist, even though it’s not in the non-aggregated prediction.\nBy switching to “SIMPLE” strategy (a strategy without an error\ncorrection mechanism), we don’t have this issue anymore, but two\nsubwords belonging to the same word are sometimes predicted as\nbelonging to two entities.\nA more in-depth analysis of the TokenClassificationPipeline reveals\nthat the issue comes from the Punctuation() pre-tokenizer we added:\nit was not included in the original tokenizer, and the heuristic\ndoesn’t take it into account, leading to an incorrect detection.\nI updated the heuristic to use the `word_ids` provided by the tokenizer\nto know whether the token is a subword or not (with respect to the\npre-tokenization output).","shortMessageHtmlLink":"fix: fix entity aggregation bug for NER detection"}},{"before":"a3b4762db41e9bf488ec6d0f0f1aae54b05f7ef6","after":"af4ebc1a7b3c055d02115dda4ef915a400772aa5","ref":"refs/heads/fix-entities-agg-ner","pushedAt":"2024-09-13T08:15:16.000Z","pushType":"force_push","commitsCount":0,"pusher":{"login":"raphael0202","name":"Raphaël Bournhonesque","path":"/raphael0202","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/9609923?s=80&v=4"},"commit":{"message":"fix: fix entity aggregation bug for NER detection\n\nIt looks like it’s because we’re using the “FIRST” aggregation strategy,\nwith a tokenizer that is not word-aware: we’re falling back to some\nheuristics (the presence of spaces before/after the word), that somehow\nfails here.\nIndeed, XLM-RoBERTa model does not use the same tokenizer as RoBERTa,\nand uses an Unigram model (instead of BPE), which is not word-aware.\n\nAnother issue of the “FIRST” aggregation strategy is that the ending\ndot after the ingredient list is predicted as part of the ingredient\nlist, even though it’s not in the non-aggregated prediction.\nBy switching to “SIMPLE” strategy (a strategy without an error\ncorrection mechanism), we don’t have this issue anymore, but two\nsubwords belonging to the same word are sometimes predicted as\nbelonging to two entities.\nA more in-depth analysis of the TokenClassificationPipeline reveals\nthat the issue comes from the Punctuation() pre-tokenizer we added:\nit was not included in the original tokenizer, and the heuristic\ndoesn’t take it into account, leading to an incorrect detection.\nI updated the heuristic to use the `word_ids` provided by the tokenizer\nto know whether the token is a subword or not (with respect to the\npre-tokenization output).","shortMessageHtmlLink":"fix: fix entity aggregation bug for NER detection"}},{"before":"b2dce1f40d3b66092fefb804178560c4e2510371","after":"a3b4762db41e9bf488ec6d0f0f1aae54b05f7ef6","ref":"refs/heads/fix-entities-agg-ner","pushedAt":"2024-09-13T08:05:01.000Z","pushType":"force_push","commitsCount":0,"pusher":{"login":"raphael0202","name":"Raphaël Bournhonesque","path":"/raphael0202","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/9609923?s=80&v=4"},"commit":{"message":"fix: fix entity aggregation bug for NER detection\n\nIt looks like it’s because we’re using the “FIRST” aggregation strategy,\nwith a tokenizer that is not word-aware: we’re falling back to some\nheuristics (the presence of spaces before/after the word), that somehow\nfails here.\nIndeed, XLM-RoBERTa model does not use the same tokenizer as RoBERTa,\nand uses an Unigram model (instead of BPE), which is not word-aware.\n\nAnother issue of the “FIRST” aggregation strategy is that the ending\ndot after the ingredient list is predicted as part of the ingredient\nlist, even though it’s not in the non-aggregated prediction.\nBy switching to “SIMPLE” strategy (a strategy without an error\ncorrection mechanism), we don’t have this issue anymore, but two\nsubwords belonging to the same word are sometimes predicted as\nbelonging to two entities.\nA more in-depth analysis of the TokenClassificationPipeline reveals\nthat the issue comes from the Punctuation() pre-tokenizer we added:\nit was not included in the original tokenizer, and the heuristic\ndoesn’t take it into account, leading to an incorrect detection.\nI updated the heuristic to use the `word_ids` provided by the tokenizer\nto know whether the token is a subword or not (with respect to the\npre-tokenization output).","shortMessageHtmlLink":"fix: fix entity aggregation bug for NER detection"}},{"before":null,"after":"b2dce1f40d3b66092fefb804178560c4e2510371","ref":"refs/heads/fix-entities-agg-ner","pushedAt":"2024-09-13T08:03:15.000Z","pushType":"branch_creation","commitsCount":0,"pusher":{"login":"raphael0202","name":"Raphaël Bournhonesque","path":"/raphael0202","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/9609923?s=80&v=4"},"commit":{"message":"fix: fix entity aggregation bug for NER detection\n\nIt looks like it’s because we’re using the “FIRST” aggregation strategy,\nwith a tokenizer that is not word-aware: we’re falling back to some\nheuristics (the presence of spaces before/after the word), that somehow\nfails here.\nIndeed, XLM-RoBERTa model does not use the same tokenizer as RoBERTa,\nand uses an Unigram model (instead of BPE), which is not word-aware.\n\nAnother issue of the “FIRST” aggregation strategy is that the ending\ndot after the ingredient list is predicted as part of the ingredient\nlist, even though it’s not in the non-aggregated prediction.\nBy switching to “SIMPLE” strategy (a strategy without an error\ncorrection mechanism), we don’t have this issue anymore, but two\nsubwords belonging to the same word are sometimes predicted as\nbelonging to two entities.\nA more in-depth analysis of the TokenClassificationPipeline reveals\nthat the issue comes from the Punctuation() pre-tokenizer we added:\nit was not included in the original tokenizer, and the heuristic\ndoesn’t take it into account, leading to an incorrect detection.\nI updated the heuristic to use the `word_ids` provided by the tokenizer\nto know whether the token is a subword or not (with respect to the\npre-tokenization output).","shortMessageHtmlLink":"fix: fix entity aggregation bug for NER detection"}}],"hasNextPage":true,"hasPreviousPage":false,"activityType":"all","actor":null,"timePeriod":"all","sort":"DESC","perPage":30,"cursor":"Y3Vyc29yOnYyOpK7MjAyNC0wOS0xOFQwODoxMTowNC4wMDAwMDBazwAAAAS5Yz_h","startCursor":"Y3Vyc29yOnYyOpK7MjAyNC0wOS0xOFQwODoxMTowNC4wMDAwMDBazwAAAAS5Yz_h","endCursor":"Y3Vyc29yOnYyOpK7MjAyNC0wOS0xM1QwODowMzoxNS4wMDAwMDBazwAAAAS1OUcc"}},"title":"Activity · openfoodfacts/robotoff"}