
Potential bug in mAP computation of Florence-2 fine-tuning notebook #294

Open

patel-zeel opened this issue Jul 27, 2024 · 6 comments

Labels
bug Something isn't working

Comments

@patel-zeel

Search before asking

  • I have searched the Roboflow Notebooks issues and found no similar bug report.

Notebook name

Fine-tuning Florence-2 on Object Detection Dataset

Bug

Predictions from Florence-2 fine-tuned model look like the following:

[Detections(xyxy=array([[ 52.8    , 237.76   , 169.28   , 470.08   ],
        [373.44   , 113.6    , 512.32   , 358.08   ],
        [161.59999, 330.56   , 301.75998, 585.27997],
        [311.36   , 360.     , 447.03998, 616.64   ],
        [173.12   ,  14.4    , 303.03998, 253.12   ]], dtype=float32), mask=None, confidence=array([1., 1., 1., 1., 1.]), class_id=array([34, 50, 46,  2, 33]), tracker_id=None, data={'class_name': array(['9 of hearts', 'queen of hearts', 'king of hearts', '10 of hearts',
        '9 of diamonds'], dtype='<U15')}),
 Detections(xyxy=array([[3.3056000e+02, 4.2559998e+01, 5.1679999e+02, 2.0703999e+02],
        [2.0128000e+02, 8.2239998e+01, 3.8112000e+02, 3.2351999e+02],
        [3.1999999e-01, 1.2959999e+02, 2.6719998e+02, 4.1312000e+02],
        [1.9808000e+02, 1.7375999e+02, 4.8863998e+02, 4.9887997e+02]],
       dtype=float32), mask=None, confidence=array([1., 1., 1., 1.]), class_id=array([16, 24, 32, 28]), tracker_id=None, data={'class_name': array(['5 of clubs', '7 of clubs', '9 of clubs', '8 of clubs'],
       dtype='<U10')}),
 Detections(xyxy=array([[369.6    , 234.56   , 517.44   , 490.56   ],
        [ 87.36   , 163.51999, 255.04   , 402.24   ]], dtype=float32), mask=None, confidence=array([1., 1.]), class_id=array([35, 44]), tracker_id=None, data={'class_name': array(['9 of spades', 'king of clubs'], dtype='<U17')}),
 Detections(xyxy=array([[ 56.     , 228.79999, 331.84   , 636.48   ]], dtype=float32), mask=None, confidence=array([1.]), class_id=array([31]), tracker_id=None, data={'class_name': array(['8 of spades'], dtype='<U13')})]

It seems that the confidence score is always 1. Wouldn't this cause an issue when constructing the precision-recall curve and, consequently, when computing mAP?
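
To illustrate the concern, here is a rough sketch in plain NumPy (not the notebook's actual evaluation code) of how a precision-recall curve is typically built by sweeping confidence thresholds. The scores and match flags below are made up for illustration; the point is that when every score is 1, no threshold can separate the predictions, so the ranking the curve relies on becomes arbitrary.

import numpy as np

# Hypothetical scores and match flags for 6 predictions of one class,
# sorted by descending confidence. matched[i] is True when prediction i
# overlaps a ground-truth box of the same class at IoU >= 0.5.
scores  = np.array([0.95, 0.90, 0.80, 0.60, 0.40, 0.30])
matched = np.array([True, True, False, True, False, False])
num_gt  = 4  # number of ground-truth boxes of this class

def precision_recall_curve(scores, matched, num_gt):
    order = np.argsort(-scores)          # sweep thresholds from high to low
    tp = np.cumsum(matched[order])       # cumulative true positives
    fp = np.cumsum(~matched[order])      # cumulative false positives
    precision = tp / (tp + fp)
    recall = tp / num_gt
    return precision, recall

# A traditional detector yields a curve with distinct operating points.
print(precision_recall_curve(scores, matched, num_gt))

# With Florence-2 every score is 1.0: the intermediate points only reflect an
# arbitrary tie-breaking order, and only the final (all-predictions) point is
# meaningful.
print(precision_recall_curve(np.ones(6), matched, num_gt))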

Environment

NA

Minimal Reproducible Example

NA

Additional

NA

Are you willing to submit a PR?

  • Yes I'd like to help by submitting a PR!
patel-zeel added the bug label on Jul 27, 2024
@SkalskiP
Collaborator

SkalskiP commented Aug 1, 2024

Yup, it is not perfect. We are trying to apply traditional computer vision metrics to models that exist outside the traditional computer vision space. Florence-2 is a VLM, and when a VLM performs object detection, all of the boxes come back with the same probability: confidence 100%.

@patel-zeel
Author

patel-zeel commented Aug 2, 2024

Thank you for your response, @SkalskiP. I was wondering what a fair comparison would be in such cases. For example, should we also convert traditional models' confidence scores to 1 before computing mAP?
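
For concreteness, this is roughly what I have in mind, assuming the predictions are supervision Detections objects as in the notebook; the flatten_confidence helper below is hypothetical and only meant to illustrate the idea, not to claim this is the right comparison.

import numpy as np

def flatten_confidence(detections):
    # `detections` is a supervision.Detections object, as used in the notebook.
    # Overwriting the scores with 1.0 (in place, for simplicity) puts a
    # traditional detector under the same "no ranking information" condition
    # as the VLM before mAP is computed.
    detections.confidence = np.ones(len(detections), dtype=np.float32)
    return detections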

@SkalskiP
Collaborator

SkalskiP commented Aug 6, 2024

I don't have a good answer for that right now. However, given the growth of VLMs over the past 1-2 years, I think this will become an important issue as we measure VLM performance.

@SkalskiP
Collaborator

SkalskiP commented Aug 7, 2024

@patel-zeel, your question motivated me to reach out to Lucas Beyer, who leads the team behind PaliGemma. It looks like there is no better way to do it than mAP with confidence = 100%. He suggested using both AP and AR for a more diverse comparison.

@patel-zeel
Author

patel-zeel commented Aug 7, 2024

@SkalskiP Thank you for the update and follow-up on this! Great to hear the feedback from the PaliGemma lead.

He suggested using both AP and AR for a more diverse comparison.

If I understand correctly, this means:

  • For VLMs like Florence-2, we can only compute "P" and "R" at a given IoU.
  • For traditional models, we can compute "AP" and "AR" at a given IoU using different confidence scores.
  • We compare "P" with "AP" and "R" with "AR".

That sounds reasonable and motivates me to look even deeper into this. A rough sketch of the single-point precision/recall computation from the first bullet is below.
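
The sketch below uses plain NumPy and a hypothetical precision_recall_at_iou helper (not anything from the notebook) to show what reporting "P" and "R" at a fixed IoU could look like when the confidences carry no ranking information.

import numpy as np

def box_iou(a, b):
    # Pairwise IoU between two sets of xyxy boxes, shapes (N, 4) and (M, 4).
    lt = np.maximum(a[:, None, :2], b[None, :, :2])
    rb = np.minimum(a[:, None, 2:], b[None, :, 2:])
    wh = np.clip(rb - lt, 0, None)
    inter = wh[..., 0] * wh[..., 1]
    area_a = (a[:, 2] - a[:, 0]) * (a[:, 3] - a[:, 1])
    area_b = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    return inter / (area_a[:, None] + area_b[None, :] - inter)

def precision_recall_at_iou(pred_xyxy, pred_cls, gt_xyxy, gt_cls, iou_thr=0.5):
    # Greedy one-to-one matching of predictions to ground truth at a fixed IoU
    # threshold. With uniform confidences there is no meaningful ranking, so a
    # single (precision, recall) point at this threshold is all we can report.
    iou = box_iou(pred_xyxy, gt_xyxy)
    iou[pred_cls[:, None] != gt_cls[None, :]] = 0.0  # classes must match
    tp, used = 0, set()
    for i in range(len(pred_xyxy)):
        j = int(np.argmax(iou[i])) if iou.shape[1] else -1
        if j >= 0 and iou[i, j] >= iou_thr and j not in used:
            tp += 1
            used.add(j)
    fp = len(pred_xyxy) - tp
    fn = len(gt_xyxy) - tp
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

For a VLM, this single point would then be compared against a traditional detector's AP/AR at the same IoU threshold, as discussed above.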

@SkalskiP
Collaborator

SkalskiP commented Aug 7, 2024

That's what I'll do for now. Yup.
