
Inconsistent Calculation of Patch Numbers in Image Processing and Encoding #21

Open
ziyangliu666 opened this issue Apr 18, 2024 · 1 comment

Comments

@ziyangliu666

In image processing, both the original image and the sliced images are padded against the same resized_patch_height and resized_patch_width. In image encoding, however, the original image uses abstract_h_num and abstract_w_num, while the sliced images use slice_h_num and slice_w_num. The two approaches appear to be inconsistent.

image processing:

```python
slices.append(image)
for image in slices:
    image = ToTensor()(image)
    image = torch.nn.functional.interpolate(
        image.unsqueeze(0),
        size=(resized_height, resized_width),
        mode="bilinear",
        align_corners=False,
        antialias=True,
    ).squeeze(0)
    # number of patches that need to be masked (padded)
    num_patches_to_pad = MAX_PATCHES - resized_patch_height * resized_patch_width
```
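For concreteness, the padding count works like this. A minimal sketch, where MAX_PATCHES, the helper name, and the example grid sizes are all hypothetical illustration values, not taken from the repo:

```python
# Hypothetical values for illustration only; the repo defines its own MAX_PATCHES.
MAX_PATCHES = 1024

def patches_to_pad(resized_patch_height, resized_patch_width, max_patches=MAX_PATCHES):
    """Number of mask/pad entries needed to fill the fixed-length patch sequence."""
    used = resized_patch_height * resized_patch_width
    assert used <= max_patches, "image produced more patches than the budget"
    return max_patches - used

print(patches_to_pad(24, 24))  # 1024 - 576 = 448
```

Because every image (original and slices) is padded against the same (resized_patch_height, resized_patch_width) grid here, the mask length depends only on that one grid.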

image encoding:

```python
for i in range(len(origin_image_widths)):
    slice_w_num, slice_h_num, abstract_w_num, abstract_h_num = get_patch_nums(
        origin_image_widths[i], origin_image_heights[i])
    slice_w_nums.append(slice_w_num)
    slice_h_nums.append(slice_h_num)
    abstract_w_nums.append(abstract_w_num)
    abstract_h_nums.append(abstract_h_num)

for i, image in enumerate(split_images):
    if i == 7:
        # the original (abstract) image uses abstract_{w,h}_num
        image_forward_out = self.vision_tower(
            image.to(device=self.device, dtype=self.dtype).unsqueeze(0),
            output_hidden_states=True,
            w_patch_num=abstract_w_nums,
            h_patch_num=abstract_h_nums)
    else:
        # sliced images use slice_{w,h}_num
        image_forward_out = self.vision_tower(
            image.to(device=self.device, dtype=self.dtype).unsqueeze(0),
            output_hidden_states=True,
            w_patch_num=slice_w_nums,
            h_patch_num=slice_h_nums)
```
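The mismatch can be made concrete by comparing the patch grid implied by the resized tensor against per-image patch numbers. A minimal sketch, assuming a ViT patch size of 14; the helper below is hypothetical and is not the repo's get_patch_nums:

```python
PATCH_SIZE = 14  # assumed ViT patch size (e.g. CLIP ViT-L/14); an illustration value

def patch_grid(resized_height, resized_width, patch_size=PATCH_SIZE):
    # Patch counts implied by the resized tensor that image processing produces.
    # Image processing masks every image against this one grid.
    return resized_height // patch_size, resized_width // patch_size

# Encoding, by contrast, passes abstract_{h,w}_num for the original image and
# slice_{h,w}_num for the slices. The two paths agree only if those numbers
# happen to equal the grid below for every image:
print(patch_grid(336, 672))  # (24, 48)
```

If abstract_h_num * abstract_w_num (or slice_h_num * slice_w_num) differs from resized_patch_height * resized_patch_width, the attention mask computed during processing no longer lines up with the features produced during encoding.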

@ParadoxZW

Hi, @ziyangliu666 !

I've released another implementation of LLaVA-UHD here, which I believe is more stable and elegant. The new repo's code originates from this one, but its overall quality is improved, and the training pipeline has been tested to run without bugs.

While reviewing this old repo and trying to fix this RuntimeError issue, I found that it contains many hidden bugs, calculations with incorrect logic (violating the spirit of the original paper), and omissions of necessary processing steps (such as image normalization). So I decided to rewrite the code and do my best to fix all of these issues. I have now open-sourced my rewritten version.

You are very welcome to use it, and I look forward to your feedback.
