
Add support for extracting more fields #2

Open · IsmailM opened this issue Jul 11, 2022 · 4 comments

@IsmailM (Member) commented Jul 11, 2022

The following fields are not currently being extracted:

  • Fixation Monitor
  • Fixation Target
  • Stimulus
  • Background
  • Visual Acuity

This should be implemented along the lines of the following:

https://github.com/msaifee786/hvf_extraction_script/blob/e978747233887322e66fd7537b7269eb00be1d55/hvf_extraction_script/hvf_data/hvf_object.py#L1055-L1071
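For reference, a minimal sketch of what regex-based extraction of these header fields could look like; the key names and patterns are illustrative guesses, not the actual code linked above:

```python
# Illustrative sketch only: patterns and key names are assumptions,
# not the hvf_object.py implementation linked above.
import re

NEW_FIELD_PATTERNS = {
    "fixation_monitor": re.compile(r"Fixation Monitor:\s*(.+)"),
    "fixation_target": re.compile(r"Fixation Target:\s*(.+)"),
    "stimulus": re.compile(r"Stimulus:\s*(.+)"),
    "background": re.compile(r"Background:\s*(.+)"),
    "visual_acuity": re.compile(r"Visual Acuity:\s*(.*)"),
}

def extract_new_fields(header_text: str) -> dict:
    """Pull the five missing fields out of OCR'd header text."""
    fields = {}
    for key, pattern in NEW_FIELD_PATTERNS.items():
        match = pattern.search(header_text)
        fields[key] = match.group(1).strip() if match else ""
    return fields
```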

@alanwilter commented Jul 16, 2022

On how the HVF Extraction Script (HVFES) works (or should work).

Layout Detection

VF reports come in three layout types; we are mostly interested in V3. For HVFES to detect the V3 format, the input JPG needs to:

  1. Be wider than 1400px, which implies a JPG at 300dpi;
  2. Have the words "Date of Birth" detectable (so the big square covering that field won't help).

[Screenshot]

A 200dpi scan could work for general extraction, but the report will be deemed V1. A 400dpi scan is, believe it or not, even worse!
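A hedged sketch of that V3 check; the function name, the header crop, and the use of pytesseract are illustrative assumptions, not the actual HVFES code:

```python
# Sketch of the V3 layout test described above. The rough header crop and
# the pytesseract call are assumptions, not the real HVFES implementation.
import numpy as np
import pytesseract

V3_MIN_WIDTH = 1400  # px; roughly a 300dpi scan

def looks_like_v3(hvf_image: np.ndarray) -> bool:
    if np.size(hvf_image, 1) <= V3_MIN_WIDTH:
        return False
    # "Date of Birth" must be OCR-readable, i.e. not covered by a redaction box
    header = hvf_image[: hvf_image.shape[0] // 4]
    return "date of birth" in pytesseract.image_to_string(header).lower()
```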

Layout Detection Issues

Apparently, tesseract has some issues with black boxes (redactions) near text, and with the dpi resolution.

For example (for a processed header slice in B&W):

[Screenshot]

```sh
$ tesseract f5.jpg f5 --dpi 300 && grep -i date f5.txt  # Won't work!

# but

$ tesseract f5.jpg f5 --dpi 200 && grep -i date f5.txt  # is fine
Date of Birth:
```

If I use the Mac Preview redact tool, or white boxes instead of black, the results are much better.

header_slice from an image redacted with Mac Preview:

[Screenshot]
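Based on that, a minimal sketch, assuming OpenCV, of repainting solid black redaction boxes white before OCR, mimicking the Mac Preview result:

```python
# Sketch: fill large near-black blobs (redaction boxes) with white so
# tesseract isn't confused by them. The thresholds are guesses to tune.
import cv2
import numpy as np

def whiten_redactions(img: np.ndarray, min_area: int = 5000) -> np.ndarray:
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    # Near-black pixels become the candidate-redaction mask
    _, mask = cv2.threshold(gray, 30, 255, cv2.THRESH_BINARY_INV)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    out = img.copy()
    for c in contours:
        x, y, w, h = cv2.boundingRect(c)
        if w * h >= min_area:  # skip ordinary text strokes
            cv2.rectangle(out, (x, y), (x + w, y + h), (255, 255, 255), -1)
    return out
```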

Image Processing

Any image whose width is less than 2500px is resized up to that minimum width. In my tests, I don't see the need for that if we stick with modern scans.

In hvf_object.py:

```python
# First, need to upscale image if its too low resolution (important for older HVF
# images). Min width is a bit arbitrary but is close to ~300ppi
width = np.size(hvf_image, 1)
MIN_HVF_WIDTH = 2500
```
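The upscaling step that snippet leads into presumably looks something like the following (a sketch assuming OpenCV; the real hvf_object.py code may differ):

```python
# Sketch: continue the snippet above by upscaling narrow images.
# cv2.INTER_CUBIC is my choice here, not necessarily what HVFES uses.
import cv2

if width < MIN_HVF_WIDTH:
    height = np.size(hvf_image, 0)
    scale = MIN_HVF_WIDTH / width
    hvf_image = cv2.resize(hvf_image,
                           (MIN_HVF_WIDTH, int(height * scale)),
                           interpolation=cv2.INTER_CUBIC)
```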

The image is converted to grayscale, the layout is detected, and the image is then converted to B&W for text extraction (in def get_header_metadata_from_hvf_image()).
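A minimal sketch, assuming OpenCV, of that grey-then-B&W preprocessing; Otsu thresholding here is my assumption, not necessarily what HVFES uses:

```python
# Sketch: grayscale copy for layout detection, binarised copy for OCR.
# `hvf_image` is the loaded report image; Otsu thresholding is an assumption.
import cv2

gray = cv2.cvtColor(hvf_image, cv2.COLOR_BGR2GRAY)
_, bw = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)
```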

Layout detection is important because HVFES splits the image into several slices (to optimise OCR, according to the authors).

The header is split into 3 or 4 parts (depending on the layout):

  1. header_slice_image1 (for V2 and V3)

[Screenshot]

  2. Middle:
     • header_slice_image2 (only V2)

[Screenshot]

     • header_slice_image3 (only V2)

[Screenshot]

     • header_slice_image_middle (only for V3)

[Screenshot]

  3. header_slice_image4 (for V2 and V3)

[Screenshot]

As can be seen, applying the wrong layout's slicing will hinder the extraction.
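To make the layout dependence concrete, an illustrative slicing sketch; the ratios below are invented for demonstration and are not the real HVFES coordinates:

```python
# Illustrative only: slice ratios are made up, not HVFES's real coordinates.
def slice_header(header, layout: str) -> dict:
    h, w = header.shape[:2]
    if layout == "v3":
        return {
            "header_slice_image1": header[:, : int(0.40 * w)],
            "header_slice_image_middle": header[:, int(0.40 * w): int(0.70 * w)],
            "header_slice_image4": header[:, int(0.70 * w):],
        }
    # V2 splits the middle into two slices instead of one
    return {
        "header_slice_image1": header[:, : int(0.30 * w)],
        "header_slice_image2": header[:, int(0.30 * w): int(0.55 * w)],
        "header_slice_image3": header[:, int(0.55 * w): int(0.80 * w)],
        "header_slice_image4": header[:, int(0.80 * w):],
    }
```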

Other slices, used in def get_metric_metadata_from_hvf_image(), according to the layout:

  1. dev_val_slice_imageV2

[Screenshot]

  2. dev_val_slice_imageV3

[Screenshot]

Most changes are being made in visual_fields_extraction/hvf_extraction_script/hvf_data/hvf_object.py, where I added the new key fields for extraction.

alanwilter self-assigned this on Jul 16, 2022
@alanwilter commented Jul 16, 2022

To improve the efficiency of the new-field detection, we should use a list of allowed choices for each new field, assuming those fields have a restricted set of values (see the sketch after this list):

  • Fixation Monitor: Gaze/Blind Spot (what else?)
  • Fixation Target: Central (what else?)
  • Stimulus: III, White (OCR usually gets "Ill, White") (what else?)
  • Background: 31.5 ASB (what else?)
  • Visual Acuity: EMPTY so far
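A minimal sketch, assuming Python's difflib, of snapping noisy OCR output to such a restricted choice list; the choice values are the ones listed above, and the cutoff is a guess to tune against real reports:

```python
# Sketch: fuzzy-match OCR text against a restricted list of known values.
# Choice lists and the 0.6 cutoff are assumptions, not HVFES behaviour.
from difflib import get_close_matches

FIELD_CHOICES = {
    "fixation_monitor": ["Gaze/Blind Spot"],
    "fixation_target": ["Central"],
    "stimulus": ["III, White"],
    "background": ["31.5 ASB"],
}

def snap_to_choice(field: str, ocr_text: str) -> str:
    choices = FIELD_CHOICES.get(field, [])
    match = get_close_matches(ocr_text.strip(), choices, n=1, cutoff=0.6)
    return match[0] if match else ocr_text

print(snap_to_choice("stimulus", "Ill, White"))  # -> "III, White"
```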

@alanwilter commented:

We need to improve layout detection. So, first, we cannot redact the words "Date of Birth:" (but we must redact the DD/MM/YYYY value itself).

If we want gender, the same applies.

I'm investigating how to make tesseract handle black boxes better. I'm also tweaking the code to help me debug this.

@alanwilter commented:

I've done as much as I could for now.

  • It works for 300 to 400 dpi, but not with 200 dpi (it fails to detect "Date of Birth").
  • Redactions can be black, grey, or any colour, in principle.
  • For correct V3 detection, the Date of Birth key cannot be redacted.

Things we can do:

  • Use another field to disambiguate layouts, like the version string at the bottom of the reports.
  • Convert the date string into a datetime format compatible with JSON (see the sketch after this list).
    • If so, the examples I have use date strings in American format, even the MEH example; is that correct?
  • Improve tesseract with EAST?
  • Explore AWS Rekognition?
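For the JSON date point, a minimal sketch assuming the reports print American-style MM-DD-YYYY dates (the exact format string is a guess to verify against real reports):

```python
# Sketch: normalise an American-format date string to ISO 8601 for JSON.
# The "%m-%d-%Y" input format is an assumption.
from datetime import datetime

def to_json_date(raw: str) -> str:
    dt = datetime.strptime(raw.strip(), "%m-%d-%Y")
    return dt.date().isoformat()

print(to_json_date("07-16-2022"))  # -> "2022-07-16"
```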

We need more reports to test. And if we're ever going to use the V2 layout, we need several examples of that as well.
