
Add support for extracting more fields #2

Open · IsmailM opened this issue Jul 11, 2022 · 4 comments

@IsmailM (Member) commented Jul 11, 2022

The following fields are not currently being extracted:

  • Fixation Monitor
  • Fixation Target
  • Stimulus
  • Background
  • Visual Acuity

This should be implemented along the lines of the following:

https://github.com/msaifee786/hvf_extraction_script/blob/e978747233887322e66fd7537b7269eb00be1d55/hvf_extraction_script/hvf_data/hvf_object.py#L1055-L1071
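For reference, a minimal sketch of what regex-based extraction of these header fields could look like; the key names and patterns are illustrative guesses, not the actual code linked above:

```python
# Illustrative sketch only: patterns and key names are assumptions,
# not the hvf_object.py implementation linked above.
import re

NEW_FIELD_PATTERNS = {
    "fixation_monitor": re.compile(r"Fixation Monitor:\s*(.+)"),
    "fixation_target": re.compile(r"Fixation Target:\s*(.+)"),
    "stimulus": re.compile(r"Stimulus:\s*(.+)"),
    "background": re.compile(r"Background:\s*(.+)"),
    "visual_acuity": re.compile(r"Visual Acuity:\s*(.*)"),
}

def extract_new_fields(header_text: str) -> dict:
    """Pull the five missing fields out of OCR'd header text."""
    fields = {}
    for key, pattern in NEW_FIELD_PATTERNS.items():
        match = pattern.search(header_text)
        fields[key] = match.group(1).strip() if match else ""
    return fields
```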

@alanwilter commented Jul 16, 2022

On how the HVF Extraction Script (HVFES) works (or should work).

Layout Detection

VF reports come in three layout types; we are mostly interested in V3. For HVFES to detect the V3 format, the input JPG needs to:

  1. Be wider than 1400px, which implies a JPG at 300dpi;
  2. Have the words "Date of Birth" detectable (so the big square covering that field won't help).

[Screenshot]

A 200dpi scan could work for general extraction, but the report will be deemed V1. A 400dpi scan is, believe it or not, even worse!
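A hedged sketch of that V3 check; the function name, the header crop, and the use of pytesseract are illustrative assumptions, not the actual HVFES code:

```python
# Sketch of the V3 layout test described above. The rough header crop and
# the pytesseract call are assumptions, not the real HVFES implementation.
import numpy as np
import pytesseract

V3_MIN_WIDTH = 1400  # px; roughly a 300dpi scan

def looks_like_v3(hvf_image: np.ndarray) -> bool:
    if np.size(hvf_image, 1) <= V3_MIN_WIDTH:
        return False
    # "Date of Birth" must be OCR-readable, i.e. not covered by a redaction box
    header = hvf_image[: hvf_image.shape[0] // 4]
    return "date of birth" in pytesseract.image_to_string(header).lower()
```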

Layout Detection Issues

Apparently, tesseract has some issues with black boxes (redactions) near text, and with the dpi resolution.

For example (for a processed header slice in B&W):

[Screenshot]

```sh
$ tesseract f5.jpg f5 --dpi 300 && grep -i date f5.txt  # Won't work!

# but

$ tesseract f5.jpg f5 --dpi 200 && grep -i date f5.txt  # is fine
Date of Birth:
```

If I use the Mac Preview redact tool, or white boxes instead of black, the results are much better.

header_slice from an image redacted with Mac Preview:

[Screenshot]
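Based on that, a minimal sketch, assuming OpenCV, of repainting solid black redaction boxes white before OCR, mimicking the Mac Preview result:

```python
# Sketch: fill large near-black blobs (redaction boxes) with white so
# tesseract isn't confused by them. The thresholds are guesses to tune.
import cv2
import numpy as np

def whiten_redactions(img: np.ndarray, min_area: int = 5000) -> np.ndarray:
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    # Near-black pixels become the candidate-redaction mask
    _, mask = cv2.threshold(gray, 30, 255, cv2.THRESH_BINARY_INV)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    out = img.copy()
    for c in contours:
        x, y, w, h = cv2.boundingRect(c)
        if w * h >= min_area:  # skip ordinary text strokes
            cv2.rectangle(out, (x, y), (x + w, y + h), (255, 255, 255), -1)
    return out
```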

Image Processing

Any image whose width is less than 2500px is resized up to that minimum width. In my tests, I don't see the need for that if we stick with modern scans.

In hvf_object.py:

```python
# First, need to upscale image if its too low resolution (important for older HVF
# images). Min width is a bit arbitrary but is close to ~300ppi
width = np.size(hvf_image, 1)
MIN_HVF_WIDTH = 2500
```
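The upscaling step that snippet leads into presumably looks something like the following (a sketch assuming OpenCV; the real hvf_object.py code may differ):

```python
# Sketch: continue the snippet above by upscaling narrow images.
# cv2.INTER_CUBIC is my choice here, not necessarily what HVFES uses.
import cv2

if width < MIN_HVF_WIDTH:
    height = np.size(hvf_image, 0)
    scale = MIN_HVF_WIDTH / width
    hvf_image = cv2.resize(hvf_image,
                           (MIN_HVF_WIDTH, int(height * scale)),
                           interpolation=cv2.INTER_CUBIC)
```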

The image is converted to grayscale, the layout is detected, and the image is then converted to B&W for text extraction (in def get_header_metadata_from_hvf_image()).
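A minimal sketch, assuming OpenCV, of that grey-then-B&W preprocessing; Otsu thresholding here is my assumption, not necessarily what HVFES uses:

```python
# Sketch: grayscale copy for layout detection, binarised copy for OCR.
# `hvf_image` is the loaded report image; Otsu thresholding is an assumption.
import cv2

gray = cv2.cvtColor(hvf_image, cv2.COLOR_BGR2GRAY)
_, bw = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)
```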

Layout detection is important because HVFES splits the image into several slices (to optimise OCR, according to the authors).

The header is split into 3 or 4 parts (depending on the layout):

  1. header_slice_image1 (for V2 and V3)

[Screenshot]

  2. Middle:
     • header_slice_image2 (only V2)

[Screenshot]

     • header_slice_image3 (only V2)

[Screenshot]

     • header_slice_image_middle (only for V3)

[Screenshot]

  3. header_slice_image4 (for V2 and V3)

[Screenshot]

As can be seen, applying the wrong layout's slicing will hinder the extraction.
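To make the layout dependence concrete, an illustrative slicing sketch; the ratios below are invented for demonstration and are not the real HVFES coordinates:

```python
# Illustrative only: slice ratios are made up, not HVFES's real coordinates.
def slice_header(header, layout: str) -> dict:
    h, w = header.shape[:2]
    if layout == "v3":
        return {
            "header_slice_image1": header[:, : int(0.40 * w)],
            "header_slice_image_middle": header[:, int(0.40 * w): int(0.70 * w)],
            "header_slice_image4": header[:, int(0.70 * w):],
        }
    # V2 splits the middle into two slices instead of one
    return {
        "header_slice_image1": header[:, : int(0.30 * w)],
        "header_slice_image2": header[:, int(0.30 * w): int(0.55 * w)],
        "header_slice_image3": header[:, int(0.55 * w): int(0.80 * w)],
        "header_slice_image4": header[:, int(0.80 * w):],
    }
```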

Other slices, used in def get_metric_metadata_from_hvf_image(), according to the layout:

  1. dev_val_slice_imageV2

[Screenshot]

  2. dev_val_slice_imageV3

[Screenshot]

Most changes are being made in visual_fields_extraction/hvf_extraction_script/hvf_data/hvf_object.py, where I added the new key fields for extraction.

alanwilter self-assigned this on Jul 16, 2022
@alanwilter commented Jul 16, 2022

To improve the efficiency of the new-field detection, we should use a list of allowed choices for each new field, assuming those fields have a restricted set of values (see the sketch after this list):

  • Fixation Monitor: Gaze/Blind Spot (what else?)
  • Fixation Target: Central (what else?)
  • Stimulus: III, White (OCR usually gets "Ill, White") (what else?)
  • Background: 31.5 ASB (what else?)
  • Visual Acuity: EMPTY so far
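A minimal sketch, assuming Python's difflib, of snapping noisy OCR output to such a restricted choice list; the choice values are the ones listed above, and the cutoff is a guess to tune against real reports:

```python
# Sketch: fuzzy-match OCR text against a restricted list of known values.
# Choice lists and the 0.6 cutoff are assumptions, not HVFES behaviour.
from difflib import get_close_matches

FIELD_CHOICES = {
    "fixation_monitor": ["Gaze/Blind Spot"],
    "fixation_target": ["Central"],
    "stimulus": ["III, White"],
    "background": ["31.5 ASB"],
}

def snap_to_choice(field: str, ocr_text: str) -> str:
    choices = FIELD_CHOICES.get(field, [])
    match = get_close_matches(ocr_text.strip(), choices, n=1, cutoff=0.6)
    return match[0] if match else ocr_text

print(snap_to_choice("stimulus", "Ill, White"))  # -> "III, White"
```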

@alanwilter commented:

We need to improve layout detection. So, first, we cannot redact the words "Date of Birth:" (but we must redact the DD/MM/YYYY value itself).

If we want gender, the same applies.

I'm investigating how to make tesseract handle black boxes better. I'm also tweaking the code to help me debug this.

@alanwilter commented:

I've done as much as I could for now.

  • It works for 300 to 400 dpi, but not with 200 dpi (it fails to detect "Date of Birth").
  • Redactions can be black, grey, or any colour, in principle.
  • For correct V3 detection, the Date of Birth key cannot be redacted.

Things we can do:

  • Use another field to disambiguate layouts, like the version string at the bottom of the reports.
  • Convert the date string into a datetime format compatible with JSON (see the sketch after this list).
    • If so, the examples I have use date strings in American format, even the MEH example; is that correct?
  • Improve tesseract with EAST?
  • Explore AWS Rekognition?
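For the JSON date point, a minimal sketch assuming the reports print American-style MM-DD-YYYY dates (the exact format string is a guess to verify against real reports):

```python
# Sketch: normalise an American-format date string to ISO 8601 for JSON.
# The "%m-%d-%Y" input format is an assumption.
from datetime import datetime

def to_json_date(raw: str) -> str:
    dt = datetime.strptime(raw.strip(), "%m-%d-%Y")
    return dt.date().isoformat()

print(to_json_date("07-16-2022"))  # -> "2022-07-16"
```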

We need more reports to test. And if we're ever going to use the V2 layout, we need several examples of that as well.
