Missing and unknown text in outline when use `pdf.start_section()` #458

hackinteach · 2022-06-22T15:06:11Z

Error details

I used pdf.start_section() with Thai text. Some of the generated outline entries are missing characters, and some have extra junk characters appended. I tried putting these words in a random order, and the error persists for the same words.
The problematic words are as follows:

ลักษณะ -> ลัก
ระดับฮอร์โมนเพศชาย -> ระดับฮอร์โมนเพศชาย਩㸾攊摮扯੪〱㤱〠漠橢㰊਼䌯畯瑮〠⼊敄瑳嬠㔱〠删⼠奘⁚⸰‰〵⸱㌷渠汵嵬⼊敎瑸ㄠ㈰‰‰੒倯牡湥⁴〱㠱〠删⼊楔汴⁥ＨӾ『䄎ᤎᤎᐎ㔎䀎ⴎ䜎ᤎ䀎ⴎȎⴎ܎Ў㠎ጎ‎⠀圀攀氀愀氀愀匀挀漀爀攀⤀⤀㸊ਾ湥潤橢ㄊ㈰‰‰扯੪㰼⼊潃湵⁴ਲ䐯獥⁴㉛‱‰⁒堯婙〠〮㘠⸸㌰渠汵嵬⼊楆獲⁴〱ㄲ〠删⼊慌瑳ㄠ㐰‸‰੒丯硥⁴〱㤴〠删⼊慐敲瑮ㄠ㄰‸‰੒倯敲⁶〱㤱〠删⼊楔汴⁥Ｈ◾ㄎĎ⤎ጎ『䀎ऎḎ㈎『Ȏⴎ܎Ў㠎ጎ.....
ระดับฮอร์โมนเพศหญิง -> ระดับฮอร์โมนเพศหญิง਩㸾攊摮扯੪〱〴〠漠橢㰊਼䌯畯瑮〠⼊敄瑳嬠㌳〠删⼠奘⁚⸰‰㠲⸹㌱渠汵嵬⼊敎瑸ㄠ㐰‱‰੒倯牡湥⁴〱ㄲ〠删⼊牐癥ㄠ㌰‹‰੒启瑩敬⠠สีผม਩㸾攊摮扯੪〱ㄴ〠漠橢㰊਼䌯畯瑮〠⼊敄瑳嬠㌳〠删⼠奘⁚⸰‰㌲⸲㐴渠汵嵬⼊敎瑸ㄠ㐰′‰੒倯牡湥⁴〱ㄲ〠删⼊牐癥ㄠ㐰‰‰੒启瑩敬⠠ระดับฮอร์โมนเพศชาย਩㸾攊摮扯੪〱㈴〠漠橢㰊਼䌯畯瑮〠⼊敄瑳嬠㌳〠删⼠奘⁚⸰‰㜱⸵㔷渠汵嵬⼊敎瑸ㄠ㐰″‰੒倯牡湥⁴〱ㄲ〠删⼊牐癥ㄠ㐰‱‰੒启瑩敬⠠ความไวต่อการเจ็บปวด਩㸾攊摮扯......

I noticed that the junk text contains some words that are actually used elsewhere in the outline, such as สีผม, ระดับฮอร์โมนเพศชาย, ความไวต่อการเจ็บปวด.
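
One possible reading of these symptoms, offered as an assumption and not confirmed anywhere in this thread: in UTF-16, ษ (U+0E29) and ศ (U+0E28) each contain a byte equal to ASCII `)` or `(`, which delimit a PDF literal string. An unescaped `)` would end a title early (ลักษณะ -> ลัก), and an unescaped `(` would let a reader run on into the following objects. The junk also looks like raw PDF syntax displayed as UTF-16 little-endian. The sketch below only checks those two observations; `delimiter_bytes` is a hypothetical helper, not part of fpdf2.

```python
def delimiter_bytes(text: str, encoding: str = "utf-16-be") -> list:
    """Return (character, byte) pairs whose encoded form contains '(', ')' or a backslash."""
    special = b"()\\"  # bytes with special meaning inside a PDF literal string
    hits = []
    for ch in text:
        for byte in ch.encode(encoding):
            if byte in special:
                hits.append((ch, bytes([byte])))
    return hits


# Titles taken from the report, plus a plain ASCII one for comparison.
titles = ["ลักษณะเฉพาะของคุณ", "ระดับฮอร์โมนเพศชาย", "helllo"]
for title in titles:
    print(title, delimiter_bytes(title))

# Raw PDF syntax that would follow a string, misread as UTF-16 little-endian.
raw_pdf = b")\n>>\nendobj\n"
misread = raw_pdf.decode("utf-16-le")
print(misread)
```

If this reading is right, any serialization that keeps raw `(` and `)` bytes out of the string, or escapes them, would avoid both the truncation and the trailing junk.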

However, when I tried to write minimal code to reproduce the error, the same words as above disappeared from the outline, and the first outline section is incomplete (try the code below).

Minimal code
Please download the Kanit font to see the Thai characters rendered.

from fpdf import FPDF
from typing import List

# Make sure all cases use the same code
def generate_pdf(lst: List[str], output_name: str, font_path: str):
    pdf = FPDF()
    pdf.add_font("Kanit", fname=font_path)

    pdf.set_font("Kanit", size=20)
    pdf.set_text_color(0,0,0)

    curr_y = 20

    for i, txt in enumerate(lst):
        pdf.add_page()
        pdf.set_xy(20, curr_y)
        pdf.start_section(txt, level=0)
        pdf.cell(w=pdf.get_string_width(txt), h=10, align="C", txt=txt)

    pdf.output(output_name)

error_lst = ["ลักษณะเฉพาะของคุณ", "ระดับฮอร์โมนเพศชาย", "ระดับฮอร์โมนเพศหญิง", "helllo"]
ok_lst = list("abcdef")
mixed_lst =  ok_lst + error_lst

FONT_PATH = "<EDIT_ME>"  # replace with the path to the downloaded Kanit font file

# this one is missing most of the sections in the outline
generate_pdf(error_lst, "error_outline.pdf", FONT_PATH) 

# this one contains all expected outlines
generate_pdf(ok_lst, "ok_outline.pdf", FONT_PATH)

# this one renders outlines for `a,b,c,d,e,f`, but the sections from `error_lst` are incomplete and contain junk
generate_pdf(mixed_lst, "mixed_outline.pdf", FONT_PATH)

Environment

OS: OS X 12.4 Monterey
Python version: 3.8.12
fpdf2 version used: 2.5.5


Lucas-C · 2022-06-22T18:55:53Z

Thank you for reporting this.

This may be related to #365 & #459
I do not know Thai language: is your issue @hackinteach related to tone marks, as described in #459?

hackinteach · 2022-06-23T07:01:20Z

Not really, in my opinion, since this is the PDF outline, which doesn't use the custom font to render.

gmischler · 2022-06-23T07:12:47Z

> This may be related to #365 & #459

Or maybe rather related to #320 ?

Lucas-C · 2022-06-23T21:13:56Z

I have a starting point for a fix, in syntax.py:

from binascii import hexlify
import codecs
class PDFString(str):
    def serialize(self):
        # Using the "Hexadecimal String" format defined in the PDF spec:
        return '<%s>' % hexlify(codecs.BOM_UTF16_BE + self.encode('utf-16-be')).decode('latin-1')

With this new string serialization logic, @hackinteach's code generates PDFs that display a readable outline in Adobe Acrobat Reader.
Sadly, for some reason that I cannot figure out right now, the resulting PDFs display no outline when opened with the Sumatra PDF reader!

What's even more frustrating is that if I pass the resulting PDF file to the qpdf command, without any option, the final PDF has an outline visible in both PDF readers! 🤣

qpdf error_outline.pdf error_outline-clean.pdf

But I can't figure out exactly what kind of "clean-up" qpdf is performing there...

Any help would be appreciated 😊
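
For anyone who wants to try the serialization from the snippet above in isolation, here is a minimal standalone sketch. It assumes nothing about the rest of syntax.py; `to_pdf_hex_string` and `from_pdf_hex_string` are hypothetical names used only for this illustration.

```python
from binascii import hexlify
import codecs


def to_pdf_hex_string(text: str) -> str:
    """Serialize text as a PDF hexadecimal string: BOM + UTF-16BE, hex-encoded."""
    payload = codecs.BOM_UTF16_BE + text.encode("utf-16-be")
    return "<%s>" % hexlify(payload).decode("ascii")


def from_pdf_hex_string(serialized: str) -> str:
    """Reverse of the above, to check that nothing is lost."""
    payload = bytes.fromhex(serialized[1:-1])  # drop the surrounding < >
    return payload.decode("utf-16")  # the leading BOM selects big-endian


title = "ระดับฮอร์โมนเพศชาย"
serialized = to_pdf_hex_string(title)
print(serialized)
assert from_pdf_hex_string(serialized) == title
```

Because the output contains only hex digits between `<` and `>`, no byte of the title can be mistaken for a string delimiter.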

gmischler · 2022-06-27T08:24:45Z

> But I can't figure out exactly what kind of "clean-up" qpdf is performing there...

Sounds like you'd have to compare the binary data in the two files.
Not the most pleasant type of debugging... 😉

Lucas-C · 2022-06-27T12:42:27Z

> Sounds like you'd have to compare the binary data in the two files.

As qpdf re-orders & re-formats all PDF objects, this is quite difficult...
I guess the best approach would be to build a really minimal version of both the "OK" & "KO" files to figure out what makes the difference...

Lucas-C · 2022-06-29T07:34:33Z

I submitted #463 that should fix this

gmischler · 2022-06-29T10:42:22Z

Ah yes, in hindsight it makes sense that UTF-16 strings would need a BOM.
Nice catch! 👍
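
To illustrate why the BOM matters: as I understand the PDF specification, a reader treats a text string as UTF-16BE only when it starts with the bytes FE FF, and otherwise falls back to PDFDocEncoding. The sketch below mimics that decision. It is a simplification, not how any particular reader is implemented: `decode_text_string` is a hypothetical name, and Latin-1 is used as a stand-in for PDFDocEncoding.

```python
import codecs


def decode_text_string(raw: bytes) -> str:
    """Decode a PDF text string the way a reader is expected to pick the encoding."""
    if raw.startswith(codecs.BOM_UTF16_BE):
        return raw[len(codecs.BOM_UTF16_BE):].decode("utf-16-be")
    # Assumption: Latin-1 approximates PDFDocEncoding; they differ for some byte values.
    return raw.decode("latin-1")


title = "สีผม"
encoded = title.encode("utf-16-be")

with_bom = decode_text_string(codecs.BOM_UTF16_BE + encoded)
without_bom = decode_text_string(encoded)

print(repr(with_bom))     # the original title
print(repr(without_bom))  # one character per byte, not the title
```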

Lucas-C · 2022-06-30T18:38:33Z

The fix has been merged into the master branch, but not released yet.

Note that you can install the latest unreleased version of fpdf2 using this command:

pip install git+https://github.com/PyFPDF/fpdf2.git@master

hackinteach added the bug label Jun 22, 2022

Lucas-C mentioned this issue Jun 22, 2022

Thai font collapse when using more than 1 tone marks #459

Closed

Lucas-C added unicode font labels Jun 22, 2022

Lucas-C self-assigned this Jun 29, 2022

Lucas-C closed this as completed in 22491d3 Jun 30, 2022
