Missing and unknown text in outline when use pdf.start_section() #458

Closed
hackinteach opened this issue Jun 22, 2022 · 9 comments

@hackinteach

Error details

I used pdf.start_section() with Thai text, and the generated outline entries are missing characters; some also have extra junk characters appended. I tried putting these words in a different order, and the error persists for the same words.
The problematic words are as follows:

  • ลักษณะ -> ลัก
  • ระดับฮอร์โมนเพศชาย -> ระดับฮอร์โมนเพศชาย਩㸾攊摮扯੪〱㤱〠漠橢㰊਼䌯畯瑮〠⼊敄瑳嬠㔱〠删⼠奘⁚⸰‰〵⸱㌷渠汵嵬⼊敎瑸ㄠ㈰‰‰੒倯牡湥⁴〱㠱〠删⼊楔汴⁥HӾ『䄎ᤎᤎᐎ㔎䀎ⴎ䜎ᤎ䀎ⴎȎⴎ܎Ў㠎ጎ‎⠀圀攀氀愀氀愀 匀挀漀爀攀⤀⤀㸊ਾ湥潤橢ㄊ㈰‰‰扯੪㰼⼊潃湵⁴ਲ䐯獥⁴㉛‱‰⁒堯婙〠〮㘠⸸㌰渠汵嵬⼊楆獲⁴〱ㄲ〠删⼊慌瑳ㄠ㐰‸‰੒丯硥⁴〱㤴〠删⼊慐敲瑮ㄠ㄰‸‰੒倯敲⁶〱㤱〠删⼊楔汴⁥H◾ㄎĎ⤎ጎ『䀎ऎḎ㈎『Ȏⴎ܎Ў㠎ጎ.....
  • ระดับฮอร์โมนเพศหญิง -> ระดับฮอร์โมนเพศหญิง਩㸾攊摮扯੪〱〴〠漠橢㰊਼䌯畯瑮〠⼊敄瑳嬠㌳〠删⼠奘⁚⸰‰㠲⸹㌱渠汵嵬⼊敎瑸ㄠ㐰‱‰੒倯牡湥⁴〱ㄲ〠删⼊牐癥ㄠ㌰‹‰੒启瑩敬⠠สีผม਩㸾攊摮扯੪〱ㄴ〠漠橢㰊਼䌯畯瑮〠⼊敄瑳嬠㌳〠删⼠奘⁚⸰‰㌲⸲㐴渠汵嵬⼊敎瑸ㄠ㐰′‰੒倯牡湥⁴〱ㄲ〠删⼊牐癥ㄠ㐰‰‰੒启瑩敬⠠ระดับฮอร์โมนเพศชาย਩㸾攊摮扯੪〱㈴〠漠橢㰊਼䌯畯瑮〠⼊敄瑳嬠㌳〠删⼠奘⁚⸰‰㜱⸵㔷渠汵嵬⼊敎瑸ㄠ㐰″‰੒倯牡湥⁴〱ㄲ〠删⼊牐癥ㄠ㐰‱‰੒启瑩敬⠠ความไวต่อการเจ็บปวด਩㸾攊摮扯......

I noticed that the junk text contains some words that are actually used elsewhere in the outline, such as สีผม, ระดับฮอร์โมนเพศชาย, ความไวต่อการเจ็บปวด.
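
One possible reading of the junk, offered as a rough sketch rather than a confirmed diagnosis: the first characters after each title look like raw PDF object syntax (`)`, `>>`, `endobj`) whose bytes were paired up and displayed as 16-bit characters. The snippet below assumes the viewer read those bytes as UTF-16 little-endian; that byte order is a guess and is not stated anywhere in this report.

```python
# Raw PDF syntax that would typically follow an outline title string.
pdf_syntax = b")\n>>\nendobj\n"

# Reinterpret each pair of bytes as one UTF-16 little-endian code unit.
as_utf16 = pdf_syntax.decode("utf-16-le")

# Print the resulting characters and their code points for comparison
# with the start of the junk text quoted above.
print(as_utf16)
print([hex(ord(ch)) for ch in as_utf16])

# Encoding the characters back gives the original ASCII bytes.
print(as_utf16.encode("utf-16-le"))
```

If the printed characters match the start of the junk, the viewer is probably reading past the end of the title string into the PDF objects that follow it.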

However, when I tried to write minimal code to reproduce the error, the same words as above disappeared from the outline entirely, and the first outline section was incomplete (try the code below).

Minimal code
Please download the Kanit font so that the Thai characters are rendered.

from fpdf import FPDF
from typing import List

# Make sure all cases use the same code
def generate_pdf(lst: List[str], output_name: str, font_path: str):
    pdf = FPDF()
    pdf.add_font("Kanit", fname=font_path)

    pdf.set_font("Kanit", size=20)
    pdf.set_text_color(0,0,0)

    curr_y = 20

    for i, txt in enumerate(lst):
        pdf.add_page()
        pdf.set_xy(20, curr_y)
        pdf.start_section(txt, level=0)
        pdf.cell(w=pdf.get_string_width(txt), h=10, align="C", txt=txt)

    pdf.output(output_name)

error_lst = ["ลักษณะเฉพาะของคุณ", "ระดับฮอร์โมนเพศชาย", "ระดับฮอร์โมนเพศหญิง", "helllo"]
ok_lst = list("abcdef")
mixed_lst =  ok_lst + error_lst

FONT_PATH = "<EDIT_ME>"  # replace with the path to the downloaded Kanit font file

# this one is missing most of the sections in outline
generate_pdf(error_lst, "error_outline.pdf", FONT_PATH) 

# this one contains all expected outlines
generate_pdf(ok_lst, "ok_outline.pdf", FONT_PATH)

# this one renders outlines for `a,b,c,d,e,f` but incomplete and contains junk with the sections from `error_lst`
generate_pdf(mixed_lst, "mixed_outline.pdf", FONT_PATH)

Error Result

OK Result

Mixed result

Environment

  • OS: OS X 12.4 Monterey
  • Python version: 3.8.12
  • fpdf2 version used: 2.5.5
@Lucas-C
Member

Lucas-C commented Jun 22, 2022

Thank you for reporting this.

This may be related to #365 & #459
I do not know the Thai language: is your issue, @hackinteach, related to tone marks, as described in #459?

@hackinteach
Author

Not really, in my opinion, since this is the PDF outline, which doesn't use the custom font for rendering.

@gmischler
Collaborator

This may be related to #365 & #459

Or maybe rather related to #320 ?

@Lucas-C
Member

Lucas-C commented Jun 23, 2022

I have a starting point for a fix, in syntax.py:

from binascii import hexlify
import codecs
class PDFString(str):
    def serialize(self):
        # Using the "Hexadecimal String" format defined in the PDF spec:
        return '<%s>' % hexlify(codecs.BOM_UTF16_BE + self.encode('utf-16-be')).decode('latin-1')

With this new string serialization logic, @hackinteach's code generates PDFs that display a readable outline in Adobe Acrobat reader.
Sadly, for some reason that I cannot figure out right now, the resulting PDFs display no outline when opened with Sumatra PDF reader!

What's even more frustrating is that if I pass the resulting PDF file to the qpdf command, without any option, the final PDF has an outline visible in both PDF readers! 🤣

qpdf error_outline.pdf error_outline-clean.pdf

But I can't figure out exactly what kind of "clean-up" qpdf is performing there...

Any help would be appreciated 😊
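
For anyone wanting to try the serialization above outside of fpdf2, here is a minimal standalone sketch. The function name `to_pdf_hex_string` is made up for illustration and is not part of fpdf2; it only mirrors the one-liner from the snippet above.

```python
from binascii import hexlify
import codecs


def to_pdf_hex_string(text: str) -> str:
    """Serialize text as a PDF hexadecimal string: a BOM followed by UTF-16BE bytes."""
    raw = codecs.BOM_UTF16_BE + text.encode("utf-16-be")
    return "<%s>" % hexlify(raw).decode("latin-1")


# Hypothetical usage with one of the outline titles from the report.
print(to_pdf_hex_string("สีผม"))
print(to_pdf_hex_string("A"))
```

Because the output contains only ASCII hex digits between angle brackets, it avoids any need to escape parentheses or backslashes in the title.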

@gmischler
Collaborator

But I can't figure out exactly what kind of "clean-up" qpdf is performing there...

Sounds like you'd have to compare the binary data in the two files.
Not the most pleasant type of debugging... 😉

@Lucas-C
Member

Lucas-C commented Jun 27, 2022

Sounds like you'd have to compare the binary data in the two files.

As qpdf re-orders & re-formats all PDF objects, this is quite difficult...
I guess the best approach would be to build a really minimal version of both the "OK" & "KO" files to figure out what makes the difference...

@Lucas-C
Member

Lucas-C commented Jun 29, 2022

I submitted #463, which should fix this.

@Lucas-C Lucas-C self-assigned this Jun 29, 2022
@gmischler
Collaborator

Ah yes, in hindsight it makes sense that UTF-16 strings would need a BOM.
Nice catch! 👍
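
To illustrate why the BOM matters, here is a small sketch of how a reader might decode a PDF text string. It assumes the usual rule that a string starting with the bytes FE FF is UTF-16BE and anything else is PDFDocEncoding. The function name is hypothetical, and Latin-1 is used only as a rough stand-in for PDFDocEncoding.

```python
import codecs


def decode_pdf_text_string(raw: bytes) -> str:
    """Decode a PDF text string, using the leading BOM to pick the encoding."""
    if raw.startswith(codecs.BOM_UTF16_BE):
        return raw[len(codecs.BOM_UTF16_BE):].decode("utf-16-be")
    # Assumption: Latin-1 approximates PDFDocEncoding for this demo.
    return raw.decode("latin-1")


title = "ระดับฮอร์โมนเพศชาย"
with_bom = codecs.BOM_UTF16_BE + title.encode("utf-16-be")
without_bom = title.encode("utf-16-be")

# With the BOM, the title round-trips.
print(decode_pdf_text_string(with_bom) == title)
# Without it, the same bytes are misread as single-byte text.
print(decode_pdf_text_string(without_bom) == title)
```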

@Lucas-C
Member

Lucas-C commented Jun 30, 2022

The fix has been merged into the master branch, but not released yet.

Note that you can install the latest unreleased version of fpdf2 using this command:

pip install git+https://github.com/PyFPDF/fpdf2.git@master
