Use Unicode Line Breaking Algorithm wrapping with po2md #153

Closed
dingyifei opened this issue Jul 12, 2021 · 19 comments · Fixed by #273 or #274
Labels
enhancement New feature or request rendering-format Some related with a problem with rendering format

Comments

@dingyifei

dingyifei commented Jul 12, 2021

Two issues are described below:

  1. Scriptio continua languages incompatible with wrap width

po2md throws an error when it processes the following gettext entry.

The section of Gettext:


#: docs/Features.md:block 5 (unordered list)
msgid ""
"Support for common temperature sensors (eg, common thermistors, AD595, "
"AD597, AD849x, PT100, PT1000, MAX6675, MAX31855, MAX31856, MAX31865, BME280,"
" HTU21D, and LM75). Custom thermistors and custom analog temperature sensors"
" can also be configured."
msgstr ""
"支持常见的温度传感器(例如,常见的热敏电阻、AD595、AD597、AD849x、PT100、PT1000、MAX6675、MAX31855、MAX31856、MAX31865、BME280、HTU21D和LM75)。还可以配置自定义热敏电阻和自定义模拟温度传感器。"

Error

$ po2md docs/Features.md --md-encoding utf-8 -p docs/locales/zh-hans/LC_MESSAGES/Features.po -q -s ./docs/locales/zh-hans/Features.md
Traceback (most recent call last):
  File "c:\program files\python39\lib\runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "c:\program files\python39\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\Program Files\Python39\Scripts\po2md.exe\__main__.py", line 7, in <module>
  File "c:\program files\python39\lib\site-packages\mdpo\po2md\__main__.py", line 117, in main
    sys.exit(run(args=sys.argv[1:])[1])  # pragma: no cover
  File "c:\program files\python39\lib\site-packages\mdpo\po2md\__main__.py", line 101, in run
    output = pofile_to_markdown(
  File "c:\program files\python39\lib\site-packages\mdpo\po2md\__init__.py", line 882, in pofile_to_markdown
    return Po2Md(
  File "c:\program files\python39\lib\site-packages\mdpo\po2md\__init__.py", line 804, in translate
    parser.parse(
  File "c:\program files\python39\lib\site-packages\mdpo\po2md\__init__.py", line 485, in leave_block
    self._save_current_msgid()
  File "c:\program files\python39\lib\site-packages\mdpo\po2md\__init__.py", line 337, in _save_current_msgid
    lines = wrap_different_first_line_width(
  File "c:\program files\python39\lib\site-packages\mdpo\text.py", line 164, in wrap_different_first_line_width
    li_subsequent_lines[0] = li_subsequent_lines[0].lstrip()
IndexError: list index out of range

I fiddled with the code a bit, and it seems that setting break_long_words=True in the if self._inside_liblock or self._inside_quoteblock: branch avoids this error (although not perfectly). Since I also found a second issue related to text wrapping, I wonder whether the two could be fixed together.
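The IndexError in the traceback is consistent with how textwrap treats an unbroken run of characters. The following is a minimal sketch, assuming a simplified stand-in for wrap_different_first_line_width (the helper name split_first_line is hypothetical and is not mdpo's code): with break_long_words=False the whole string comes back as the first line, the remainder is empty, and textwrap.wrap("") returns an empty list, so indexing it with [0] fails.

```python
import textwrap


def split_first_line(text, width=80):
    """Hypothetical simplification of the failing logic, not mdpo's code."""
    # With break_long_words=False an unbroken run is returned whole.
    first = textwrap.wrap(text, width=width, break_long_words=False)[0]
    # Nothing is left over, so this wraps an empty string.
    rest = textwrap.wrap(text[len(first):], width=width, break_long_words=False)
    return first, rest


first, rest = split_first_line("支" * 120)

# Guard before indexing; an unguarded rest[0] would raise IndexError here.
if rest:
    rest[0] = rest[0].lstrip()
```

Checking that the list is non-empty avoids the crash, but the line is still left unwrapped, which is the behavior discussed later in this thread.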

  2. Links are being wrapped incorrectly
    (screenshot)
    textwrap tries to wrap inside links when len(text) > width. Neither the GitHub nor the VS Code Markdown renderer can render a link with a line break in the middle. Setting break_on_hyphens=False resolves this, but it makes textwrap produce less optimal wraps overall.

I'm not sure how to fix these two problems, since both fixes bring drawbacks. Exposing additional TextWrapper parameters might be an intermediate solution, but wrapping by string length, link length included, could still raise the index-out-of-range error when a string contains a very long link without a -.

@mondeja mondeja added the bug Something isn't working label Jul 12, 2021
@dingyifei
Author

dingyifei commented Jul 12, 2021

Note that enabling break_long_words=True breaks the quick fix of break_on_hyphens=False for links.
(screenshot)

I propose treating the links as inline code blocks or a more optimal solution where it allows something like this:

Original:

[11111111111111111111111111111111111111111111111111111111111111111111111111111111111 2222222222222](wehfiuhiwehriewhriuwe)

Output:

[11111111111111111111111111111111111111111111111111111111111111111111111111111111111](wehfiuhiwehriewhriuwe)
[2222222222222](wehfiuhiwehriewhriuwe)
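One way to read the "treat links as inline code" idea is to make each Markdown link an unbreakable token. The sketch below is an assumption about how that could look, not what mdpo does: TOKEN_RE, tokenize, and wrap_keeping_links are hypothetical names, the regex only covers simple inline links of the form [text](target), and it does not split one link into two as in the proposed output above.

```python
import re

# A token is a run of non-space characters; an inline link counts as part of
# the run even when its text contains spaces.
TOKEN_RE = re.compile(r"(?:\[[^\]]*\]\([^)]*\)|\S)+")


def tokenize(text):
    """Return whitespace-separated tokens, keeping inline links whole."""
    return TOKEN_RE.findall(text)


def wrap_keeping_links(text, width=80):
    """Greedy wrap that never places a line break inside a link."""
    lines, current = [], ""
    for token in tokenize(text):
        candidate = f"{current} {token}" if current else token
        if current and len(candidate) > width:
            lines.append(current)
            current = token
        else:
            current = candidate
    if current:
        lines.append(current)
    return lines


wrapped = wrap_keeping_links("aaa [bb cc](http://x) ddd", width=10)
```

A link longer than the width simply overflows its line, which keeps it renderable at the cost of exceeding the requested width.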

@dingyifei
Author

dingyifei commented Jul 12, 2021

As for scriptio continua, I suspect these languages simply need a looser treatment that breaks at whatever point is convenient. Some extra detection may still be needed: if a line contains spaces or other characters such as numbers, it can be sensible to break on the space instead. The Unicode ranges for CJK characters might help with Chinese, Japanese, and Korean scriptio continua.
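A range check of the kind described in that Stack Overflow answer could be sketched as follows. The ranges listed are a small illustrative subset (they leave out extension blocks and CJK punctuation), and the names CJK_RANGES, is_cjk, and has_cjk are hypothetical.

```python
# Illustrative subset of code point ranges; not exhaustive.
CJK_RANGES = (
    (0x3040, 0x309F),  # Hiragana
    (0x30A0, 0x30FF),  # Katakana
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
    (0xAC00, 0xD7AF),  # Hangul Syllables
)


def is_cjk(ch):
    """Return True if the character falls in one of the ranges above."""
    code_point = ord(ch)
    return any(low <= code_point <= high for low, high in CJK_RANGES)


def has_cjk(text):
    """Return True if any character of the text is CJK by the check above."""
    return any(is_cjk(ch) for ch in text)
```

A wrapper could use has_cjk to decide whether a line needs the looser breaking rules at all, leaving plain Latin text on the existing code path.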

https://stackoverflow.com/questions/1366068/whats-the-complete-range-for-chinese-characters-in-unicode#1366113

@mondeja
Owner

mondeja commented Jul 14, 2021

  1. Scriptio continua languages incompatible with wrap width

This has been fixed in v0.3.60.

@dingyifei
Author

  1. Scriptio continua languages incompatible with wrap width

This has been fixed in v0.3.60.

Thank you! Although it solves the problem, I think this solution, which stops -w from taking effect, is suboptimal. I have been fiddling with the code for the last two days and came up with a similar solution: #154. It enables wrapping of long words when scriptio continua text appears. I'm not sure what the side effects of turning it on are, and I'm not good at coding, so my confidence that it works correctly is low. At least it works for the line that previously threw an error:
(screenshot)

@mondeja
Owner

mondeja commented Jul 14, 2021

Thank you! Although it solves the problem, I think this solution, which stops -w from taking effect, is suboptimal

Could you please share a reproducible example? Where is -w not taking effect? I can't read or understand Chinese.

Links are being wrapped incorrectly

Should have been fixed in v0.3.61.

@dingyifei
Author

dingyifei commented Jul 14, 2021

While using #154, I noticed that --width is not well enforced for text whose characters have different actual widths (numbers, Chinese characters, periods, and more are affected by kerning). In my opinion the assumption of fixed-width text isn't optimal for Markdown, since different renderers use different fonts and some characters take more than one column.

Below is an example: the code block uses a monospace (fixed-width) font and the normal Markdown rendering doesn't.
-
..
1-
1..
https://stackoverflow.com/questions/43828955/measuring-width-of-text-python-pil

A reproducible example is the following

input

* Support for common temperature sensors (eg, common thermistors,
  AD595, AD597, AD849x, PT100, PT1000, MAX6675, MAX31855, MAX31856,
  MAX31865, BME280, HTU21D, and LM75). Custom thermistors and custom
  analog temperature sensors can also be configured.

input po

#
msgid ""
msgstr ""

msgid ""
"Support for common temperature sensors (eg, common thermistors, "
"AD595, AD597, AD849x, PT100, PT1000, MAX6675, MAX31855, MAX31856, "
"MAX31865, BME280, HTU21D, and LM75). Custom thermistors and custom "
"analog temperature sensors can also be configured."
msgstr ""
"支持常见的温度传感器(例如,常见的热敏电阻、AD595、AD597、AD849x、PT100、PT1000、MAX6675、MAX31855、MAX31856、MAX31865、BME280、HTU21D和LM75)。还可以配置自定义热敏电阻和自定义模拟温度传感器。"

output md (note that it is a single line)

* 支持常见的温度传感器(例如,常见的热敏电阻、AD595、AD597、AD849x、PT100、PT1000、MAX6675、MAX31855、MAX31856、MAX31865、BME280、HTU21D和LM75)。还可以配置自定义热敏电阻和自定义模拟温度传感器。

My modification results in the following output

* 支持常见的温度传感器(例如,常见的热敏电阻、AD595、AD597、AD849x、PT100、PT1000、MAX6675、MAX31855、MAX3185
6、MAX31865、BME280、HTU21D和LM75)。还可以配置自定义热敏电阻和自定义模拟温度传感器。

Note that the first line above is more than 80 columns wide. It is significantly longer because Chinese characters are double-width.
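The gap between len() and the width seen in a monospace terminal can be estimated with the standard library. This is a rough sketch, assuming that characters whose East Asian Width property is W (wide) or F (fullwidth) occupy two columns and everything else occupies one; display_width is a hypothetical helper name, and combining or zero-width characters are ignored.

```python
import unicodedata


def display_width(text):
    """Approximate the number of monospace columns the text occupies."""
    return sum(
        2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
        for ch in text
    )


sample = "支持AD"
columns = display_width(sample)  # two wide characters plus two narrow ones
```

Wrapping against display_width instead of len would keep lines such as the one above within the requested number of columns in a monospace view.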

@dingyifei
Author

Links are being wrapped incorrectly

Should have been fixed in v0.3.61.

It's fixed on all of my markdown files too, thank you!

@dingyifei
Author

dingyifei commented Jul 14, 2021

I found another test case where -w isn't applied, and this one is probably a bit more universal. I tested it with English characters as well: for example, -w isn't applied to hellooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo.

mdpo.zip

Additionally, I don't think my modification

        if len(li_first_line) > width:
            kwargs["break_long_words"] = True
            li_first_line = textwrap.wrap(
                text,
                width=width + first_line_width_diff,
                max_lines=4,
                placeholder='?',
                **kwargs,
            )[0]
        li_subsequent_lines = textwrap.wrap(
            text[len(li_first_line):],
            width=width,
            **kwargs,
        )
        li_subsequent_lines[0] = li_subsequent_lines[0].lstrip()

makes any difference in this case (which is strange), or does anything at all in v0.3.61.

@mondeja
Owner

mondeja commented Jul 14, 2021

While I'm using #154, I noticed the --width is not well enforced for text with different actual widths(numbers, Chinese characters, periods, and more are affected by Kerning)...

It seems that textwrap treats that string as one long word because separator characters such as 、 are not spaces. Here is a reproducible example:

>>> import textwrap
>>> text = "支持常见的温度传感器(例如,常见的热敏电阻、AD595、AD597、AD849x、PT100、PT1000、MAX6675、MAX31855、MAX31856、MAX31865、BME280、HTU21D和LM75)。还可以配置自定义热敏电阻和自定义模拟温度传感器。"
>>> textwrap.wrap(text, break_long_words=False)
['支持常见的温度传感器(例如,常见的热敏电阻、AD595、AD597、AD849x、PT100、PT1000、MAX6675、MAX31855、MAX31856、MAX31865、BME280、HTU21D和LM75)。还可以配置自定义热敏电阻和自定义模拟温度传感器。']
>>> textwrap.wrap(text.replace('、', ' '), break_long_words=False)
['支持常见的温度传感器(例如,常见的热敏电阻 AD595 AD597 AD849x PT100 PT1000 MAX6675 MAX31855', 'MAX31856 MAX31865 BME280 HTU21D和LM75)。还可以配置自定义热敏电阻和自定义模拟温度传感器。']
>>> textwrap.wrap(text, break_long_words=True)
['支持常见的温度传感器(例如,常见的热敏电阻、AD595、AD597、AD849x、PT100、PT1000、MAX6675、MAX31855', '、MAX31856、MAX31865、BME280、HTU21D和LM75)。还可以配置自定义热敏电阻和自定义模拟温度传感器。']

The problem is simple, but I'm afraid a solution will not arrive soon. Python seems to lack a reliable Unicode Line Breaking Algorithm implementation, so higher-level developers opt for the ASCII-oriented textwrap, which doesn't have good Unicode support (see bugs.python.org#41975, izimobil/polib#96, AFPy/powrap#76).

@mondeja mondeja closed this as completed Jul 14, 2021
@mondeja
Owner

mondeja commented Jul 14, 2021

the -w isn't applied to hellooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo

Why would it be applied? It's a single word, so it can't be split. Remember: a soft break \n in Markdown means a space inside text blocks.
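The behavior described here can be seen with textwrap alone. The short sketch below uses only the standard library; the 88-character word is made up for illustration.

```python
import textwrap

word = "hell" + "o" * 84  # a single 88-character "word"

# The word is kept whole even though it exceeds the width.
kept = textwrap.wrap(word, width=80, break_long_words=False)

# The word is cut at the width, with no regard for where a break makes sense.
split = textwrap.wrap(word, width=80, break_long_words=True)
```

In Markdown the second result would render as two words, since the soft break becomes a space, which is why leaving the word unbroken is the safer default for space-separated languages.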

@dingyifei
Author

the -w isn't applied to hellooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo

Why would it be applied? It's a single word, so it can't be split.

Basically, languages that use scriptio continua have no spaces between words, so a sentence behaves like one extremely long word. I don't think there is an existing way to solve this problem; I'll see whether I can do anything with the image-drawing approach.

@mondeja
Owner

mondeja commented Jul 14, 2021

Basically, for languages that used scriptio continua, the words have no space in between

You're referring, in practice, to Chinese and Japanese, right? The Unicode Line Breaking Algorithm can handle those cases: see section 6, Line Breaking Algorithm, which defines non-mandatory line break opportunities.
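To make "line break opportunities" concrete, here is a deliberately simplified sketch. It is not an implementation of the Unicode Line Breaking Algorithm and is not what mdpo uses: it assumes a break is allowed after a space and next to any wide East Asian character, forbids a break before a few closing punctuation marks and after a few opening ones, and measures width with len(). All names (NO_BREAK_BEFORE, NO_BREAK_AFTER, segments, wrap_cjk) are hypothetical.

```python
import unicodedata

# Small illustrative subsets; the real algorithm defines many more classes.
NO_BREAK_BEFORE = set("、。,.)」』!?:;")
NO_BREAK_AFTER = set("(「『")


def is_wide(ch):
    """True for characters with East Asian Width W (wide) or F (fullwidth)."""
    return unicodedata.east_asian_width(ch) in ("W", "F")


def segments(text):
    """Split text into pieces that must stay together on one line."""
    pieces, start = [], 0
    for i in range(1, len(text)):
        prev, cur = text[i - 1], text[i]
        if cur == " " or cur in NO_BREAK_BEFORE or prev in NO_BREAK_AFTER:
            continue  # no break opportunity at this position
        if prev == " " or is_wide(prev) or is_wide(cur):
            pieces.append(text[start:i])
            start = i
    pieces.append(text[start:])
    return pieces


def wrap_cjk(text, width):
    """Greedily fill lines with unbreakable pieces, up to width characters."""
    lines, current = [], ""
    for piece in segments(text):
        if current and len((current + piece).rstrip()) > width:
            lines.append(current.rstrip())
            current = piece
        else:
            current += piece
    if current.strip():
        lines.append(current.rstrip())
    return lines
```

For example, wrap_cjk("你好,世界", 2) keeps the comma attached to 好 rather than letting it start a line. A full implementation would follow the rules of the Unicode standard annex instead of these two small sets.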

@dingyifei
Author

Basically, for languages that used scriptio continua, the words have no space in between

You're referring, in practice, to Chinese and Japanese, right? The Unicode Line Breaking Algorithm can handle those cases: see section 6, Line Breaking Algorithm, which defines non-mandatory line break opportunities.

Thank you! I'll look into it!

@mondeja mondeja reopened this Mar 9, 2022
@mondeja mondeja changed the title bug: po2md wrapwidth issues Consider Unicode Line Breaking Algorithm wrapping with po2md May 9, 2022
@mondeja mondeja added rendering-format Some related with a problem with rendering format enhancement New feature or request help wanted Extra attention is needed and removed bug Something isn't working labels May 9, 2022
@mondeja mondeja changed the title Consider Unicode Line Breaking Algorithm wrapping with po2md Use Unicode Line Breaking Algorithm wrapping with po2md May 9, 2022
@mondeja mondeja removed the help wanted Extra attention is needed label May 10, 2022
@mondeja
Owner

mondeja commented Feb 10, 2023

This problem can be solved with https://github.com/mondeja/py-unicode-linebreak, I'll try to do it ASAP.

@mondeja
Owner

mondeja commented Feb 10, 2023

Hi @dingyifei, the problem seems solved in v1.1.1.

I used your example to implement it. Since I can't read these languages, I don't really know whether it is fixed correctly. If you find problems, just let me know.

@dingyifei
Author

dingyifei commented Feb 10, 2023

Thank you for the update! There may be some room for improvement:

 po2md ./docs/Features.md -w 25 -s test.md -p ./docs/locales/zh_Hans/LC_MESSAGES/Features.po 
 po2md ./docs/Features.md -w 20 -s test1.md -p ./docs/locales/zh_Hans/LC_MESSAGES/Features.po 

test.md
test1.md

sections of test1.md:

* 高精度步进运动。Klipper使用一个应用处理器(例如低成本的树莓派)来计算打印机运动。应用处理器决定何时对每个步进电机发出步进信号,压缩这些事件,并将它们发送到微控制器。微处理器将会把每个事件按请求时间执行。每一个步进事件被以25毫秒或更高的精度安排。Klipper不使用运动估计,例如
Bresenham算法,而是通过加速度与机械运动物理计算精确的步进时间。更精准的步进电机运动意味着打印机更安静和稳定的运行。
* 同类项目中最佳的性能。
Klipper
能够在新旧微控制器上实现高步进速率。即使是旧的
8 位微控制器也可以发送超过每秒
175K
步的速率。在较新的微控制器上,每秒数百万步也可以实现。更高的步进速率可以实现更高的打印速度。步进事件计时即使在高速下也能保持精确,提高了整体稳定性。

You can see that no wrapping is applied to several rows.

A better result might look approximately like this:

* 高精度步进运动。Klipper使用一个应用处理器(
例如低成本的树莓派)来计算打印机运动。应用处理
器决定何时对每个步进电机发出步进信号,压缩这些
事件,并将它们发送到微控制器。微处理器将会把每
个事件按请求时间执行。每一个步进事件被以25毫秒
或更高的精度安排。Klipper不使用运动估计,例如
Bresenham算法,而是通过加速度与机械运动物理计
算精确的步进时间。更精准的步进电机运动意味着打
印机更安静和稳定的运行。
* 同类项目中最佳的性能。Klipper 能够在新旧微控
制器上实现高步进速率。即使是旧的 8 位微控制器也
可以发送超过每秒175K步的速率。在较新的微控制器
上,每秒数百万步也可以实现。更高的步进速率可以
实现更高的打印速度。步进事件计时即使在高速下也
能保持精确,提高了整体稳定性。

A test case maybe:
Input:

sssssssss,sssssssssssssssabcd, ssssabcdsssssabcdsssss,sss,abcdsssss.abcdsssss.sssabcdefgss.

Output:

sssssssss,
sssssssss
ssssss
abcd,ssss
abcdsssss
abcdsssss,
sss,
abcdsssss.
abcdsssss.
sss
abcdefgss.

Here the s characters stand for something like Chinese, and abcdefg for something like English.

@dingyifei
Author

dingyifei commented Feb 10, 2023

In some parts, it seems to do a better job:

* 自定义可编程脚本。可以在打印机配置
文件中定义新的G代码命令(而不需要修
改任何代码)。这些命令都是可编程的,
可以能根据打印机的状态做出不同的响应。

I think a line that starts with an English word triggers it. An example of the bug would be:

* Klipper
支持带有多个微控制器的打印机。例如,一个微控制器可以被用来控制挤出机,而另一个用来控制加热器,并使用第三个来控制其他的打印机组件。Klipper
主机程序实现了时钟同步,解决了微处理器之间的时钟漂移。

@mondeja mondeja reopened this Feb 10, 2023
@dingyifei
Author

I realigned the example a bit. Note that the commas and periods (, and 。) take two columns but get pushed to column n+1. I don't think that behavior is worth implementing as the default, but the later example, where wrapping no longer takes effect in certain cases, is a clear issue.

@mondeja
Owner

mondeja commented Feb 19, 2023

Thanks for the detailed report, @dingyifei. This problem is fixed in v1.1.2. It is probably not perfect; if you find more inconsistencies, please share reproducible examples.
