Multiline msgids aren't wrapped the same way that xgettext wraps them #96

verhovsky · 2020-11-12T14:36:50Z

xgettext -c somefile.c can produce a po file with an entry like this:

msgid ""
"this is a long piece of text that should wrap on to multiple lines this is a "
"long piece of text that should wrap on to multiple lines"
msgstr ""

but then if you just re-save it using from polib import pofile; pofile("messages.po", encoding="utf-8").save(), it will wrap it differently:

msgid ""
"this is a long piece of text that should wrap on to multiple lines this is a"
" long piece of text that should wrap on to multiple lines"
msgstr ""

polib (really Python's textwrap.wrap() function) puts the space at the beginning of the second line instead of at the end of the first.
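A minimal sketch of that behaviour using only the standard library. It assumes whitespace is preserved with drop_whitespace=False (a PO writer has to keep every space so the pieces still concatenate to the original string); the short sample text and the move_leading_spaces_up helper are hypothetical, made up for illustration, and are not part of polib or gettext.

```python
import textwrap

text = "aaa bbb ccc"

# With drop_whitespace=False every space is kept. The space that no longer
# fits on the first line is carried over to the start of the next one.
wrapped = textwrap.wrap(text, width=7, drop_whitespace=False)
print(wrapped)  # ['aaa bbb', ' ccc']


def move_leading_spaces_up(lines):
    """Hypothetical post-processing: shift leading spaces to the previous line."""
    result = []
    for line in lines:
        stripped = line.lstrip(" ")
        moved = len(line) - len(stripped)
        if result and moved:
            # Append the spaces to the end of the previous piece instead.
            result[-1] += " " * moved
            result.append(stripped)
        else:
            result.append(line)
    return result


shifted = move_leading_spaces_up(wrapped)
print(shifted)  # ['aaa bbb ', 'ccc']
```

Note that shifting the space lets the previous line grow past the requested width by the moved characters, so this only imitates where xgettext puts the space; it is not guaranteed to reproduce xgettext's line breaks in general.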

This is an issue because using a command line tool that uses polib on a Django project would cause churn in the git history as it shuffles spaces between lines.

I don't know why it's starting msgid with an empty string.

Here's a bash session showing the issue with a code sample that causes xgettext to produce an entry that starts with an empty string:

$ mkdir /tmp/polib_test
$ cd /tmp/polib_test
$ cat > test.c
main( ) {
    printf(gettext("this is a long piece of text that should wrap on to multiple lines this is a long piece of text that should wrap on to multiple lines"))
}
$ xgettext -c test.c 
$ ls
messages.po  test.c
$ cat messages.po 
# SOME DESCRIPTIVE TITLE.
# Copyright (C) YEAR THE PACKAGE'S COPYRIGHT HOLDER
# This file is distributed under the same license as the PACKAGE package.
# FIRST AUTHOR <EMAIL@ADDRESS>, YEAR.
#
#, fuzzy
msgid ""
msgstr ""
"Project-Id-Version: PACKAGE VERSION\n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2020-11-12 09:24-0500\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language-Team: LANGUAGE <LL@li.org>\n"
"Language: \n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=CHARSET\n"
"Content-Transfer-Encoding: 8bit\n"

#: test.c:2
#, c-format
msgid ""
"this is a long piece of text that should wrap on to multiple lines this is a "
"long piece of text that should wrap on to multiple lines"
msgstr ""
$ cp messages.po original.po
$ python3
Python 3.9.0+ (default, Oct 19 2020, 09:51:18) 
[GCC 10.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from polib import pofile
>>> po = pofile("messages.po", encoding="utf-8")
>>> po.save()
>>> 
$ diff original.po messages.po 
23,24c23,24
< "this is a long piece of text that should wrap on to multiple lines this is a "
< "long piece of text that should wrap on to multiple lines"
---
> "this is a long piece of text that should wrap on to multiple lines this is a"
> " long piece of text that should wrap on to multiple lines"
$ xgettext --version
xgettext (GNU gettext-tools) 0.19.8.1
Copyright (C) 1995-1998, 2000-2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Written by Ulrich Drepper.


verhovsky · 2020-11-13T14:51:51Z

Looking at the docs, gettext starts the msgid with an empty line on purpose for "better alignment":

Each of untranslated-string and translated-string respects the C syntax for a character string, including the surrounding quotes and embedded backslashed escape sequences. When the time comes to write multi-line strings, one should not use escaped newlines. Instead, a closing quote should follow the last character on the line to be continued, and an opening quote should resume the string at the beginning of the following PO file line. For example:
msgid ""
"Here is an example of how one might continue a very long string\n"
"for the common case the string represents multi-line output.\n"
In this example, the empty string is used on the first line, to allow better alignment of the H from the word ‘Here’ over the f from the word ‘for’. In this example, the msgid keyword is followed by three strings, which are meant to be concatenated. Concatenating the empty string does not change the resulting overall string, but it is a way for us to comply with the necessity of msgid to be followed by a string on the same line, while keeping the multi-line presentation left-justified, as we find this to be a cleaner disposition. The empty string could have been omitted, but only if the string starting with ‘Here’ was promoted on the first line, right after msgid. It was not really necessary either to switch between the two last quoted strings immediately after the newline ‘\n’, the switch could have occurred after any other character, we just did it this way because it is neater.

https://www.gnu.org/software/gettext/manual/gettext.html#PO-Files
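Since the quoted pieces are simply concatenated, both wrappings from the original report describe the same msgid, and the difference is purely cosmetic. A rough sketch of that check follows; join_po_strings is a hypothetical helper written for this illustration (not a polib or gettext function), and it deliberately ignores escape sequences.

```python
def join_po_strings(quoted_lines):
    """Concatenate the quoted pieces of a multi-line PO string.

    Simplification: each piece is assumed to be wrapped in double quotes,
    and escape sequences are left untouched.
    """
    return "".join(line.strip()[1:-1] for line in quoted_lines)


xgettext_style = [
    '""',
    '"this is a long piece of text that should wrap on to multiple lines this is a "',
    '"long piece of text that should wrap on to multiple lines"',
]
polib_style = [
    '""',
    '"this is a long piece of text that should wrap on to multiple lines this is a"',
    '" long piece of text that should wrap on to multiple lines"',
]

# Both layouts decode to the same string; only the line breaks differ.
same = join_po_strings(xgettext_style) == join_po_strings(polib_style)
print(same)  # True
```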

izimobil · 2021-02-20T15:12:27Z

This is due to the differences between the Python standard library's textwrap module and the gettext wrapper.
I don't plan to rewrite a text wrapper from scratch!
If someone comes up with a solution for this, please make a pull request.

mondeja · 2021-02-20T15:37:16Z

This issue of the Python bug tracker seems related.

verhovsky · 2021-02-20T15:54:55Z

Is there an xgettext command that re-wraps a po file into the same format? That would work for my use case: I could just call that command every time I save a file with polib.

izimobil · 2021-02-20T17:15:47Z

@verhovsky:

$ msgcat input.po -o output.po
or
$ msgcat input.po -w78 -o output.po

should do the trick.

verhovsky · 2021-02-22T15:03:01Z

For reference, it looks like in the gettext source code, the width is set here

https://git.savannah.gnu.org/cgit/gettext.git/tree/gettext-tools/src/write-po.c?id=cd861ce28d9c2bb98c05ff8b5580bec2c805d4bf#n1007

and then gets passed to ulc_width_linebreaks

https://git.savannah.gnu.org/cgit/gettext.git/tree/gettext-tools/src/write-po.c?id=cd861ce28d9c2bb98c05ff8b5580bec2c805d4bf#n1035

and then the code does some stuff with the result.

ulc_width_linebreaks is defined in gnulib here

https://git.savannah.gnu.org/gitweb/?p=gnulib.git;a=blob;f=lib/unilbrk/ulc-width-linebreaks.c;hb=HEAD

documented here

https://www.gnu.org/software/libunistring/manual/html_node/unilbrk_002eh.html

Ideally, someone would make Python bindings for unilbrk and then re-implement the rest of the code in write-po.c in Python, but I can confirm that using msgcat after every po.save() works just as well. The only caveat is that I had to set the CHARSET of the .po file (otherwise msgcat errors), which I did like this:

sed -i 's/charset=CHARSET/charset=UTF-8/' messages.po

Then you can just do this:

import subprocess
from polib import pofile

filename = "messages.po"

subprocess.run(["bash", "-c", "command -v msgcat"], check=True)  # check that the msgcat command is available

po = pofile(filename, encoding="utf-8")
po.save()
subprocess.run(["msgcat", filename, "-o", filename], check=True)

PS. -w78 is not correct: I got different results from the original. I think -w79 is the right one, but not passing it at all works as well.
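The two steps (fixing the charset placeholder, then re-wrapping with msgcat) could be bundled roughly as follows. This is only a sketch: the function names set_utf8_charset, msgcat_command and rewrap_with_msgcat are made up for this example, and it assumes GNU gettext's msgcat is on the PATH and that the file is UTF-8. Leaving width as None omits the width option entirely, which the comment above reports as matching xgettext's output.

```python
import shutil
import subprocess
from pathlib import Path


def set_utf8_charset(path):
    """Replace the charset=CHARSET placeholder, like the sed command above."""
    po_path = Path(path)
    text = po_path.read_text(encoding="utf-8")
    po_path.write_text(
        text.replace("charset=CHARSET", "charset=UTF-8"), encoding="utf-8"
    )


def msgcat_command(path, width=None):
    """Build the msgcat argument list; width=None keeps msgcat's default."""
    command = ["msgcat", path, "-o", path]
    if width is not None:
        command.insert(1, f"--width={width}")
    return command


def rewrap_with_msgcat(path, width=None):
    """Re-wrap a PO file in place with msgcat (hypothetical helper)."""
    if shutil.which("msgcat") is None:
        raise RuntimeError("msgcat not found; is GNU gettext installed?")
    set_utf8_charset(path)
    subprocess.run(msgcat_command(path, width), check=True)
```

Calling rewrap_with_msgcat("messages.po") right after po.save() would then restore the xgettext-style wrapping, under the assumptions stated above.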

mondeja · 2023-02-22T17:43:27Z

I think that the easiest solution to this problem is to generate Python bindings for the Rust crate textwrap which offers multiplatform Unicode Line Breaking wrapping.

tosky · 2023-02-23T14:27:24Z

> I think that the easiest solution to this problem is to generate Python bindings for the Rust crate textwrap which offers multiplatform Unicode Line Breaking wrapping.

As a user of a pure Python library, I would disagree with this. Adding a dependency on a library that requires recompilation is not exactly the easiest solution.

mondeja · 2023-02-23T19:09:40Z

> Adding a dependency on a library that requires recompilation is not exactly the easiest solution.

You can serve wheels for a lot of platforms; it's very easy. In fact, I'm thinking of rewriting polib entirely in Rust, which would optimize the library when used from Python.

izimobil · 2023-02-23T19:21:03Z

> > Adding a dependency on a library that requires recompilation is not exactly the easiest solution.
>
> You can serve wheels for a lot of platforms; it's very easy. In fact, I'm thinking of rewriting polib entirely in Rust, which would optimize the library when used from Python.

I'm not sure I understand your point. You're talking about a Rust rewrite; how on earth can this solve this particular issue?!

mondeja · 2023-02-23T20:29:42Z

> I'm not sure I understand your point. You're talking about a Rust rewrite; how on earth can this solve this particular issue?!

I just suggested writing Python bindings for the Rust crate textwrap, for use in polib, to solve this problem. It is a very easy solution that does not involve compilation at installation time.

I understand that some of you are not receptive to the idea, so since I use polib in several of my projects, I'm thinking of rewriting it in Rust with bindings for Python, which would give me, in addition to solving this problem, a considerable performance improvement.

verhovsky changed the title ~~Lines aren't wrapped the same way that xgettext wraps them~~ Lines aren't wrapped the same way that xgettext wraps them when msgid begins with an empty string Nov 12, 2020

verhovsky changed the title ~~Lines aren't wrapped the same way that xgettext wraps them when msgid begins with an empty string~~ Multiline msgids aren't wrapped the same way that xgettext wraps them Nov 13, 2020

izimobil added enhancement help wanted labels Feb 20, 2021

mondeja mentioned this issue Jul 14, 2021

Use Unicode Line Breaking Algorithm wrapping with po2md mondeja/mdpo#153

Closed

izimobil mentioned this issue May 19, 2023

How to wrap the "msgstr" with the specified width? #139

Closed

izimobil mentioned this issue May 30, 2023

[question] How to get same behaviour of po4a in wrap? #140

Open

burner1024 mentioned this issue May 31, 2023

Look into diff noise BGforgeNet/msg2po#10

Open

cacrespo mentioned this issue Dec 11, 2023

script to search for and complete probable 'index entries' python/python-docs-es#2706

Merged
