Discussion:
Word wrapping problem
Lin Wei
2013-03-30 10:48:59 UTC
Hi there,

Word wrapping doesn't work properly when I use LyX with both Chinese and
English.
It seems this bug <http://www.lyx.org/trac/ticket/4635> was reported over
5 years ago but is still open. I'm trying to fix it myself, but I got stuck
locating the source file that handles word wrapping. Could you tell me
which part of the source code deals with word wrapping, and how?

Thx a lot.

Best,
Lin
Richard Heck
2013-03-30 15:04:02 UTC
On 03/30/2013 06:48 AM, Lin Wei wrote:
> Hi there,
>
> Word wrapping doesn't work perfectly when I use LyX with both Chinese
> and English.
> Seems this bug/defect <http://www.lyx.org/trac/ticket/4635> has been
> identified for over 5 years but still left unclosed. I'm trying fix it
> myself but get stuck in locating the source file of word wrapping. So
> I'm wondering if you can tell me which part of the source code and how
> it deals with word wrapping...

I'm not an expert in this part of the code, but I believe most of this
gets done in the calculation of row metrics, in RowMetrics.cpp.

Richard
Lin Wei
2013-04-06 00:35:37 UTC
Thanks a lot :)


Lin Wei
2013-04-06 01:01:37 UTC
By the way, sorry for my late reply...


Lin Wei
2013-04-07 01:25:43 UTC
But...there is actually no RowMetrics.cpp......


Lin Wei
2013-04-07 01:30:15 UTC
Sorry to bother you again. I just want to add more details about the word
wrapping problem <http://www.lyx.org/trac/ticket/4635> I mentioned
before <http://www.mail-archive.com/lyx-***@lists.lyx.org/msg177765.html>.
It looks like this: a huge gap appears where CJK text is mixed with
English, which is really annoying when citing English-language literature.

[image: Inline image 1]

Wondering if anyone is working on this problem. Maybe we can come up with a
solution together.


TO Richard Heck:
I didn't find RowMetrics.cpp. Thanks, though.

Regards,
Lin


Scott Kostyshak
2013-04-07 01:40:24 UTC
On Sat, Apr 6, 2013 at 9:30 PM, Lin Wei <***@gmail.com> wrote:

>
> TO Richard Heck:
> I didn't find RowMetrics.cpp. Thanks, though.
>

Hi Lin Wei,

I know nothing about this, but maybe take a look at
TextMetrics::computeRowMetrics?

src/TextMetrics.cpp:void TextMetrics::computeRowMetrics(pit_type const pit,
src/TextMetrics.h: void computeRowMetrics(pit_type pit, Row & row, int width) const;

Best,

Scott
Lin Wei
2013-04-09 07:25:55 UTC
It seems the following function decides where to break within a paragraph:
TextMetrics.cpp: pos_type TextMetrics::rowBreakPoint(int width, pit_type const pit, pos_type pos) const

But I still haven't figured out how it works. What puzzles me is how the
code finds the exact break point when it reaches the end of the row width
while still in the middle of a word. Say the word "itshardtowrapaword" is
at the end of a line and the position iterator is now pointing to 'p'.
Then we find we are at the end of the line, which means the whole word
needs to be wrapped. How does the code achieve that?

Besides, I think the problem with LyX and Chinese is that it treats a run
of Chinese characters as a single word as long as no space or newline
appears. A main feature of Chinese and Japanese is that they generally
don't use spaces within words or sentences. So a sentence like "***(Bob
2010)***********" would be treated as two words, "***(Bob" and
"2010)***********", and thus wrapped incorrectly if the second so-called
word is too long.

Thanks a lot; I'm looking forward to more updates from you.

By the way, sorry that I don't know the conventions of free software
development, but should I reply only to the mailing list, or also cc
everyone who replied to me?

Best,
Lin


Jean-Marc Lasgouttes
2013-04-09 09:06:05 UTC
09/04/2013 09:25, Lin Wei:
> It seems the following function decides where to break within a paragraph:
> TextMetrics.cpp: pos_type TextMetrics::rowBreakPoint(int width,
> pit_type const pit,pos_type pos) const

Yes.

> But still, I didn't really figure out how it works...What I'm puzzled
> about the codes is how it knows the exact breaking point as it iterates
> to then end of a row width but is still in the middle of a word. Say a
> word "itshardtowrapaword" is at the end of a line and the position
> iterator now pointing to 'p'. Then we find we are now at the end of the
> line, which means the whole words needs to be wrapped. How did the codes
> achieve that?

As far as I understand, there is a variable named "point" that keeps
track of the last possible break point. This is what gets used when the
algorithm realizes that a given word is too long.

> Becides, I think the problem of Lyx with Chinese is that it view all
> Chinese characters as just one word so long as no space/newline appears.
> A main feature of Chinese and Japanese is that they generally don't use
> any space within words or sentences. So a sentence like "***(Bob
> 2010)***********" would be treated as two words, "**(Bob" and
> "2010)***********", thus incorrectly wrapped if the second so-called
> word is too long.

I agree with the analysis, but I do not know what the correct algorithm
is. There is a Qt tool for that
http://doc.qt.digia.com/4.6/qtextboundaryfinder.html
but I am not sure that we can use it directly. It may be possible to use
it on strings between insets and handle insets by ourselves.

> Btw....Sorry that I don't know the convention in developing free
> software, but should I reply only to the mailing list or cc to everyone
> replied me as well?

You can just answer to the list.

Regards,
JMarc
pdv
2013-04-16 21:48:09 UTC

Hi Lin,

Jean-Marc drew my attention to this thread.
I'm working on another problem (the slow scrolling problem), which also
involves the TextMetrics::rowBreakPoint() function.

As Jean-Marc explained, the function keeps track of the last breakpoint
and then tries to add as much of the next word as possible; if there is
still enough space, the breakpoint is moved, otherwise the function
returns the last breakpoint. However, if a very long word is inserted, the
function will break within that word. That's probably what happens if
you insert Chinese (without any spaces).
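
A minimal sketch of the behaviour described above, written in Python rather than LyX's C++ and not taken from TextMetrics.cpp. The names row_break_point and break_rows are hypothetical, and every character is assumed to be one unit wide:

```python
def row_break_point(text, start, max_width):
    """Return the index at which the row starting at `start` should end."""
    used = 0
    point = None  # last position where a break is allowed (after a space)
    for pos in range(start, len(text)):
        if text[pos] == " ":
            # A space is a legal break point; it may hang past the margin.
            point = pos + 1
            used += 1
            continue
        if used + 1 > max_width:
            if point is not None:
                return point  # fall back to the last recorded break point
            # No break point in this row: break inside the long word,
            # but always consume at least one character.
            return max(pos, start + 1)
        used += 1
    return len(text)


def break_rows(text, max_width):
    """Split `text` into rows by calling row_break_point repeatedly."""
    rows, start = [], 0
    while start < len(text):
        end = row_break_point(text, start, max_width)
        rows.append(text[start:end].rstrip())
        start = end
    return rows
```

In mixed text such as "***(Bob 2010)***********", the only recorded break point would be the space, so a row would end early there, which would explain the large gap reported in the ticket.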

If there are no spaces between words, I suppose "one" must understand
the meaning of the words to know where to break? Or is there another way
to find the possible breakpoints?

Regards,

P. De Visschere
Lin Wei
2013-05-05 07:58:29 UTC
I still have some doubts about the word wrap algorithm.

TextMetrics.cpp: pos_type TextMetrics::rowBreakPoint(int width, pit_type const pit, pos_type pos) const

This function seems to iterate until the right margin or the end of the
paragraph is reached. I'm wondering whether it advances one word per pass
or just one character. Furthermore, if it is a word, how does LyX wrap a
long word to the next row?

Thx.

Best,
Lin


Scott Kostyshak
2013-07-19 22:17:11 UTC
On Sun, May 5, 2013 at 3:58 AM, Lin Wei <***@gmail.com> wrote:
> Still have some doubt about the algorithm for word wrap.
>
> TextMetrics.cpp pos_type TextMetrics::rowBreakPoint(int width, pit_type
> const pit, pos_type pos) const
>
> This function seems to iterate through until the right margin or end of par
> is reached. I'm wondering whether it iterates a word per pass or just a
> character. Furthermore, If it is a word, how does LyX wrap a long word to
> the next row?
>

Hi Lin Wei,

Did you make any progress on this? I think this issue just came up again here:
http://www.mail-archive.com/lyx-***@lists.lyx.org/msg96125.html

Best,

Scott
Jean-Marc Lasgouttes
2013-07-19 22:27:22 UTC
Le 20/07/2013 00:17, Scott Kostyshak a écrit :
> Did you make any progress on this? I think this issue just came up again here:
> http://www.mail-archive.com/lyx-***@lists.lyx.org/msg96125.html

Note that I am in the process of rewriting/devastating the code that
breaks text into rows (branch features/str-metrics, not much to see
right now). The solution to the problem may be easier then, but we need
to know the precise rules we are supposed to follow.

JMarc
Cyrille Artho
2013-07-24 08:03:19 UTC
Hi Jean-Marc,
The good news is: You don't need to recognize word boundaries in
Japanese (and I think also in Chinese) text. Just break the text
whenever you are at the end of a line.

So typically text will appear as a rectangular block with some more text
on the final line:

******
******
******
******
***

This makes breaking lines rather simple, but of course that needs to
work within a general framework where other languages (and their
hyphenation rules) are also used.
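
For pure Chinese or Japanese text, the rule above amounts to cutting the string every N characters. A minimal Python sketch, assuming every character has the same width; wrap_block is a hypothetical name:

```python
def wrap_block(text, columns):
    """Cut `text` into rows of `columns` characters (columns must be >= 1)."""
    return [text[i:i + columns] for i in range(0, len(text), columns)]
```

Every row except possibly the last is full, which gives the rectangular block shown above.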


--
Regards,
Cyrille Artho - http://artho.com/
Those who will not reason, are bigots, those who cannot,
are fools, and those who dare not, are slaves.
-- George Gordon Noel Byron
Lin Wei
2013-07-24 07:39:28 UTC
Sorry for the late reply. I've been volunteer teaching in underdeveloped
areas where I can only check my email intermittently.
No real progress. I asked a further question and got no reply, so I
kind of gave up.

Wei Lin


Scott Kostyshak
2013-07-24 08:12:08 UTC
On Wed, Jul 24, 2013 at 3:39 AM, Lin Wei <***@gmail.com> wrote:
> Sorry for late reply. I've been volunteer teaching in undeveloped areas
> where I can only check my email remittently.
> Not really more progress. I asked further question and got no reply....So I
> kind of give up......

Hi Lin Wei,

I'm sorry that you did not get your question addressed. In general,
you should not give up so fast. You should bump your email or ask a
more specific question. My questions on this list are often (very
understandably) ignored. I just keep bugging people :)

In this specific case though maybe you were right to give up because
it seems like a complicated issue. One of the developers is working on
something that might make this easier. Would you be interested in
helping test his branch of the code to see how it works with Chinese?
I imagine this would be very useful because we don't have many Chinese
users around here for testing, but I don't actually know. We could ask
him if you are interested.

Scott
Lin Wei
2013-07-28 02:39:16 UTC
Hi Scott,

Yep. I did realize that this might be a complicated issue after checking the
code and the Qt 4 tool the other developer recommended. I certainly want to
help. It would be really nice if you could put me in touch with that
developer. The only problem is that I might not be able to devote myself
fully, since I will be volunteering for the whole of next year. Is that OK?

Best,
Wei Lin


Scott Kostyshak
2013-07-28 06:05:21 UTC
On Sat, Jul 27, 2013 at 10:39 PM, Lin Wei <***@gmail.com> wrote:
> Hi Scott,
>
> Yep. I did realize that might be a complicated issue after check the code
> and qt4 tool the other developer recommended. Surely I want to help. It
> would be really nice if you could connect me and that developer. The only
> problem is I might not be able to be fully devoted since I would be
> volunteering for the next whole year. Is it fine?

Hi Wei Lin,

Thanks for checking back. I think that now that we understand the
rules for line breaking, it's just a matter of implementation. I could
be wrong, but that's my interpretation.

Best,

Scott
Jean-Marc Lasgouttes
2013-07-24 13:33:39 UTC
24/07/2013 09:39, Lin Wei:
> Sorry for late reply. I've been volunteer teaching in undeveloped areas
> where I can only check my email remittently.
> Not really more progress. I asked further question and got no
> reply....So I kind of give up......

Dear Lin Wei,

I am sorry that I have not been as responsive as necessary. Actually, at
the time I was still trying to understand how the row breaking algorithm
works. Now I have partly rewritten it in branch features/str-metrics
(the method is now named breakRow), with the goal of computing metrics on
whole strings to avoid problems related to ligatures and kerning.

If we forget about insets, the algorithm is just to collect characters
until a space is found and possibly break the row at this point. If I
understand correctly, for Japanese or Chinese one could just break the
row as soon as a character goes beyond the margin. I am wary of
computing width of strings in an iterative way (a, then ab, then abc,
then abcd...). Is it OK in Chinese and Japanese to compute the string
length as sum of character lengths? (that is, are there kernings and
ligatures in these languages?)

Another question is: how do we recognize Chinese and Japanese
characters? I guess they live in particular Unicode ranges.

If things are really complicated, we could choose to rely on Qt's
QTextBoundaryFinder, although this might be more complicated.

Hope this helps.

JMarc
Yihui Xie
2013-07-24 18:55:17 UTC
As far as I know, there are no kernings and ligatures in Chinese. All
Chinese characters are "independent" and of exactly the same width, so
it is OK to calculate the string length by simply counting the number
of characters.

This post might help for the Unicode ranges:
http://stackoverflow.com/questions/1366068/whats-the-complete-range-for-chinese-characters-in-unicode

One issue to keep in mind is that when you deal with a mixture of
Chinese and ASCII characters, different rules should be applied
depending on which characters are on the margin, e.g. suppose
"你好hello" reaches the margin, and you can break the Chinese phrase:

[...]你
好hello[...]

or break between Chinese and English:

[...]你好
hello[...]

or break after English:

[...]你好hello
[...]

but you cannot break the English word like

[...]你好he
llo[...]
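
The four cases above can be sketched as a function that lists the positions where a row may end. This is only an illustration in Python: is_cjk and break_opportunities are hypothetical names, the ranges cover just a small assumed subset of CJK code points, and punctuation rules (for example, not starting a row with a closing bracket) are ignored:

```python
# Assumed subset of CJK ranges: radicals through Extension A,
# CJK Unified Ideographs, and CJK Compatibility Ideographs.
CJK_RANGES = ((0x2E80, 0x4DBF), (0x4E00, 0x9FFF), (0xF900, 0xFAFF))


def is_cjk(ch):
    """Return True if `ch` falls in one of the assumed CJK ranges."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in CJK_RANGES)


def break_opportunities(text):
    """Return every index i such that a row may end just before text[i]."""
    breaks = []
    for i in range(1, len(text)):
        prev, cur = text[i - 1], text[i]
        # Allowed after a space, after a CJK character, or before one;
        # never between two non-CJK, non-space characters.
        if prev == " " or is_cjk(prev) or is_cjk(cur):
            breaks.append(i)
    return breaks
```

For "你好hello" this yields positions 1 and 2 (inside the Chinese phrase and between Chinese and English) and none inside "hello"; breaking after the whole string is always possible and is not listed.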


Regards,
Yihui
--
Yihui Xie <***@gmail.com>
Phone: 206-667-4385 Web: http://yihui.name
Fred Hutchinson Cancer Research Center, Seattle


Guenter Milde
2013-07-24 20:14:18 UTC
On 2013-07-24, Yihui Xie wrote:
> As far as I know, there are no kernings and ligatures in Chinese. All
> Chinese characters are "independent" and of exactly the same width, so
> it is OK to calculate the string length by simply counting the number
> of characters.

> This post might help for the Unicode ranges:
> http://stackoverflow.com/questions/1366068/whats-the-complete-range-for-chinese-characters-in-unicode

I did look up all the CJK characters for a somewhat similar problem in
Docutils and came up with the following list:


# Unicode unifies under the term CJK (Chinese, Japanese, Korean) the
# scripts Han, Bopomofo, Hiragana, Katakana, Hangul, and Yi. These
# scripts use ideographs that do not require spaces between words.

# Sources for determination of the "CJK" property are the `Unicode
# standard Chapter 11 East Asian Scripts`__ describing the CJK unification
# and the Unicode data file ``Scripts.txt``.
# __ http://unicode.org/versions/Unicode4.0.0/ch11.pdf
cjk_characters = (
    u'\u02EA\u02EB'            # Bopomofo modifier letters
    u'\u1100-\u11FF'           # 1100..11FF; Hangul Jamo
    u'\u2E80-\u4DBF'           # 2E80..2EFF; CJK Radicals Supplement
                               # 2F00..2FDF; Kangxi Radicals
                               # 2FF0..2FFF; Ideographic Description Characters
                               # 3000..303F; CJK Symbols and Punctuation
                               # 3040..309F; Hiragana
                               # 30A0..30FF; Katakana
                               # 3100..312F; Bopomofo
                               # 3130..318F; Hangul Compatibility Jamo
                               # 3190..319F; Kanbun
                               # 31A0..31BF; Bopomofo Extended
                               # 31C0..31EF; CJK Strokes
                               # 31F0..31FF; Katakana Phonetic Extensions
                               # 3200..32FF; Enclosed CJK Letters and Months
                               # 3300..33FF; CJK Compatibility
                               # 3400..4DBF; CJK Unified Ideographs Extension A
    u'\u4E00-\uA4CF'           # 4E00..9FFF; CJK Unified Ideographs
                               # A000..A48F; Yi Syllables
                               # A490..A4CF; Yi Radicals
    u'\uA960-\uA97F'           # A960..A97F; Hangul Jamo Extended-A
    u'\uAC00-\uD7FF'           # AC00..D7AF; Hangul Syllables
                               # D7B0..D7FF; Hangul Jamo Extended-B
    u'\uF900-\uFAFF'           # F900..FAFF; CJK Compatibility Ideographs
    u'\uFE30-\uFE4F'           # FE30..FE4F; CJK Compatibility Forms
    u'\uFF00-\uFFEF'           # FF00..FFEF; Halfwidth and Fullwidth Forms
    u'\U0001B000'              # KATAKANA LETTER ARCHAIC E
    u'\U0001B001'              # HIRAGANA LETTER ARCHAIC YE
    u'\U0001F200'              # SQUARE HIRAGANA HOKA
    u'\U00020000-\U0002FA1F'   # 20000..2A6DF; CJK Unified Ideographs Extension B
                               # 2A700..2B73F; CJK Unified Ideographs Extension C
                               # 2B740..2B81F; CJK Unified Ideographs Extension D
                               # 2F800..2FA1F; CJK Compatibility Ideographs Supplement
)

With a regular expression, wrapping could be allowed wherever the text
borders any of the specified characters.
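
One way to sketch the regular-expression idea in Python: the pattern below uses only a subset of the ranges listed above (an assumption made for brevity), and chunks is a hypothetical helper that splits text into pieces that must not be broken internally. A row may then end between any two chunks.

```python
import re

# Assumed subset of the ranges in the list above; extend as needed.
CJK_CLASS = "\u2E80-\u4DBF\u4E00-\uA4CF\uF900-\uFAFF\uFF00-\uFFEF"

# Either a single CJK character, or a run of non-CJK, non-space characters.
CHUNK = re.compile(r"[{0}]|[^{0}\s]+".format(CJK_CLASS))


def chunks(text):
    """Split `text` into unbreakable pieces; ASCII whitespace is dropped."""
    return CHUNK.findall(text)
```

Each CJK character becomes its own chunk, while a Latin word such as "hello" stays in one piece.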

Hope this helps.

Günter
Jean-Marc Lasgouttes
2013-07-24 20:57:29 UTC
Le 24/07/2013 22:14, Guenter Milde a écrit :
> On 2013-07-24, Yihui Xie wrote:
>> As far as I know, there are no kernings and ligatures in Chinese. All
>> Chinese characters are "independent" and of exactly the same width, so
>> it is OK to calculate the string length by simply counting the number
>> of characters.
>
>> This post might help for the Unicode ranges:
>> http://stackoverflow.com/questions/1366068/whats-the-complete-range-for-chinese-characters-in-unicode
>
> I did look up all the CJK characters for a somewhat similar problem in
> Docutils and came up with the following list:

Thanks. BTW, is Korean also affected by these word-breaking rules?

JMarc
Pavel Sanda
2013-07-24 23:09:53 UTC
Jean-Marc Lasgouttes wrote:
> Thanks. BTW, is Korean also affected by these word-breaking rules?

Korean has different breaking rules than CJ.
You might want to look at this page:
http://msdn.microsoft.com/en-us/goglobal/bb688158.aspx
Pavel
Jean-Marc Lasgouttes
2013-07-25 08:11:00 UTC
25/07/2013 01:09, Pavel Sanda:
> Jean-Marc Lasgouttes wrote:
>> Thanks. BTW, is Korean also affected by these word-breaking rules?
>
> Korean has different breaking rules than CJ.
> You might want to look at this page:
> http://msdn.microsoft.com/en-us/goglobal/bb688158.aspx

The best is probably to give up for now :) I don't feel like reading
pages of Unicode specifications to handle languages of which I have no
first-hand knowledge. My world is populated only by latin0 characters :)

Let's try for now to see whether the current code works. Later, there
will be a need to use something like the Qt service already mentioned
to handle Unicode line breaking in all its glory.

JMarc
Kornel Benko
2013-07-25 08:54:29 UTC
Am Donnerstag, 25. Juli 2013 um 10:11:00, schrieb Jean-Marc Lasgouttes <***@lyx.org>
> 25/07/2013 01:09, Pavel Sanda:
> > Jean-Marc Lasgouttes wrote:
> >> Thanks. BTW, is Korean also affected by these word-breaking rules?
> >
> > Korean has different breaking rules than CJ.
> > You might want to look at this page:
> > http://msdn.microsoft.com/en-us/goglobal/bb688158.aspx
>
> The best is probably to give up for now :) I don't feel like reading
> pages of unicode specifications to handle languages for which I have no
> first hand knowledge. My worlds is only populated of latin0 characters :)
>
> Let's try for now to see whether the current code works. Later, there
> will be a need to use something like the Qt service already mentionned
> to handle Unicode line breaking in all its glory.
>
> JMarc

Each Unicode character for Korean displays a syllable. Words are separated by spaces.
As it is now, it looks good, as we break lines on spaces.

This is only IMHO, but as my wife is Korean, I feel confident.

Kornel
Jean-Marc Lasgouttes
2013-07-25 08:57:54 UTC
25/07/2013 10:54, Kornel Benko:
> Each unicode for korean displays a syllable. Words are separated by
> space. As it is
>
> now, it looks good, as we break lines on space.
>
> This is only IMHO, but as my wife is korean, I feel confident.

Thanks for the information.

JMarc
Vincent van Ravesteijn
2013-07-24 19:24:46 UTC
Permalink
Op 24-7-2013 20:55, Yihui Xie schreef:
> As far as I know, there are no kernings and ligatures in Chinese. All
> Chinese characters are "independent" and of exactly the same width, so
> it is OK to calculate the string length by simply counting the number
> of characters.
>
> This post might help for the Unicode ranges:
> http://stackoverflow.com/questions/1366068/whats-the-complete-range-for-chinese-characters-in-unicode
>
> One issue to keep in mind is that when you deal with a mixture of
> Chinese and ASCII characters, different rules should be applied
> depending on which characters are on the margin, e.g. suppose
> "你好hello" reaches the margin, and you can break the Chinese phrase:
>
> [...]你
> 好hello[...]

In this case, the two Chinese characters are two separate words. If the
two characters form a single word, can you then also break in the middle
of the word, i.e. between the two characters?

Vincent
Yihui Xie
2013-07-24 19:39:35 UTC
Permalink
In Chinese typesetting, we do not care if two characters are from the
same word. We break the words whenever they reach the page margin. So
yes, we can break in the middle of a word.

Regards,
Yihui
--
Yihui Xie <***@gmail.com>
Phone: 206-667-4385 Web: http://yihui.name
Fred Hutchinson Cancer Research Center, Seattle


Jean-Marc Lasgouttes
2013-07-24 19:26:13 UTC
Permalink
On 24/07/2013 20:55, Yihui Xie wrote:
> One issue to keep in mind is that when you deal with a mixture of
> Chinese and ASCII characters, different rules should be applied
> depending on which characters are on the margin, e.g. suppose
> "你好hello" reaches the margin, and you can break the Chinese phrase:

Thanks. Do I have to care about Unicode ranges or can I look at the
language only? That is, does it make sense to write 你好hello, with
hello marked as Chinese?

JMarc
Yihui Xie
2013-07-24 19:49:40 UTC
Permalink
In summary, you can break the line in these places:

1. between any two Chinese characters;
2. at the word boundaries of other languages (e.g. spaces in English);

If you mark "hello" as Chinese, is it possible not to break in the
middle of "hello"?
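A minimal sketch of these rules, written in Python rather than LyX's C++ and
not taken from the LyX sources. It classifies by character instead of by
language, so "hello" is never split even if it were marked as Chinese, and it
also allows a break between an ideograph and an adjacent Latin letter. The
names `is_cjk_ideograph` and `break_opportunities` are hypothetical, only two
ideograph blocks are checked, and punctuation rules are ignored:

```python
from typing import List


def is_cjk_ideograph(ch: str) -> bool:
    """Simplified test: CJK Unified Ideographs and Extension A only."""
    cp = ord(ch)
    return 0x4E00 <= cp <= 0x9FFF or 0x3400 <= cp <= 0x4DBF


def break_opportunities(text: str) -> List[int]:
    """Return every index i such that a line may be broken just before text[i]."""
    result = []
    for i in range(1, len(text)):
        prev, cur = text[i - 1], text[i]
        if cur == " ":
            # Never start a row with a space; the break goes after it.
            continue
        # Rule 2: word boundary (a space precedes this character).
        # Rule 1: an ideograph on either side of the gap.
        if prev == " " or is_cjk_ideograph(prev) or is_cjk_ideograph(cur):
            result.append(i)
    return result


print(break_opportunities("你好hello world"))
```

For "你好hello world" the allowed breaks fall between the two ideographs,
between 好 and "hello", and before "world", but nowhere inside "hello".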

Regards,
Yihui
--
Yihui Xie <***@gmail.com>
Phone: 206-667-4385 Web: http://yihui.name
Fred Hutchinson Cancer Research Center, Seattle


Jean-Marc Lasgouttes
2013-07-24 20:49:26 UTC
Permalink
On 24/07/2013 21:49, Yihui Xie wrote:
> In summary, you can break the line in these places:
>
> 1. in the middle of any Chinese characters;
> 2. at the word boundaries of other languages (e.g. spaces in English);

And between English and Chinese even without a separator, right?

I will try to push some experimental code tonight.

> If you mark "hello" as Chinese, is it possible not to break in the
> middle of "hello"?

For now I will act only on language, not characters. We'll see where
that gets us.

JMarc
Yihui Xie
2013-07-24 20:58:17 UTC
Permalink
On Wed, Jul 24, 2013 at 1:49 PM, Jean-Marc Lasgouttes
<***@lyx.org> wrote:
> On 24/07/2013 21:49, Yihui Xie wrote:
>
>> In summary, you can break the line in these places:
>>
>> 1. in the middle of any Chinese characters;
>> 2. at the word boundaries of other languages (e.g. spaces in English);
>
>
> And between English and Chinese even without a separator, right?
>
> I will try to push some experimental code tonight.
>

That is right.


Regards,
Yihui
--
Yihui Xie <***@gmail.com>
Phone: 206-667-4385 Web: http://yihui.name
Fred Hutchinson Cancer Research Center, Seattle
Jean-Marc Lasgouttes
2013-07-24 22:00:15 UTC
Permalink
On 24/07/2013 22:49, Jean-Marc Lasgouttes wrote:
> On 24/07/2013 21:49, Yihui Xie wrote:
>> In summary, you can break the line in these places:
>>
>> 1. in the middle of any Chinese characters;
>> 2. at the word boundaries of other languages (e.g. spaces in English);
>
> And between English and Chinese even without a separator, right?
>
> I will try to push some experimental code tonight.

I give up. I will see later.

JMarc
Abdelrazak Younes
2013-04-07 07:07:35 UTC
Permalink
On 07/04/2013 03:30, Lin Wei wrote:
> Sorry to bother again. Just want to add more details about the word
> wrapping problem <http://www.lyx.org/trac/ticket/4635> I mentioned
> before
> <http://www.mail-archive.com/lyx-***@lists.lyx.org/msg177765.html>.
> It looks like this, huge space between mixture of CJK languages and
> English, which is really annoying when citing English written literatures.
>
> Inline image 1
>
> Wondering if anyone is working on this problem. Maybe we can come up
> with a solution together.

I think there's an option somewhere in the Preferences dialog to
left-justify the text in the LyX screen.

Abdel.

Abdelrazak Younes
2013-04-07 07:08:19 UTC
Permalink
On 07/04/2013 03:30, Lin Wei wrote:
> Sorry to bother again. Just want to add more details about the word
> wrapping problem <http://www.lyx.org/trac/ticket/4635> I mentioned
> before
> <http://www.mail-archive.com/lyx-***@lists.lyx.org/msg177765.html>.
> It looks like this, huge space between mixture of CJK languages and
> English, which is really annoying when citing English written literatures.
>
> Inline image 1
>
> Wondering if anyone is working on this problem. Maybe we can come up
> with a solution together.
>
>
> TO Richard Heck:
> I didn't find RowMetrics.cpp. Thanks, though.

Try ParagraphMetrics.cpp and/or Row.cpp

Abdel.

Lin Wei
2013-04-09 07:19:45 UTC
Permalink
It seems the following function decides where to break within a paragraph:
TextMetrics.cpp: pos_type TextMetrics::rowBreakPoint(int width, pit_type const pit, pos_type pos) const

Still, I haven't really figured out how it works. What puzzles me about
the code is how it knows the exact breaking point when it iterates to the
end of the row width while still in the middle of a word. Say the word
"itshardtowrapaword" is at the end of a line and the position iterator is
now pointing to 'p'. We then find that we are at the end of the line, which
means the whole word needs to be wrapped. How does the code achieve that?
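One common way to handle this case is to remember the last position where a
break was allowed while scanning forward, and to fall back to it once the row
overflows. The following is a generic greedy sketch in Python, not a
description of what `TextMetrics::rowBreakPoint` actually does; the function
names and the one-unit-per-character width are assumptions for illustration:

```python
from typing import List


def row_break_point(text: str, start: int, max_width: int) -> int:
    """Return the index (exclusive) at which the row beginning at `start` ends."""
    last_opportunity = None  # most recent index where a break was allowed
    width = 0
    for i in range(start, len(text)):
        # A break is allowed just before a word that follows a space.
        if i > start and text[i - 1] == " " and text[i] != " ":
            last_opportunity = i
        width += 1  # assumption: every character is exactly one unit wide
        # Trailing spaces may hang past the margin, so only a visible
        # character can trigger the overflow.
        if width > max_width and text[i] != " ":
            if last_opportunity is not None:
                return last_opportunity  # move the whole word to the next row
            return max(i, start + 1)  # no opportunity: force a break here
    return len(text)


def wrap(text: str, max_width: int) -> List[str]:
    """Split `text` into rows by calling row_break_point repeatedly."""
    rows, start = [], 0
    while start < len(text):
        end = row_break_point(text, start, max_width)
        rows.append(text[start:end])
        start = end
    return rows


print(wrap("some text itshardtowrapaword", 20))
```

When the scan overflows in the middle of "itshardtowrapaword", the row ends at
the remembered position just before the word, so the word moves to the next
row as a whole.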

Besides, I think the problem with LyX and Chinese is that it views a run of
Chinese characters as a single word as long as no space or newline appears.
A main feature of Chinese and Japanese is that they generally don't use
spaces within words or sentences. So a sentence like "***(Bob
2010)***********" would be treated as two words, "***(Bob" and
"2010)***********", and thus wrapped incorrectly if the second so-called
word is too long.
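To illustrate this diagnosis, here is a small Python sketch. The sample
sentence is made up and merely stands in for the starred placeholder above.
Splitting on spaces alone yields only two "words", and the second is a long
run that cannot be broken:

```python
# Hypothetical sample: Chinese text around an English citation.
sample = "研究表明(Bob 2010)这个方法非常有效"

# Whitespace-only word splitting sees just two "words".
words = sample.split(" ")
print(words)       # ['研究表明(Bob', '2010)这个方法非常有效']

# The second "word" is the one that gets pushed to the next row as a block.
longest = max(words, key=len)
print(len(longest))
```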

Thanks a lot, and I look forward to more updates from you.

Best,
Lin

