Discussion:
[patch for 2.2] silence iconv warnings
Jürgen Spitzmüller
2014-04-06 10:02:28 UTC
When scrolling through a document while instant-spellchecking is enabled
and Hunspell used, LyX spits out iconv errors if a word appears which is
not in the Hunspell dictionary's encoding. E.g.:

Error returned from iconv
EILSEQ An invalid multibyte sequence has been encountered in the input.
When converting from UCS-4LE to ISO8859-1.
Input: 0xc4 0x3 0x0 0x0 0xcd 0x3 0x0 0x0 0xc0 0x3 0x0 0x0 0xbf 0x3 0x0 0x0
0xc2 0x3 0x0 0x0

Since these messages are not very informative and also rather frightening,
the attached patch attempts to catch the error and output something more
understandable in debug mode.

I am not very familiar with iconv/unicode and exception handling, so I
would appreciate a critical review.

Thanks
Jürgen
Georg Baum
2014-04-08 20:42:28 UTC
Jürgen Spitzmüller wrote:

> I am not very familiar with iconv/unicode and exception handling, so I
> would appreciate a critical review.

The change in src/support/unicode.cpp is problematic: It disables all error
handling, not only the lyxerr output. Also, if you now throw an exception
there, you need to make sure that all callers can cope with that. Maybe one
solution would be to throw the exception at the very end (after the error
handling), and give it exactly the error message which is now written to
lyxerr. Then each caller can decide what it wants to do with the error
message.

I would also propose to treat an encoding error as a spelling error: If the
word can't be encoded in the dictionary of the current language, then it
can't be correct, since we assume that Hunspell does not choose an encoding
for the dictionary of a certain language which does not cover all words of
that language.
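A minimal sketch of the first suggestion (illustrative names, not the actual LyX code): the existing diagnostic output stays intact, and only afterwards is the same message thrown, so each caller can decide how to react:

```cpp
#include <iostream>
#include <stdexcept>
#include <string>

// Hypothetical exception type carrying the iconv error message.
class IconvException : public std::runtime_error {
public:
	explicit IconvException(std::string const & msg)
		: std::runtime_error(msg) {}
};

// Stand-in for the tail of the error path in unicode.cpp: the
// lyxerr-style report is printed as before, then the same message
// is thrown for the caller to handle.
void handleIconvError(std::string const & from, std::string const & to)
{
	std::string const msg =
		"Error returned from iconv when converting from " + from
		+ " to " + to;
	std::cerr << msg << std::endl; // existing diagnostic output stays
	throw IconvException(msg);     // callers decide what to do with it
}
```

A caller in the spellchecker code could then catch IconvException and, following the second suggestion, report the word as misspelled instead of printing the raw iconv dump.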



Georg
Jürgen Spitzmüller
2014-04-09 06:14:47 UTC
2014-04-08 22:42 GMT+02:00 Georg Baum:

> The change in src/support/unicode.cpp is problematic: It disables all error
> handling, not only the lyxerr output. Also, if you now throw an exception
> there, you need to make sure that all callers can cope with that. Maybe one
> solution would be to throw the exception at the very end (after the error
> handling), and give it exactly the error message which is now written to
> lyxerr. Then each caller can decide what it wants to do with the error
> message.
>

Thanks. I feared that. I put the exception that early in order to suppress
the lyxerr message. In this case we need to audit all callers. Will
postpone this.


>
> I would also propose to treat an encoding error as a spelling error: If the
> word can't be encoded in the dictionary of the current language, then it
> can't be correct, since we assume that Hunspell does not choose an encoding
> for the dictionary of a certain language which does not cover all words of
> that language.
>

But then, with instant spellchecker, the word will be underlined and the
user can not change that.

Jürgen


>
>
>
> Georg
>
>
>
Stephan Witt
2014-04-09 06:40:17 UTC
On 09.04.2014 at 08:14, Jürgen Spitzmüller <***@lyx.org> wrote:

> 2014-04-08 22:42 GMT+02:00 Georg Baum:
> The change in src/support/unicode.cpp is problematic: It disables all error
> handling, not only the lyxerr output. Also, if you now throw an exception
> there, you need to make sure that all callers can cope with that. Maybe one
> solution would be to throw the exception at the very end (after the error
> handling), and give it exactly the error message which is now written to
> lyxerr. Then each caller can decide what it wants to do with the error
> message.
>
> Thanks. I feared that. I put the exception that early in order to suppress the lyxerr message. In this case we need to audit all callers. Will postpone this.
>
>
> I would also propose to treat an encoding error as a spelling error: If the
> word can't be encoded in the dictionary of the current language, then it
> can't be correct, since we assume that Hunspell does not choose an encoding
> for the dictionary of a certain language which does not cover all words of
> that language.
>
> But then, with instant spellchecker, the word will be underlined and the user can not change that.

Can you provide an example, please? If the word cannot be converted to the Hunspell dictionary's encoding,
the dictionary is broken or the language is not correct, isn't it?

You're right, the user does not have many options to get rid of the misspelled marker.
S(he) can change the language of the word or add it to the personal word list for the language.
The personal word list uses UTF-8, it should be possible to store it there.

Stephan
Jürgen Spitzmüller
2014-04-09 06:53:21 UTC
2014-04-09 8:40 GMT+02:00 Stephan Witt <***@gmx.net>:

> Can you provide an example, please? If the word cannot be converted to
> Hunspell dictionary encoding
> the dictionary is broken or the language is not correct, isn't it?
>

Depends on how you define "language". Think of names.


> You're right, the user has not many options to get rid of the misspelled
> marker.
> S(he) can change the language of the word or add it to the personal word
> list for the language.
> The personal word list uses UTF-8, it should be possible to store it there.
>

Maybe. Didn't test.

Jürgen


>
> Stephan
Stephan Witt
2014-04-09 08:59:34 UTC
On 09.04.2014 at 08:53, Jürgen Spitzmüller <***@lyx.org> wrote:

> 2014-04-09 8:40 GMT+02:00 Stephan Witt <***@gmx.net>:
> Can you provide an example, please? If the word cannot be converted to Hunspell dictionary encoding
> the dictionary is broken or the language is not correct, isn't it?
>
> Depends on how you define "language". Think of names.

That's a good example. So, my parents are from Hungary and named me István.
Let's assume the á isn't valid in the German ISO encoding. Then
* I can change my name to Stephan - to avoid having to spell my name on every formal occasion
* if I don't like that, I can add István to my "German" personal word list (I didn't test it either)
* or I can change the language of the word "István" to Hungarian
* or I have to live with the red misspelled marker

It's not me, BTW :) It's only a fake on purpose.

Stephan

> You're right, the user has not many options to get rid of the misspelled marker.
> S(he) can change the language of the word or add it to the personal word list for the language.
> The personal word list uses UTF-8, it should be possible to store it there.
>
> Maybe. Didn't test.
Jürgen Spitzmüller
2014-04-09 09:12:54 UTC
2014-04-09 10:59 GMT+02:00 Stephan Witt <***@gmx.net>:

>
> That's a good example. So, my parents are from Hungary and named me
> István.
> Let's assume the á isn't valid in german iso encoding. Then
> * I can change my name to Stephan - to avoid to spell my name on every
> formal occasion
> * if I don't like that I can add István to my "german" personal word list
> (I didn't test it either)
> * or I can change the language of the word "István" to hungarian
> * or I have to live with the red misspelled marker
>

This is mixing languages with writing systems, IMHO. In fact language
sometimes has an implication on the spelling of names (if it comes to
transliteration), but with rather surprising effects. For instance, the
Russian name Воло́шинов is usually written Vološinov in German, but
Voloshinov in English. Is "š" a "German" character?

Also, I think that marking István as "Hungarian" reduces the language
concept to absurdity.

More technically, I think it will be irritating for users that they can add
"István" to the personal dictionary, while "Ignore" and "Ignore all" just
won't work.

Jürgen


>
> It's not me, BTW :) It's only a fake on purpose.
Stephan Witt
2014-04-09 20:45:51 UTC
On 09.04.2014 at 11:12, Jürgen Spitzmüller <***@lyx.org> wrote:

> 2014-04-09 10:59 GMT+02:00 Stephan Witt <***@gmx.net>:
>
> That's a good example. So, my parents are from Hungary and named me István.
> Let's assume the á isn't valid in german iso encoding. Then
> * I can change my name to Stephan - to avoid to spell my name on every formal occasion
> * if I don't like that I can add István to my "german" personal word list (I didn't test it either)
> * or I can change the language of the word "István" to hungarian
> * or I have to live with the red misspelled marker
>
> This is mixing languages with writing systems, IMHO. In fact language sometimes has an implication on the spelling of names (if it comes to transliteration), but with rather surpring effects. For instance, the Russian name Воло́шинов is usually written Vološinov in German, but Voloshinov in English. Is "š" a "German" character?

I'm not a linguist and my knowledge about these things is limited.
The change of language is the only possibility I know of to get out
of the "broken" dictionary encoding scenario.

> Also, I think that marking István as "Hungarian" reduces the language concept to absurdity.
>
> More technically, I think it will be irritating for users that they can add "István" to the personal dictionary, while "Ignore" and "Ignore all" just won't work.

Yes, I agree.

With the given example "István" and having á in the dictionary encoding,
the word is most probably marked as misspelled. But then it's possible to
ignore it? Isn't there the option to silently discard the characters that
cannot be converted, or to replace them with something similar for the
dictionary lookup? Not quite correct, I know - but perhaps the better
strategy for the user?

Stephan
Cyrille Artho
2014-04-09 23:02:50 UTC
Usually given names are not in a language dictionary, although many
(translation) services have separate dictionaries for proper/given names.

We have two problems here:

(1) Language: I think most users are OK with proper names not being
accepted by the spell checker (before learning them). However, other
options such as "Ignore" should work, too.

(2) Encoding: Words with characters that are not part of the normal
character set of a given language should behave in the same way as words
that are. This includes "István", "Vološinov", etc. So we have to use UTF-8
to look up words.

When down-converting text to the character set of the target language, we
can ignore non-convertible characters silently, but

echo 'István' | iconv -c -f utf-8 -t ascii

yields "Istvn", which is not very useful.
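The "replace with something similar" idea mentioned earlier can be tried with iconv's //TRANSLIT extension. Note this is a glibc extension; the exact substitution ("Istvan", "Istv?n", ...) depends on the iconv implementation and the locale, so one should not rely on a specific result:

```shell
# //TRANSLIT asks iconv to approximate non-convertible characters
# instead of failing or dropping them (glibc; result is locale- and
# implementation-dependent):
echo 'István' | iconv -f utf-8 -t ascii//TRANSLIT
```

Even so, a transliterated form would only be useful for the dictionary lookup, not for what the user sees in the document.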

I think we have to use Unicode for all the given operations and (a) either
risk a mismatch for each word that is not learned/ignored, or (b)
up-convert words in the dictionary before they are matched. The latter
solution implies that the dictionary tool supports this; does anyone know
if that is the case (for at least one tool)?

>>
>> This is mixing languages with writing systems, IMHO. In fact language
>> sometimes has an implication on the spelling of names (if it comes to
>> transliteration), but with rather surprising effects. For instance, the
>> Russian name Воло́шинов is usually written Vološinov in German, but
>> Voloshinov in English. Is "š" a "German" character?
>
> I'm not a linguist and my knowledge about these things is limited. The
> change of language is the only possibility I know of to get out of the
> "broken" dictionary encoding scenario.
>
>> Also, I think that marking István as "Hungarian" reduces the language
>> concept to absurdity.
>>
>> More technically, I think it will be irritating for users that they
>> can add "István" to the personal dictionary, while "Ignore" and
>> "Ignore all" just won't work.
>
> Yes, I agree.
>
> With the given example "István" and having á in the dictionary encoding
> the word is most probably marked as misspelled. But then it's possible to
> Ignore it? Isn't there the option to discard the characters that cannot
> be converted silently or replace them with something similar for the
> dictionary lookup? Not quite correct, I know - but perhaps the better
> strategy for the user?
>
> Stephan
>

--
Regards,
Cyrille Artho - http://artho.com/
Perilous to all of us are the devices of an art deeper than we
ourselves possess.
-- Gandalf the Grey [Tolkien, "Lord of the Rings"]
Jürgen Spitzmüller
2014-04-10 06:53:36 UTC
2014-04-10 1:02 GMT+02:00 Cyrille Artho <***@aist.go.jp>:

> I think we have to use Unicode for all the given operations and (a) either
> risk a mismatch for each word that is not learned/ignored, or (b)
> up-convert words in the dictionary before they are matched. The latter
> solution implies that the dictionary tool supports this; does anyone know
> if that is the case (for at least one tool)?


This is the problem here. Hunspell dictionaries are often not
unicode-encoded. So we are stuck with non-unicode encodings.

Jürgen
Cyrille Artho
2014-04-10 06:57:06 UTC
How is the call to iconv implemented? On the application level, the
interface is probably not flexible enough; it is easy to ignore
non-convertible characters, but they are just removed from the output.

The C library interface is probably richer. Is it possible to convert text
word by word, find out which words are not convertible, and ignore the
dictionary for those words? (The user could still choose to ignore/add them
to the custom dictionary.)

Maybe this requires a different way of integrating spell checkers?
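To illustrate the C interface: a word-level convertibility probe along these lines is possible (a sketch using iconv(3) directly, not LyX code; on glibc iconv is part of libc, elsewhere -liconv may be needed):

```cpp
#include <iconv.h>
#include <cstddef>
#include <string>

// Returns true if `word` (UTF-8) is fully representable in the target
// encoding, e.g. the Hunspell dictionary encoding. Error handling is
// deliberately minimal for the sketch.
bool convertible(std::string const & word, char const * encoding)
{
	iconv_t cd = iconv_open(encoding, "UTF-8");
	if (cd == (iconv_t)(-1))
		return false; // unknown encoding: treat as not convertible
	std::string in = word;
	char * inbuf = &in[0];
	size_t inleft = in.size();
	std::string out(4 * word.size() + 4, '\0');
	char * outbuf = &out[0];
	size_t outleft = out.size();
	// iconv() returns (size_t)-1 with EILSEQ if some input sequence
	// has no representation in the target encoding.
	size_t const res = iconv(cd, &inbuf, &inleft, &outbuf, &outleft);
	iconv_close(cd);
	return res != (size_t)(-1);
}
```

A spellchecker frontend could call something like this before the backend lookup and skip (or specially mark) words for which it returns false.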

Jürgen Spitzmüller wrote:
> 2014-04-10 1:02 GMT+02:00 Cyrille Artho <***@aist.go.jp
> <mailto:***@aist.go.jp>>:
>
> I think we have to use Unicode for all the given operations and (a)
> either risk a mismatch for each word that is not learned/ignored, or
> (b) up-convert words in the dictionary before they are matched. The
> latter solution implies that the dictionary tool supports this; does
> anyone know if that is the case (for at least one tool)?
>
>
> This is the problem here. Hunspell dictionaries are often not
> unicode-encoded. So we are stuck with non-unicode encodings.
>
> Jürgen

--
Regards,
Cyrille Artho - http://artho.com/
No problem is so formidable that you can't just walk away from it.
-- C. Schulz
Stephan Witt
2014-04-10 07:17:55 UTC
On 10.04.2014 at 08:57, Cyrille Artho <***@aist.go.jp> wrote:

> How is the call to iconv implemented? On the application level, the interface is probably not flexible enough; it is easy to ignore non-convertible characters, but they are just removed from the output.
>
> The C library interface is probably richer. Is it possible to convert text word by word, find out which words are not convertible, and ignore the dictionary for those words? (The user could still choose to ignore/add them to the custom dictionary.)

The ignore operation is part of the spell checker API.

You have to use the dictionary encoding for it, IMO.
At least it is safe to do so. The behavior for not
using the dictionary encoding when adding words at run
time is not documented.

> Maybe this requires a different way of integrating spell checkers?

Ideally the dictionaries should be converted to UTF-8, IMHO.
But they are not provided by the LyX developers.

Stephan

> Jürgen Spitzmüller wrote:
>> 2014-04-10 1:02 GMT+02:00 Cyrille Artho <***@aist.go.jp
>> <mailto:***@aist.go.jp>>:
>>
>> I think we have to use Unicode for all the given operations and (a)
>> either risk a mismatch for each word that is not learned/ignored, or
>> (b) up-convert words in the dictionary before they are matched. The
>> latter solution implies that the dictionary tool supports this; does
>> anyone know if that is the case (for at least one tool)?
>>
>>
>> This is the problem here. Hunspell dictionaries are often not
>> unicode-encoded. So we are stuck with non-unicode encodings.
>>
>> Jürgen
>
> --
> Regards,
> Cyrille Artho - http://artho.com/
> No problem is so formidable that you can't just walk away from it.
> -- C. Schulz
JeanMarc Lasgouttes
2014-04-10 08:11:09 UTC
There is something that I do not understand. If the word is not representable in the German dictionary, presumably it is not part of the language. "Lasgoittes" is perfectly representable in any latin encoding and yet the spell-checker will mark it as misspelled. Why should it be different for a name with weird accents?

JMarc

On 10 April 2014 08:53:36 UTC+02:00, "Jürgen Spitzmüller" <***@lyx.org> wrote:
>2014-04-10 1:02 GMT+02:00 Cyrille Artho <***@aist.go.jp>:
>
>> I think we have to use Unicode for all the given operations and (a)
>either
>> risk a mismatch for each word that is not learned/ignored, or (b)
>> up-convert words in the dictionary before they are matched. The
>latter
>> solution implies that the dictionary tool supports this; does anyone
>know
>> if that is the case (for at least one tool)?
>
>
>This is the problem here. Hunspell dictionaries are often not
>unicode-encoded. So we are stuck with non-unicode encodings.
>
>Jürgen
Jürgen Spitzmüller
2014-04-10 12:14:15 UTC
2014-04-10 10:11 GMT+02:00 JeanMarc Lasgouttes <***@lyx.org>:

> There is something that I do not understand. If the word is not
> representable in the German dictionary, presumably it is not part of the
> language. "Lasgoittes" is perfectly representable in any latin encoding and
> yet the spell-checker will mark it as misspelled. Why should it be
> different for a name with weird accents?
>

The point is that users cannot do something sensible with such marked words
(except for adding them into the personal dictionary).

Actually, I tend to convert all hunspell dictionaries to utf8. This seems
the only proper solution to this problem.
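Such a conversion is mechanical in principle: re-encode the .dic and .aff files and update the declared encoding. A sketch, demonstrated here on a tiny fake dictionary (a real one would be a pair like de_DE.dic/de_DE.aff, and may need more care):

```shell
# The dictionary's encoding is declared by the SET line in the .aff file.
printf 'SET ISO8859-1\n' > demo.aff
printf '1\nM\344dchen\n' > demo.dic   # "Mädchen" in Latin-1
# Re-encode both files and update the declared encoding:
iconv -f ISO8859-1 -t UTF-8 demo.dic > demo.utf8.dic
iconv -f ISO8859-1 -t UTF-8 demo.aff |
    sed 's/^SET ISO8859-1$/SET UTF-8/' > demo.utf8.aff
```

The catch, as noted below, is that LyX does not ship the dictionaries, so this would have to happen at install time or upstream.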

Jürgen
Jean-Marc Lasgouttes
2014-04-10 12:18:21 UTC
10/04/2014 14:14, Jürgen Spitzmüller:
> There is something that I do not understand. If the word is not
> representable in the German dictionary, presumably it is not part of
> the language. "Lasgoittes" is perfectly representable in any latin
> encoding and yet the spell-checker will mark it as misspelled. Why
> should it be different for a name with weird accents?
>
>
> The point is that users cannot do something sensible with such marked
> words (except for adding them into the personal dictionary).

Sure, but the same holds for "Lasgouttes", doesn't it?

JMarc
Jürgen Spitzmüller
2014-04-10 12:29:54 UTC
2014-04-10 14:18 GMT+02:00 Jean-Marc Lasgouttes <***@lyx.org>:

>
> The point is that users cannot do something sensible with such marked
>> words (except for adding them into the personal dictionary).
>>
>
> Sure, but the same holds for "Lasgouttes", doesn't it?
>

No, if the encoding fits, I can hit "Ignore all" and only ignore you (or
your name's spelling, for that matter) in the current document (which is
what I do for names usually, except for very recurrent names). If the
encoding does not fit, hitting "Ignore all" just would not work. I think we
would need to at least disable the ignore all button/menu entry in that
case, otherwise users would rightly complain about that bug (they would
also, probably, not understand why the function is disabled for specific
names.).

So, to sum up: I agree with all of you that strings from non-matching
encodings should be marked as unknown, but only if we can provide sensible
action.

Jürgen

BTW German hunspell suggests "Ausgelastet" for "Lasgouttes", which means
"fully occupied" or "snowed under with work".


>
> JMarc
>
>
Jean-Marc Lasgouttes
2014-04-10 12:42:33 UTC
10/04/2014 14:29, Jürgen Spitzmüller:
> No, if the encoding fits, I can hit "Ignore all" and only ignore you (or
> your name's spelling, for that matter) in the current document (which is
> what I do for names usually, except for very recurrent names). If the
> encoding does not fit, hitting "Ignore all" just would not work. I think
> we would need to at least disable the ignore all button/menu entry in
> that case, otherwise users would rightly complain about that bug (they
> would also, probably, not understand why the function is disabled for
> specific names.).

I see.


> BTW German hunspell suggests "Ausgelastet" for "Lasgouttes", which means
> "fully occupied" or "snowed with work".

It is not so bad for a word that really does not look like the original.
Does Hunspell know me or what?

JMarc
Cyrille Artho
2014-04-10 23:36:20 UTC
I agree that it would be good to have all dictionaries in utf-8, but I'm
not sure if this is feasible for a typical user/installation.

Another option would be for LyX to tokenize the text and forward it word by
word to the spell checker.

This way, we could handle "Ignore All" in LyX itself rather than let the
spell checker ignore the word. LyX would never forward ignored words to the
spell checker but all the remaining words would be handled by the spell
checker.
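Sketched in code, that idea is just a session-level filter in front of the backend (hypothetical names, not LyX's actual classes):

```cpp
#include <set>
#include <string>

// "Ignore all" is recorded on the LyX side, in UTF-8, so it works even
// for words that the dictionary encoding cannot represent.
class IgnoredWords {
public:
	void ignoreAll(std::string const & word) { words_.insert(word); }
	// Only words not in the ignore set are forwarded to the backend.
	bool shouldForward(std::string const & word) const
	{
		return words_.count(word) == 0;
	}
private:
	std::set<std::string> words_; // UTF-8 keys
};
```

Since the set lives outside the spell checker, the dictionary encoding never comes into play for ignored words.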

Jürgen Spitzmüller wrote:
> 2014-04-10 14:18 GMT+02:00 Jean-Marc Lasgouttes <***@lyx.org
> <mailto:***@lyx.org>>:
>
>
> The point is that users cannot do something sensible with such marked
> words (except for adding them into the personal dictionary).
>
>
> Sure, but the same holds for "Lasgouttes", doesn't it?
>
>
> No, if the encoding fits, I can hit "Ignore all" and only ignore you (or
> your name's spelling, for that matter) in the current document (which is
> what I do for names usually, except for very recurrent names). If the
> encoding does not fit, hitting "Ignore all" just would not work. I think we
> would need to at least disable the ignore all button/menu entry in that
> case, otherwise users would rightly complain about that bug (they would
> also, probably, not understand why the function is disabled for specific
> names.).
>
> So, to sum up: I agree with all of you that strings from non-matching
> encodings should be marked as unknown, but only if we can provide sensible
> action.
>
> Jürgen
>
> BTW German hunspell suggests "Ausgelastet" for "Lasgouttes", which means
> "fully occupied" or "snowed with work".
>
>
> JMarc
>
>

--
Regards,
Cyrille Artho - http://artho.com/
The opposite of a correct statement is a false statement. But the
opposite of a profound truth may well be another profound truth.
-- Niels Bohr
Cyrille Artho
2014-04-10 23:40:04 UTC
Regarding the idea I just mentioned before, there is a major flaw.

Many Asian languages do not use spaces between words. Tokenizing such a text
into words requires a dictionary and is a non-trivial problem (due to
inflection: different verb forms need to be recognized, etc.). We can
therefore not just scan for whitespace and forward anything in between to a
spell checker, unless we restrict that workaround to Western languages.

(Unfortunately we use gmail, which filters out my own messages on mailing
lists, so I can't reply to my own message...)
--
Regards,
Cyrille Artho - http://artho.com/
The opposite of a correct statement is a false statement. But the
opposite of a profound truth may well be another profound truth.
-- Niels Bohr
Vincent van Ravesteijn
2014-04-11 06:10:30 UTC
On Fri, Apr 11, 2014 at 1:40 AM, Cyrille Artho <***@aist.go.jp> wrote:
> Regarding the idea I just mentioned before, there is a major flaw.
>
> Asian languages do not have spaces. Tokenizing a text into words requires a
> dictionary and is a non-trivial problem (due to inflection: different verb
> forms need to be recognized, etc.). We can therefore not just scan for
> whitespaces and forward anything in between to a spell checker, unless we
> restrict that workaround to Western languages.
>

Are there spellcheckers for e.g. Chinese? It sounds a bit
contradictory, as they don't have any "spelling". Of course there are
words consisting of multiple characters, but these characters can also
be used on their own.

Vincent
Stephan Witt
2014-04-11 06:23:02 UTC
On 11.04.2014 at 01:36, Cyrille Artho <***@aist.go.jp> wrote:

> I agree that it would be good to have all dictionaries in utf-8, but I'm not sure if this is feasible for a typical user/installation.
>
> Another option would be for LyX to tokenize the text and forward it word by word to the spell checker.

That's the way the hunspell and aspell spell checker backends work.

For Mac builds there is another one - the "native" OS service for spell checking.
The latter passes the complete paragraph to the spell checker engine.
This results in a) improved performance and b) better results because of
the built-in automatic language detection. So there are fewer false positives.

The paragraph passing mode can be used for languages without easily detectable
word boundaries. Perhaps that way LyX is already able to spell check Chinese text
on Mac. I never tried that and I'm unable to judge the result.

Stephan

>
> This way, we could handle "Ignore All" in LyX itself rather than let the spell checker ignore the word. LyX would never forward ignored words to the spell checker but all the remaining words would be handled by the spell checker.
>
> Jürgen Spitzmüller wrote:
>> 2014-04-10 14:18 GMT+02:00 Jean-Marc Lasgouttes <***@lyx.org
>> <mailto:***@lyx.org>>:
>>
>>
>> The point is that users cannot do something sensible with such marked
>> words (except for adding them into the personal dictionary).
>>
>>
>> Sure, but the same holds for "Lasgouttes", doesn't it?
>>
>>
>> No, if the encoding fits, I can hit "Ignore all" and only ignore you (or
>> your name's spelling, for that matter) in the current document (which is
>> what I do for names usually, except for very recurrent names). If the
>> encoding does not fit, hitting "Ignore all" just would not work. I think we
>> would need to at least disable the ignore all button/menu entry in that
>> case, otherwise users would rightly complain about that bug (they would
>> also, probably, not understand why the function is disabled for specific
>> names.).
>>
>> So, to sum up: I agree with all of you that strings from non-matching
>> encodings should be marked as unknown, but only if we can provide sensible
>> action.
>>
>> Jürgen
>>
>> BTW German hunspell suggests "Ausgelastet" for "Lasgouttes", which means
>> "fully occupied" or "snowed with work".
>>
>>
>> JMarc
>>
>>
>
> --
> Regards,
> Cyrille Artho - http://artho.com/
> The opposite of a correct statement is a false statement. But the
> opposite of a profound truth may well be another profound truth.
> -- Niels Bohr
Jean-Marc Lasgouttes
2014-04-11 08:23:40 UTC
11/04/2014 08:23, Stephan Witt:
> Am 11.04.2014 um 01:36 schrieb Cyrille Artho <***@aist.go.jp>:
>
>> I agree that it would be good to have all dictionaries in utf-8, but I'm not sure if this is feasible for a typical user/installation.
>>
>> Another option would be for LyX to tokenize the text and forward it word by word to the spell checker.
>
> That's the way the hunspell and aspell spell checker backends work.
>
> For Mac builds there is another one - the "native" OS service for spell checking.
> The latter passes the complete paragraph to the spell checker engine.
> This results in a) an improved performance and b) better results because of
> the builtin automatic language detection. So there are less false positives.

So do you mean that if I write in an English text "somme" instead of
"some", it will be considered an OK word because "somme" exists in
French? Is that supposed to be a feature?

Recently I did some proofreading of a paper written with TeXShop (a LaTeX
editor for Mac). It turned out that the text was peppered with French
words. From what I understand, this horror was the joint work of automatic
correction and automatic language detection :(

JMarc
Vincent van Ravesteijn
2014-04-11 08:31:02 UTC
> So do you mean that if I write in an English text "somme" instead of "some",
> it will be considered an OK word because "somme" exists in French? Is that
> supposed to be a feature?
>

I guess that the language guessing is done for the whole paragraph....

> Recently I did some proofreading of a paper written with TeXShop (LaTeX
> editor for Mac). It turned out that the text was peppered with french words.
> From what I understand, this horror was a joint work of automatic correction
> and automatic language detection :(

I always thought that the French did this on purpose to not surrender
to the fact that the English language dominates the world.

Vincent
Jean-Marc Lasgouttes
2014-04-11 08:43:10 UTC
11/04/2014 10:31, Vincent van Ravesteijn:
>> Recently I did some proofreading of a paper written with TeXShop (LaTeX
>> editor for Mac). It turned out that the text was peppered with french words.
>> From what I understand, this horror was a joint work of automatic correction
>> and automatic language detection :(
>
> I always thought that the French did this on purpose to not surrender
> to the fact that the English language dominates the world.

In this case, it was just some evil North American programmer trying to
undermine our international credibility. Now you understand what we have
to endure.

JMarc
Cyrille Artho
2014-04-11 09:06:35 UTC
> I always thought that the French did this on purpose to not surrender
> to the fact that the English language dominates the world.
>
> Vincent
>
They did this to the keyboard layout, too. If you ever tried to use a
computer in a French Internet cafe, good luck typing your password (even
your username will be a challenge to type)! ;-)
--
Regards,
Cyrille Artho - http://artho.com/
Those who will not reason, are bigots, those who cannot,
are fools, and those who dare not, are slaves.
-- George Gordon Noel Byron
Jean-Marc Lasgouttes
2014-04-11 09:18:36 UTC
11/04/2014 11:06, Cyrille Artho:
>> I always thought that the French did this on purpose to not surrender
>> to the fact that the English language dominates the world.
>>
>> Vincent
>>
> They did this to the keyboard layout, too. If you ever tried to use a
> computer in a French Internet cafe, good luck typing your password (even
> your username will be a challenge to type)! ;-)

There is something worse than that: trying to program on a Mac with a
French keyboard layout. Characters like \, [, ] or | require the
Shift+Option modifier. They probably did not have French coders working
there.

JMarc
Stephan Witt
2014-04-11 09:23:07 UTC
On 11.04.2014 at 10:23, Jean-Marc Lasgouttes <***@lyx.org> wrote:

> 11/04/2014 08:23, Stephan Witt:
>> Am 11.04.2014 um 01:36 schrieb Cyrille Artho <***@aist.go.jp>:
>>
>>> I agree that it would be good to have all dictionaries in utf-8, but I'm not sure if this is feasible for a typical user/installation.
>>>
>>> Another option would be for LyX to tokenize the text and forward it word by word to the spell checker.
>>
>> That's the way the hunspell and aspell spell checker backends work.
>>
>> For Mac builds there is another one - the "native" OS service for spell checking.
>> The latter passes the complete paragraph to the spell checker engine.
>> This results in a) an improved performance and b) better results because of
>> the builtin automatic language detection. So there are less false positives.
>
> So do you mean that if I write in an English text "somme" instead of "some", it will be considered an OK word because "somme" exists in French? Is that supposed to be a feature?


Indeed. Following is the debug output of text input while instant-spellchecking is enabled:

AppleSpellChecker.cpp (95): spellCheck: "so" = OK, lang = en_US
AppleSpellChecker.cpp (95): spellCheck: "som" = FAILED, lang = en_US
Paragraph.cpp (4115): misspelled word: "som" [518..520]
AppleSpellChecker.cpp (95): spellCheck: "somm" = FAILED, lang = en_US
Paragraph.cpp (4115): misspelled word: "somm" [518..521]
AppleSpellChecker.cpp (95): spellCheck: "somme" = OK, lang = en_US
AppleSpellChecker.cpp (95): spellCheck: "somme " = OK, lang = en_US

Stephan
Jean-Marc Lasgouttes
2014-04-11 09:56:13 UTC
11/04/2014 11:23, Stephan Witt:
>> So do you mean that if I write in an English text "somme" instead
>> of "some", it will be considered an OK word because "somme" exists
>> in French? Is that supposed to be a feature?
>
> Indeed. Following is the debug output of text input while
> instant-spellchecking is enabled:
>
> AppleSpellChecker.cpp (95): spellCheck: "so" = OK, lang = en_US
> AppleSpellChecker.cpp (95): spellCheck: "som" = FAILED, lang = en_US
> Paragraph.cpp (4115): misspelled word: "som" [518..520]
> AppleSpellChecker.cpp (95): spellCheck: "somm" = FAILED, lang =
> en_US Paragraph.cpp (4115): misspelled word: "somm" [518..521]
> AppleSpellChecker.cpp (95): spellCheck: "somme" = OK, lang = en_US
> AppleSpellChecker.cpp (95): spellCheck: "somme " = OK, lang = en_US

Is there a way to avoid this "feature"?

JMarc
Stephan Witt
2014-04-11 10:01:01 UTC
Permalink
Am 11.04.2014 um 11:56 schrieb Jean-Marc Lasgouttes <***@lyx.org>:

> 11/04/2014 11:23, Stephan Witt:
>>> So do you mean that if I write in an English text "somme" instead
>>> of "some", it will be considered an OK word because "somme" exists
>>> in French? Is that supposed to be a feature?
>>
>> Indeed. Following is the debug output of text input while
>> instant-spellchecking is enabled:
>>
>> AppleSpellChecker.cpp (95): spellCheck: "so" = OK, lang = en_US
>> AppleSpellChecker.cpp (95): spellCheck: "som" = FAILED, lang = en_US
>> Paragraph.cpp (4115): misspelled word: "som" [518..520]
>> AppleSpellChecker.cpp (95): spellCheck: "somm" = FAILED, lang =
>> en_US Paragraph.cpp (4115): misspelled word: "somm" [518..521]
>> AppleSpellChecker.cpp (95): spellCheck: "somme" = OK, lang = en_US
>> AppleSpellChecker.cpp (95): spellCheck: "somme " = OK, lang = en_US
>
> Is there a way to avoid this "feature"?

I don't know. It's a black box. It's an OS service.
Perhaps it can be configured somewhere, via System Preferences or API.

With LyX you can use hunspell as the spell checker backend instead.

Stephan
Jean-Marc Lasgouttes
2014-04-11 10:16:49 UTC
Permalink
11/04/2014 12:01, Stephan Witt:
>> Is there a way to avoid this "feature"?
>
> I don't know. It's a black box. It's a OS service.
> Perhaps it can be configured somewhere, via System Preferences or API.
>
> With LyX you can use hunspell as the spell checker backend instead.
>
> Stephan
>

It looks like there is at least some control.
http://macs.about.com/od/OSXLion107/qt/Os-X-Lion-Automatic-Spelling-Correction.htm

This feature is really scary. Can we limit the allowed languages? In
that case you could maybe only send strings in the same language.

JMarc
Georg Baum
2014-04-10 18:43:17 UTC
Permalink
Jürgen Spitzmüller wrote:

> The point is that users cannot do something sensible with such marked
> words (except for adding them into the personal dictionary).

It is probably not difficult to implement sensible behaviour for "ignore"
and "ignore all" for these words: HunspellChecker already has a member
variable ignored_ which tracks ignored words, so if words which created an
encoding error on spell checking were kept in a different list as well,
then "ignore" and "ignore all" could simply add the affected words to the
ignored list.

> Actually, I tend to convert all hunspell dictionaries to utf8. This seems
> the only proper solution to this problem.

Does this mean that we need to maintain our own versions? If not then it is
probably the best solution, if yes then I'd rather not do it.


Georg
Stephan Witt
2014-04-10 20:30:40 UTC
Permalink
Am 10.04.2014 um 20:43 schrieb Georg Baum <***@post.rwth-aachen.de>:

> Jürgen Spitzmüller wrote:
>
>> The point is that users cannot do something sensible with such marked
>> words (except for adding them into the personal dictionary).
>
> It is probably not difficult to implement sensible behaviour for "ignore"
> and "ignore all" for these words: HunspellChecker already has a member
> variable ignored_ which tracks ignored words, so if words which created an
> encoding error on spell checking were kept in a different list as well,
> then "ignore" and "ignore all" could simply add the affected words to the
> ignored list.

Like another personal word list, but not a persistent one.

BTW: it depends on the spellchecker how it works.

This is the debug output of the Apple builtin spell checker:

AppleSpellChecker.cpp (95): spellCheck: "This is mixing languages with writing systems, IMHO. In fact language sometimes has an implication on the spelling of names (if it comes to transliteration), but with rather surpring effects. For instance, the Russian name Воло́шинов is usually written Vološinov in German, but Voloshinov in English. Is "š" a "German" character? " = FAILED, lang = en_US
Paragraph.cpp (4115): misspelled word: "surpring" [174..181]
Paragraph.cpp (4115): misspelled word: "Vološinov" [253..261]
Paragraph.cpp (4115): misspelled word: "Voloshinov" [278..287]

The "ignore" button simply works.

>> Actually, I tend to convert all hunspell dictionaries to utf8. This seems
>> the only proper solution to this problem.
>
> Does this mean that we need to maintain our own versions? If not then it is
> probably the best solution, if yes then I'd rather not do it.

+1

Stephan
Jürgen Spitzmüller
2014-04-11 07:55:48 UTC
Permalink
2014-04-10 22:30 GMT+02:00 Stephan Witt:

> Am 10.04.2014 um 20:43 schrieb Georg Baum:
> > It is probably not difficult to implement sensible behaviour for
> "ignore"
> > and "ignore all" for these words: HunspellChecker has already a member
> > variable ignored_ which tracks ignored words, so if words which created
> an
> > encoding error on spell checking would be kept in a different list as
> well,
> > then "ignore" and "ignore all" could simply add the affceted words to the
> > ignored list.
>
> Like another personal word list, but not a persistent one.
>

Yes, this sounds like a good idea.

Jürgen
Jürgen Spitzmüller
2014-04-11 07:52:35 UTC
Permalink
2014-04-10 20:43 GMT+02:00 Georg Baum:

> Does this mean that we need to maintain our own versions? If not then it is
> probably the best solution, if yes then I'd rather not do it.
>

The Chromium project does maintain utf8 versions (or "deltas", for that
matter):
http://www.chromium.org/developers/how-tos/editing-the-spell-checking-dictionaries

Jürgen

Jean-Marc Lasgouttes
2014-04-09 07:52:36 UTC
Permalink
09/04/2014 08:14, Jürgen Spitzmüller:
> 2014-04-08 22:42 GMT+02:00 Georg Baum:
>
> The change in src/support/unicode.cpp is problematic: It disables
> all error
> handling, not only the lyxerr output. Also, if you now throw an
> exception
> there, you need to make sure that all callers can cope with that.
> Maybe one
> solution would be to throw the exception at the very end (after the
> error
> handling), and give it exactly the error message which is now written to
> lyxerr. Then each caller can decide what it wants to do with the error
> message.
>
>
> Thanks. I feared that. I put the exception that early in order to
> suppress the lyxerr message. In this case we need to audit all callers.
> Will postpone this.

I think the lyxerr message could be rewritten to be at least useful. Who
ever found a use for the hex dump anyway?

JMarc
Jürgen Spitzmüller
2014-04-09 08:41:11 UTC
Permalink
2014-04-09 9:52 GMT+02:00 Jean-Marc Lasgouttes <***@lyx.org>:

> I think the lyxerr message could be rewritten to be at least useful. Who
> ever found a use for the hex dump anyway?
>

Agreed.

Jürgen


Georg Baum
2014-04-09 19:01:02 UTC
Permalink
Jürgen Spitzmüller wrote:

> 2014-04-09 9:52 GMT+02:00 Jean-Marc Lasgouttes <***@lyx.org>:
>
>> I think the lyxerr message could be rewritten to be at least useful. Who
>> ever found a use for the hex dump anyway?
>>
>
> Agreed.

Me too. I think this output is still unchanged from the time when we had
bugs in our own unicode code.

However, I still think that there is a problem unrelated to the error
output. I understand that it is annoying if names are underlined, but on the
other hand it is also annoying if a misspelled word is not underlined.
Unfortunately I have no solution.



Georg