Lyx<-->Word conversion and images: suggestions sought

Discussion:

stefano franchi

2014-05-29 14:20:05 UTC

Dear all:

Prannoy just hit upon a problem in tex4ht's conversion of images to odt
format.
This is what tex4ht should do:

1. Read the correct reference from the tex file
2. If necessary, convert the image to a suitable (bitmap) format for
inclusion in the odt file
3. copy the converted image to a proper subfolder in the odt file (which
really is a zipped archive, as you know).

The system works fine if the images are in the same directory/folder as the
.tex source file, but tex4ht gets its references screwed up otherwise. It
copies the images to the right "internal" folder (inside the odt file), but
the xml code is incorrect and libreoffice cannot find the images when you
open the file.
Rather than messing with the tex4ht code, it seems to me the easier
workaround is to copy all the images and the tex source file to a temporary
folder and do the conversion there. This is a similar approach to what LyX
itself does before compiling LaTeX, in fact, and it is also what the native
HTML converter does, I believe.

Issues on which feedback is welcome:

0. General thoughts about the approach (copying images vs. fixing tex4ht
processing)

1. My thinking is that it may be relatively easy to do the copying by
operating on the LyX source code, i.e. via a python script scanning the
Graphics inset. That would tie the tool to the file format, though.

2. For the roundtrip conversion, we also need to keep the original filename
references and store them somewhere in the odt file (in an annotation
field). Are there any platform-related issues on filenames we should be
aware of (encoding, folder delimiters, etc)?

Cheers,

Stefano
--
__________________________________________________
Stefano Franchi
Associate Research Professor
Department of Hispanic Studies Ph: +1 (979) 845-2125
Texas A&M University Fax: +1 (979) 845-6421
College Station, Texas, USA

***@tamu.edu
http://stefano.cleinias.org

Cyrille Artho

2014-05-29 23:41:17 UTC

Dear all,
My thoughts:

I think it is not too uncommon to have two images with the same name in
different directories, such as pics/old/arch.fig and pics/new/arch.fig.

Obviously in such cases the renaming scheme needs to take this into
account, by converting path separators ("/") to another character. However,
that in turn means that that "other character" should not appear in the
file name, so a means of "escaping" it would be needed as well.

For example: Path separators get converted to "_-"; "_" get converted to
"__". In that case, a reverse substitution is always possible.

In short, the solution is workable but has its drawbacks.
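
The renaming scheme above can be sketched in a few lines of Python. This is only an illustration: the function names flatten_path and unflatten_path are made up for this example, and it assumes "/" is the only separator in the stored reference. The decoder scans left to right, because naive string replacement in the reverse direction would misread a name such as "a_-b".

```python
def flatten_path(path):
    """Encode a relative path as one flat file name (hypothetical helper)."""
    # Escape existing underscores first, so that "_-" can only mean a separator.
    return path.replace("_", "__").replace("/", "_-")


def unflatten_path(name):
    """Invert flatten_path by reading the escape pairs from left to right."""
    out = []
    i = 0
    while i < len(name):
        if name[i] != "_":
            out.append(name[i])
            i += 1
            continue
        pair = name[i:i + 2]
        if pair == "__":
            out.append("_")
        elif pair == "_-":
            out.append("/")
        else:
            raise ValueError("not a flattened name: %r" % name)
        i += 2
    return "".join(out)


print(flatten_path("pics/old/arch.fig"))
print(flatten_path("pics/new/arch.fig"))
```

The two arch.fig files end up with distinct flat names, and unflatten_path recovers the original reference from either of them.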

If it's possible to fix tex4ht instead, that would be better as tex4ht
users would also benefit. I agree that the code is not so easy to
read/maintain, but this may be a good starting point.

A full "flattening" of path separators has the advantage that we don't have
to worry about the platform the resulting document is used on. In other
words, if we export a nested path name, then the target computer has to
understand the path delimiter in the same way as the source computer. I
guess nowadays any computer understands "/" so maybe this is not an issue
with .odt. If "/" can indeed be used for .odt path names even on Windows,
then keeping the original path names may be the more elegant solution, even
if fixing tex4ht may be a bit difficult.

--
Regards,
Cyrille Artho - http://artho.com/
Give a man a fish, and you feed him for a day.
Teach a man to fish, and he'll invite himself over for dinner.
-- Calvin Keegan

stefano franchi

2014-05-30 15:40:42 UTC

Good points Cyrille, thanks for the feedback.

Prannoy is now working on related issues in image conversion and set aside
this particular problem for the moment. I guess a more thorough
investigation of why tex4ht gets the references wrong would be warranted,
and it is not an easy task.

Cheers,

Stefano

Georg Baum

2014-05-30 20:53:42 UTC

The reversible renaming scheme assumes that the original file name is not
stored as additional metadata (otherwise no bidirectional mapping is needed).
It is elegant, but may lead to very long file names inside the archive. I am
not sure which alternative is better.

Post by Cyrille Artho
If it's possible to fix tex4ht instead, that would be better as tex4ht
users would also benefit. I agree that the code is not so easy to
read/maintain, but this may be a good starting point.
A full "flattening" of path separators has the advantage that we don't
have to worry about the platform the resulting document is used on. In
other words, if we export a nested path name, then the target computer has
to understand the path delimiter in the same way as the source computer. I
guess nowadays any computer understands "/" so maybe this is not an issue
with .odt. If "/" can indeed be used for .odt path names even on Windows,
then keeping the original path names may be the more elegant solution,
even if fixing tex4ht may be a bit difficult.

The delimiter is not the problem. Even if '/' did not work on Windows (it
does in 99% of the cases), the converter could easily convert it to '\' when
converting from .odt to .lyx on Windows. Unfortunately, you need to consider
the platform even for flat names (see the wiki page in my other message): a
colon ':' or a double quote '"' is not allowed in Windows filenames, but
works fine on Unix.
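
As a rough illustration of the flat-name problem, the following Python sketch lists the characters in a single file name that Windows would reject. The helper name windows_unsafe_chars is invented for this example, and the check is deliberately incomplete: it covers the reserved punctuation and control characters only, not reserved device names or trailing dots and spaces.

```python
# Characters that Windows does not allow in a file name.
WINDOWS_FORBIDDEN = set('<>:"/\\|?*')


def windows_unsafe_chars(filename):
    """Return the sorted characters of filename that Windows rejects.

    filename is a single name, not a path (hypothetical helper).
    """
    return sorted({c for c in filename
                   if c in WINDOWS_FORBIDDEN or ord(c) < 32})


print(windows_unsafe_chars('fig:1 "draft".png'))
print(windows_unsafe_chars("arch.png"))
```

A converter could run such a check before writing a flat name and escape or reject the offending characters, whatever platform it runs on.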

Georg

Georg Baum

2014-05-30 20:42:44 UTC

Post by stefano franchi
Prannoy just hit upon a problem in tex4ht's conversion of images to odt
format.
This is what tex4ht should do:
1. Read the correct reference from the tex file
2. If necessary, convert the image to a suitable (bitmap) format for
inclusion in the odt file

The image conversion could also be delegated to LyX: if LyX knew that
a LaTeX export is really meant for odt production, it could use its own
converter machinery and provide both the file needed for tex4ht and the one
for inclusion in odt. The advantage would be that the conversion would always
start from the original image, which can result in higher quality or no
conversion at all (e.g. if the original image is suitable for odt but not
for tex).

Post by stefano franchi
3. copy the converted image to a proper subfolder in the odt file (which
really is a zipped archive, as you know).
The system works fine if the images are in the same directory/folder as
the .tex source file, but tex4ht gets its references screwed up otherwise.
It copies the images to the right "internal" folder (inside the odt file),
but the xml code is incorrect and libreoffice cannot find the images when
you open the file.

Why is it important to have the same directory structure inside the .odt as
the user uses for LyX outside? This structure is internal, the user never
sees it. Therefore I'd simply use a flat structure internally, and no tex4ht
changes are required.

Post by stefano franchi
Rather than messing with the tex4ht code, it seems to me the easier
workaround is to copy all the images and the tex source file to a
temporary folder and do the conversion there. This is a similar approach
to what LyX itself does before compiling LaTeX, in fact, and it is also
what the native HTML converter does, I believe.

If you execute the converter as part of the standard conversion chain in
LyX, then LyX will automatically flatten the directory structure, and all
included files are in the same directory as the .tex file itself.

Post by stefano franchi
0. General thoughts about the approach (copying images vs. fixing tex4ht
processing)
1. My thinking is that it may be relatively easy to do the copying by
operating on the LyX source code, i.e. via a python script scanning the
Graphics inset. That would tie the tool to the file format, though.

I would avoid that if possible, since this introduces ambiguity: suddenly
you read contents from both .tex and .lyx files. How do you ensure that
these two files are consistent? If it is really too difficult to fix
tex4ht, then I'd rather scan the .tex file (similar to
lib/scripts/tex_copy.py) instead.
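
A minimal sketch of the "scan the .tex file" idea, in Python. It is not the code of tex_copy.py, whose details are not shown in this thread. The name stage_for_conversion and the regular expression are assumptions made for illustration; the sketch also assumes that the .tex file is UTF-8 and that every \includegraphics reference carries its file extension.

```python
import re
import shutil
import tempfile
from pathlib import Path

# Matches \includegraphics{file} and \includegraphics[options]{file}.
INCLUDEGRAPHICS = re.compile(
    r"\\includegraphics\s*(\[[^\]]*\])?\s*\{([^}]+)\}")


def stage_for_conversion(tex_file):
    """Copy tex_file and its images into one fresh temporary directory.

    Returns (workdir, mapping of flat name -> original reference).
    Hypothetical helper, not part of LyX or tex4ht.
    """
    tex_file = Path(tex_file)
    workdir = Path(tempfile.mkdtemp(prefix="odt-export-"))
    used = {}

    def relocate(match):
        options, ref = match.group(1) or "", match.group(2)
        flat = Path(ref).name
        # Two images may share a base name (pics/old/arch.png and
        # pics/new/arch.png), so pick a free flat name if needed.
        n = 1
        while flat in used and used[flat] != ref:
            flat = "%s-%d%s" % (Path(ref).stem, n, Path(ref).suffix)
            n += 1
        used[flat] = ref
        shutil.copy2(tex_file.parent / ref, workdir / flat)
        return "\\includegraphics%s{%s}" % (options, flat)

    text = tex_file.read_text(encoding="utf-8")
    staged = INCLUDEGRAPHICS.sub(relocate, text)
    (workdir / tex_file.name).write_text(staged, encoding="utf-8")
    return workdir, used
```

The returned mapping is the piece of information a roundtrip would later need in order to restore the original references.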

Post by stefano franchi
2. For the roundtrip conversion, we also need to keep the original
filename references and store them somewhere in the odt file (in an
annotation field). Are there any platform-related issues on filenames we
should be aware of (encoding, folder delimiters, etc)?

There are several (http://wiki.lyx.org/LaTeX/FilesWithSpecialChars might
help). What would you do with absolute paths? It may be impossible to
restore them. Folder delimiters are easy, you can use forward slashes for
internal storage, just as in LyX. The encoding is always the one which is
active in the .tex file at the place where the filename occurs. Also keep in
mind that for master/child documents relative paths are always relative to
the master in .tex files.

Preserving the paths is only important for roundtrip. For one-way export it
does not matter. To me it does not look like the most important problem, and
I'd postpone it. If the image conversion works in general (for files in the
same directory as the main .tex file), then one could think about which additional
metadata would be needed to restore them to the original location for
roundtrip. This might not only be the file name, it might also be the
complete image in the original format.

Georg

stefano franchi

2014-05-31 01:08:21 UTC

Post by Georg Baum

Excellent point. I hadn't thought of this. So let's say LyX's existing
converters take care of the conversions.

Post by stefano franchi
The system works fine if the images are in the same directory/folder as
the .tex source file, but tex4ht gets its references screwed up otherwise.
It copies the images to the right "internal" folder (inside the odt file),
but the xml code is incorrect and libreoffice cannot find the images when
you open the file.

Sorry, I didn't explain the problem correctly. Of course, the internal
structure of the ODT file should not matter. But tex4ht, instead, tries
(and fails) to recreate the folder structure *inside* the odt zipped
archive. That's the (known) bug.

Example: Say I have a tex file test.tex in folder "Folder" which contains a
png image picture.png in the same folder. tex4ht will create an ODT file
containing a subfolder called "Pictures" and put picture.png in it, then
insert the correct reference in the content.xml file, and everything works
fine.

But now say the image is in Folder/Figures/picture.png. tex4ht will create
a folder called Pictures/Figures inside the ODT archive, and will screw up
the reference to it in the content.xml file. What it should do is act as in
the previous case: copy the image to the Pictures folder and
*not* try to recreate the folder structure.
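
One way to see the symptom is to compare the image references in content.xml with the files actually stored in the archive. The Python sketch below does only that; it does not reproduce or explain what tex4ht writes. The function name dangling_image_refs is invented for this example; the two namespace URIs are the standard ODF drawing and XLink namespaces.

```python
import zipfile
import xml.etree.ElementTree as ET

DRAW_IMAGE = "{urn:oasis:names:tc:opendocument:xmlns:drawing:1.0}image"
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"


def dangling_image_refs(odt_path):
    """Return image hrefs in content.xml with no matching archive member.

    odt_path may be a file name or a file-like object (hypothetical helper).
    """
    with zipfile.ZipFile(odt_path) as odt:
        members = set(odt.namelist())
        root = ET.fromstring(odt.read("content.xml"))
    missing = []
    for image in root.iter(DRAW_IMAGE):
        href = image.get(XLINK_HREF)
        if not href or "://" in href:
            continue  # embedded binary data or an external link
        if href.startswith("./"):
            href = href[2:]
        if href not in members:
            missing.append(href)
    return missing
```

An empty result for the same-folder case and a non-empty one for the Folder/Figures case would confirm that the reference and the stored file disagree.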

Post by stefano franchi
Rather than messing with the tex4ht code, it seems to me the easier
workaround is to copy all the images and the tex source file to a
temporary folder and do the conversion there. This is a similar approach
to what LyX itself does before compiling LaTeX, in fact, and it is also
what the native HTML converter does, I believe.

Post by Georg Baum
If you execute the converter as part of the standard conversion chain in
LyX, then LyX will automatically flatten the directory structure, and all
included files are in the same directory as the .tex file itself.

Good point.

Post by Georg Baum

I'll take a look at tex_copy.py. Seems like the best way to go, at least in
the interim.

Post by Georg Baum

There are several (http://wiki.lyx.org/LaTeX/FilesWithSpecialChars might
help). What would you do with absolute paths? It may be impossible to
restore them. Folder delimiters are easy, you can use forward slashes for
internal storage, just as in LyX. The encoding is always the one which is
active in the .tex file at the place where the filename occurs. Also keep in
mind that for master/child documents relative paths are always relative to
the master in .tex files.
Preserving the paths is only important for roundtrip. For one-way export it
does not matter. To me it does not look like the most important problem, and
I'd postpone it. If the image conversion works in general (for files in the
same directory as the main .tex file), then one could think about which additional
metadata would be needed to restore them to the original location for
roundtrip. This might not only be the file name, it might also be the
complete image in the original format.

Your comments make me think we should perhaps look at the problem a bit
differently. The two basic scenarios for Word export are:

1. cooperation with word user (roundtrip)

2. interaction with publishers who only accept .doc format

Given the limitations of odt's graphic formats, we may assume that the
images contained *in* the ODT file are really just draft/placeholders and
the real (higher def, vector, etc.) images are stored separately and will
either be sent to the publisher separately (scenario 2) or dealt with
appropriately by LaTeX for final pdf production (scenario 1).

Under this assumption, we are free to delegate both image conversion and
directory flattening to LyX and we only need to make sure the ODT file has
the correct metadata to recreate the image reference as originally stored
in the LyX file.
And we can avoid storing the original image *in* the ODT file (if that's
possible, I don't know).

Stefano

Georg Baum

2014-06-02 19:02:37 UTC

Post by stefano franchi
Sorry, I didn't explain the problem correctly. Of course, the internal
structure of the ODT file should not matter. But tex4ht, instead, tries
(and fails) to recreate the folder structure *inside* the odt zipped
archive. That's the (known) bug.
Example: Say I have a tex file test.tex in folder "Folder" which contains a
png image picture.png in the same folder. tex4ht will create an ODT file
containing a subfolder called "Pictures" and put picture.png in it, then
insert the correct reference in the content.xml file, and everything works
fine.
But now say the image is in Folder/Figures/picture.png. tex4ht will create
a folder called Pictures/Figures inside the ODT archive, and will screw up
the reference to it in the content.xml file. What it should do is act as in
the previous case: copy the image to the Pictures folder and
*not* try to recreate the folder structure.

Ah, OK. I guessed what the problem of tex4ht was, but I misunderstood the
wanted behaviour.

Post by stefano franchi
Your comments make me think we should perhaps look at the problem a bit
differently. The two basic scenarios for Word export are:
1. cooperation with word user (roundtrip)
2. interaction with publishers who only accept .doc format
Given the limitations of odt's graphic formats, we may assume that the
images contained *in* the ODT file are really just draft/placeholders and
the real (higher def, vector, etc.) images are stored separately and will
either be sent to the publisher separately (scenario 2) or dealt with
appropriately by LaTeX for final pdf production (scenario 1).
Under this assumption, we are free to delegate both image conversion and
directory flattening to LyX and we only need to make sure the ODT file has
the correct metadata to recreate the image reference as originally stored
in the LyX file.
And we can avoid storing the original image *in* the ODT file (if that's
possible, I don't know).

Sounds sensible. However, it is hard to believe that the odt format does not
support at least one vector graphics and one bitmap graphics format. I would
guess that it is possible (with some effort) to store good quality images
inside the .odt (but this is rather a long term thought, nothing for now).

Georg

stefano franchi

2014-06-03 13:39:13 UTC

Post by stefano franchi
Under this assumption, we are free to delegate both image conversion and
directory flattening to LyX and we only need to make sure the ODT file has
the correct metadata to recreate the image reference as originally stored
in the LyX file.
And we can avoid storing the original image *in* the ODT file (if that's
possible, I don't know).

You're right, I didn't make myself clear again. LibreOffice supports a
number of graphic formats. Not pdf though. It does support EPS and SVG,
plus a number of common bitmap formats such as png, gif, jpeg, etcetera.
(SVG does not seem to work well, though, at least not with my
inkscape-generated SVG files).
Prannoy has worked out how to instruct tex4ht to use "convert" and produce
png from pdf. I guess he could substitute a conversion to EPS instead (with
pdftops -eps), but I am not sure if there is any real benefit. I am really
having trouble imagining a scenario in which a self-contained, converted
ODT file would be used to produce high-quality printed output.
But I may be lacking imagination.
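
For reference, the two conversions mentioned above could be driven from Python roughly as follows. This is a sketch only: the options Prannoy actually passes to "convert" are not stated in this thread, so the -density value of 300 is an assumption, and conversion_command is a made-up helper that builds the command line without running it.

```python
from pathlib import Path


def conversion_command(source, target_format="png", density=300):
    """Build the command line that converts a pdf image (hypothetical helper)."""
    source = Path(source)
    target = source.with_suffix("." + target_format)
    if target_format == "png":
        # ImageMagick: -density sets the rasterisation resolution and
        # has to come before the input file.
        return ["convert", "-density", str(density),
                str(source), str(target)]
    if target_format == "eps":
        # pdftops from poppler/xpdf; -eps asks for Encapsulated PostScript.
        return ["pdftops", "-eps", str(source), str(target)]
    raise ValueError("unsupported target format: %s" % target_format)


print(conversion_command("arch.pdf"))
print(conversion_command("arch.pdf", "eps"))
```

Passing the resulting list to subprocess.run(cmd, check=True) would perform the conversion, provided the corresponding tool is installed.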

Cheers,

S.

Georg Baum

2014-06-03 19:57:01 UTC

Post by stefano franchi
You're right, I didn't make myself clear again. LibreOffice supports a
number of graphic formats. Not pdf though. It does support EPS and SVG,
plus a number of common bitmap formats such as png, gif, jpeg, etcetera.
(SVG does not seem to work well, though, at least not with my
inkscape-generated SVG files).

Thanks for the explanation, I understand it now.

Post by stefano franchi
Prannoy has worked out how to instruct tex4ht to use "convert" and
produce png from pdf. I guess he could substitute a conversion to EPS
instead (with pdftops -eps), but I am not sure if there is any real
benefit. I am really having trouble imagining a scenario in which a
self-contained, converted ODT file would be used to produce high-quality
printed output. But I may be lacking imagination.

I have no idea whether that scenario is used. However, it is not important
right now IMHO: If tex4ht can be made to call convert, then it could use any
converter later if needed (in theory).

Georg

stefano franchi

2014-06-03 22:32:39 UTC

Post by Georg Baum

I have no idea whether that scenario is used. However, it is not important
right now IMHO: If tex4ht can be made to call convert, then it could use any
converter later if needed (in theory).

That (calling convert) was achieved in the last few days. It is even
capable of passing the \includegraphics scaling parameter to it, so that
both PDFs (once converted to PNG) and PNGs are converted and scaled
appropriately.
To simplify tex4ht-level support for conversion (which is not pretty, let
alone readable), I think we could limit it to the graphics formats that
pdflatex itself (and its followers such as XeTeX and LuaTeX) supports: pdf,
jpeg, png. If the user inserts images in some other format (since LyX
supports more), the conversion to one of these three formats can be done by
LyX itself on export to ODT, as it does now on export to pdflatex.

At any rate, Prannoy is now moving to tex4ht's math support, where I expect
thornier issues will come up soon.

Cheers,

Stefano