# LyxToWordConversion

Categories: GSOC

Roundtrip Conversion from LyX to ODT

The overall objective of the project is to develop a framework that enables round-trip conversion between LyX and ODT. Unlike a normal export, importing the exported ODT file back into LyX should yield the same LyX file. The first goal is to get one-click export to ODT up and running. The framework should be backend independent, so that it can be reused, with appropriate changes, for another conversion once its feature set has been decided. Metadata needs to be carried along to make the round trip possible. The final outcome of the project will be a robust conversion tool from LyX to ODT and a prototype of the reverse conversion tool.

# Developers Involved in this Project

Student participant: Prannoy Pilligundla

Mentors: Stefano Franchi, Richard Heck, Cyrille Artho, Vincent van Ravesteijn

# Implementation

After analysing and exploring many approaches, going ahead with TeX4ht for the conversion seems the best way to go. TeX4ht doesn’t work on LyX files directly, so LyX to LaTeX export is essential in this approach. The first step is to fix all the issues TeX4ht has in converting the feature set: sections, headers, lists, comments, tracked changes, math, bibliography, tables, figures, cross references, footnotes, emphasis etc. This involves editing sources written in a special literate programming format and inserting configurable hooks into tex4ht-ooffice.tex, which in turn involves some command redefinitions in the *.4ht files. Once the export to ODT has been tuned to the requirements, the next step is to replace the last step of TeX4ht with a Lua module for LuaTeX that relies on the same *.4ht files. In that last step TeX4ht processes the DVI file, which contains special instructions to complete the conversion. A package needs to be written in Lua which has full access to the nodes created by LuaTeX and loads tex4ht.sty to have access to the *.4ht files. LuaTeX gives access to internals of the TeX engine that were previously inaccessible to programmers. Using LuaTeX, the special instructions of TeX4ht need to be intercepted before they are written into the DVI file and then handled appropriately to complete the conversion.
This needs to be done because TeX4ht doesn’t support fonts without a tfm table, which means that OpenType and TrueType fonts used by fontspec cannot be used. Proof of this concept can be seen here. The next objective is to include metadata in the export and to prototype the reverse conversion by exploring various ways to do an ODT to LaTeX conversion, as a LaTeX to LyX converter already exists. After carrying out tests on Writer2LaTeX, a decision needs to be taken on whether to use an existing library like Writer2LaTeX or to write a Python script that imports directly from ODT to LyX.

# Schedule

Till 18 May: Build TeX4ht from source and make changes in the makefile to fix issues (if any) in building from source. Dig deep into the working of TeX4ht and understand how it works. Write some basic configuration files to get into the flow of things. Get familiar with LuaTeX and explore the Lua-based approach of intercepting special instructions before they are written into the DVI file.
19 May: Begin coding

| From | To | Milestone |
| --- | --- | --- |
| 19 May | 26 May | Fix the issues in the conversion of the quotation, verse, subparagraph and poem title environments and improve the conversion of lists |
| 27 May | 2 June | Fix the issues in the conversion of sectioning, emphasis, comments and greyed out notes |
| 3 June | 9 June | Improve Biblatex support and find out issues with styles like Biblatex-Philosophy. Work on successful conversion of tables, cross references and figures |
| 10 June | 23 June | Improve MathML support. Do rigorous testing to observe cases of failure and write a configuration file to load images instead of MathML in the areas where TeX4ht fails. Store the LaTeX expression as it is using the annotation property (metadata for round trip) |
| 24 June | 30 June | Review the overall export. Buffer time in case of any unexpected technical difficulties |
| 1 July | 14 July | Provide fontspec support by replacing the last step of TeX4ht with a Lua module for LuaTeX |
| 15 July | 21 July | Buffer time to complete the Lua module and, in parallel, start work on including metadata in the export |
| 22 July | 28 July | Complete a successful export along with the metadata and integrate the LyX to LaTeX converter with the LaTeX to ODT converter. Provide a one-click export from LyX to ODT |
| 29 July | 4 August | Explore various options to get back to LaTeX from ODT and propose a solution to do a round trip along with the metadata |
| 5 August | 11 August | Debugging and making final changes. Mainly a buffer period to tackle unexpected difficulties |
| 12 August | 18 August | Make changes as demanded by the TeX4ht license, improve documentation and make changes suggested by the community |

2 June (First Subgoal): TeX4ht successfully converts LaTeX to ODT for a limited feature set of standard paragraphs, sections, emphasis, comments, lists and greyed out notes
23 June (Midterm milestone): Successful conversion of LaTeX to ODT
14 July (Second Subgoal): Lua module successfully replaces the last step of TeX4ht
11 August: Pencils down stage, final goal almost achieved
22 August: Final Evaluation

# Scope

Though this project focuses on conversion of LyX to ODT, it may still attract the attention of users who want to convert LyX to .docx. When the generated ODT is converted to .docx using LibreOffice and the .docx file is opened in MS Word, the results are pretty good for most things, but there are some issues with tables: all the information is properly formatted, but no lines are drawn between rows and columns. As expected, everything works fine when the .docx file is opened with LibreOffice itself. Similar issues arise when the ODT file is opened directly with MS Word. These tests are not conclusive for math, because the translation of formulas from LaTeX to ODT is not yet perfect, so nothing can be said about ODT to .docx conversion of math as of now.

# Progress

## LyX to ODT

• A shell script to convert a LyX file directly to ODT
• A shell script to compile all the literate sources of tex4ht and save the generated files in the relevant directories
• Issue with level 5 Paragraph environment fixed
• \texttt style uses TeXGyreCursor instead of monospace
• Quotation, Quote and Verse environments are no longer put in a different section; they are left in the main section with the proper paragraph style
• Issue with spurious characters in lists, sections etc. is fixed (spurious characters used to appear in many places, for example between the numbering and the text in sections)
• Configured a LyX specific environment "lyxgreyedout" for ODT conversion
• Poemtitle environment is converted into a corresponding environment
• Fixed the issue with Tables
• Even pdf images are correctly imported into the resultant ODT file
• Issue with sizing of images is fixed. Images are imported according to the scale
• Issue with \array, equations and tables with expressions is fixed
• Created new environments for inline math and other math environments, which carry the LaTeX math expression as it is into the XML of the resultant ODT file
• Edited configure.py to provide an option of exporting to ODT (Roundtrip) format
• Have a look at the tex4htTesting branch of the LyX GSoC repository for more details
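
The one-step conversion from the first bullet chains two existing tools: LyX's own LaTeX export and TeX4ht's `mk4ht oolatex` mode. The following Python sketch shows only that chaining; the project's actual shell script is not reproduced on this page, so the function names `export_commands` and `export_to_odt` are hypothetical, and the sketch assumes both `lyx` and `mk4ht` are on the PATH.

```python
import subprocess
from pathlib import Path


def export_commands(lyx_file):
    """Return the two command lines needed to go from .lyx to .odt."""
    lyx_path = Path(lyx_file)
    tex_path = lyx_path.with_suffix(".tex")
    return [
        # Step 1: LyX to LaTeX, using LyX's command line export
        ["lyx", "--export", "latex", str(lyx_path)],
        # Step 2: LaTeX to ODT, using TeX4ht's OpenOffice mode
        ["mk4ht", "oolatex", str(tex_path)],
    ]


def export_to_odt(lyx_file):
    """Run both steps in order and stop at the first failure."""
    for command in export_commands(lyx_file):
        subprocess.run(command, check=True)
```

Any extra processing of the tex file, such as injecting metadata, would sit between the two commands.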

## ODT to LyX

• Wrote a Python script which successfully converts ODT to LyX
• Supports conversion of Standard environments, Sections, Lists, Figures, Tables, Footnotes and Floats
• For a modified ODT file the expression is brought back in StarMath notation; for a non-modified ODT file the LaTeX expression is brought back as it is from the metadata
• Edited configure.py to provide an option in LyX for importing ODT
• Have a look at the odt2lyx branch of the LyX GSoC repository for more details

# General Documentation

## Compiling tex4ht-ooffice.tex

After making changes in tex4ht-ooffice.tex, compiling it with "tex tex4ht-ooffice.tex" produces several output files, one of which is ooffice.4ht. For the changes to take effect, this file needs to be kept in the same directory as the tex file to be converted. While developing on Linux I wrote a shell script which uses the Makefile to compile tex4ht-ooffice.tex directly if the argument given is ooffice.4ht. After compilation is done, it moves ooffice.4ht into the 4htfiles directory and the other generated files into the appropriate directories.
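
The compile-then-move routine can be sketched in Python as follows. This is not the shell script mentioned above, which works through the Makefile; the function names and the `*.4ht` glob pattern are assumptions made for illustration, and `compile_literate_source` simply runs plain `tex` on the literate source.

```python
import shutil
import subprocess
from pathlib import Path


def compile_literate_source(source, workdir):
    """Run plain TeX on a literate source such as tex4ht-ooffice.tex."""
    subprocess.run(["tex", source], cwd=workdir, check=True)


def move_generated(workdir, dest, pattern="*.4ht"):
    """Move generated files matching pattern from workdir into dest."""
    dest_dir = Path(dest)
    dest_dir.mkdir(parents=True, exist_ok=True)
    moved = []
    for path in sorted(Path(workdir).glob(pattern)):
        shutil.move(str(path), str(dest_dir / path.name))
        moved.append(path.name)
    return moved
```

A call such as `move_generated(".", "4htfiles")` after compilation would collect ooffice.4ht; other generated files could be handled with further patterns.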

## Configuring Greyedout Text in tex4ht

TeX4ht was treating lyxgreyedout as normal text, so I just had to style it appropriately. As it is an environment, it has four default hooks: before-env, after-env, before-list and after-list. For lyxgreyedout, before-list and after-list are of no use, as lyxgreyedout doesn't have anything like a heading. So we only need to configure two of the four hooks, before-env and after-env.

This is the piece of code I added to tex4ht-ooffice.tex for configuring lyxgreyedout:

    \<ooffice begin-end env\><<<
    \NewConfigureOO{lyxgreyedout}
    \ConfigureOO{lyxgreyedout}{%
    <style:style style:name="First-line-indent"
    style:family="paragraph"
    style:parent-style-name="Text-body"
    style:class="text">\Hnewline
    <style:paragraph-properties  fo:margin-left="0cm"
    fo:margin-right="0cm"
    fo:text-indent="0.499cm"
    style:auto-text-indent="false"/>
    </style:style>\Hnewline
    <style:style style:name="greyedoutnote" style:family="text">\Hnewline
    <style:text-properties fo:color="\#808080"/>\Hnewline
    </style:style>\Hnewline
    }
    \ConfigureEnv{lyxgreyedout}{\HCode{<text:p text:style-name="First-line-indent">}\HCode{<text:span text:style-name="greyedoutnote">}}
    {\HCode{</text:span>}\HCode{</text:p>\Hnewline}}{}{}

    >>>


The \<ooffice begin-end env\> header is also very important, as it determines how the piece of code under it is treated when tex4ht-ooffice.tex is compiled to generate ooffice.4ht. If we wrote something like \<configure ooffice tex4ht\> instead, things would not work. \NewConfigureOO{} is used to write something into the styles.xml of the ODT file, so in \ConfigureOO{lyxgreyedout} I created a proper style to be used for lyxgreyedout. This style can then be used directly inside content.xml by setting text:style-name to "greyedoutnote". Next I used \ConfigureEnv{lyxgreyedout} to configure the environment. The hooks of \ConfigureEnv are \ConfigureEnv{name}{before-env}{after-env}{before-list}{after-list}. As we are using only before-env and after-env, I filled in those two and left the other two empty.
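
To make the effect of the two hooks concrete, the Python sketch below builds the markup that the before-env and after-env code wraps around the text of a greyed out note. It only mirrors the \HCode strings from the configuration above; the function name `greyedout_fragment` is hypothetical, and real TeX4ht output also contains line breaks and whatever markup the note's own content produces.

```python
from xml.sax.saxutils import escape


def greyedout_fragment(text):
    """Wrap text the way the lyxgreyedout before-env/after-env hooks do."""
    before_env = ('<text:p text:style-name="First-line-indent">'
                  '<text:span text:style-name="greyedoutnote">')
    after_env = '</text:span></text:p>'
    # Escape &, < and > so the fragment stays well-formed XML
    return before_env + escape(text) + after_env


print(greyedout_fragment("check this paragraph again"))
```

The style names in the fragment are the ones declared through \ConfigureOO, which is why the declaration and the environment configuration have to agree.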

## Configuring PoemTitle Environment for memoir

The difference between configuring lyxgreyedout and PoemTitle is that lyxgreyedout is a custom environment defined by us, as seen from the perspective of a tex file. PoemTitle, however, comes from the memoir class, so we need to insert hooks into it before we can style it inside tex4ht-ooffice.tex. While handling lyxgreyedout we didn't have to bother about inserting new hooks, as it is a custom environment from TeX's perspective, as pointed out earlier. I thank Michal Hoftich for guiding me on how to insert hooks; the entire credit for this subsection goes to Michal.

First we need to find out where the memoir class is redefined so that we can insert the hooks there, which is the file tex4ht-4ht.tex. We should also look at the memoir source code in memoir.cls to see which macros would be best to redefine. In the case of \PoemTitle, the code in memoir.cls is complex; a lot happens because of the table of contents, numbering etc. The title itself is printed with the macro \@makePoemTitlehead, and \printPoemTitlenum for the poem mark could be used to configure the poem number. \afterPoemTitlenum inserts an unwanted paragraph, so we also need to deal with it. Now we need to find the memoir section in the file tex4ht-4ht.tex; all command redefinitions happen in this file. I added the following code to the memoir cfg section of tex4ht-4ht.tex:

    \pend:defI\@makePoemTitlehead{\a:PoemTitle}
    \pend:defI\printPoemTitlenum{\a:PoemTitlenum}
    \append:defI\printPoemTitlenum{\b:PoemTitlenum}
    \NewConfigure{PoemTitle}{2}
    \NewConfigure{PoemTitlenum}{2}
    \let\afterPoemTitlenum\@empty
    \let\afterPoemTitle\@empty


We define configurable hooks with \NewConfigure commands. The first parameter is the name of the configuration, the second is the number of hooks which should be created. We need two of them: one for code inserted before the command, the second for code inserted after the command. These hooks are named \<a,b,c,...>:ConfigureName, so \a:PoemTitle and \b:PoemTitle were created in our case. In this simple case we can use the commands \pend:defI and \append:defI to prepend and append these hooks to the redefined command. We could also have used \def\@makePoemTitlehead#1{\a:PoemTitle#1\b:PoemTitle}, but we would lose the code executed in \@makePoemTitlehead in that case. Generate the output files with tex tex4ht-4ht.tex and copy the generated memoir.4ht into the same directory as the tex file that is about to be converted.
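
The difference between adding hooks to an existing macro and redefining it from scratch can be illustrated with a loose Python analogy. This is not how TeX4ht works internally; `make_poem_title_head` is a hypothetical stand-in for memoir's \@makePoemTitlehead, and the hook strings are placeholders for whatever \a:PoemTitle and \b:PoemTitle expand to.

```python
def make_poem_title_head(title):
    # Stand-in for memoir's own title formatting (hypothetical)
    return f"[memoir layout]{title}"


def add_hooks(command, before, after):
    """Keep the original command and surround its result with hooks."""
    def hooked(title):
        return before + command(title) + after
    return hooked


def override(before, after):
    """Replace the command entirely: the original body is lost."""
    def replaced(title):
        return before + title + after
    return replaced


hooked = add_hooks(make_poem_title_head, "<a:PoemTitle>", "<b:PoemTitle>")
replaced = override("<a:PoemTitle>", "<b:PoemTitle>")

print(hooked("Ode"))    # memoir's layout is still applied
print(replaced("Ode"))  # memoir's layout has disappeared
```

The first form corresponds to prepending and appending hooks, the second to the plain \def that discards the code in the original macro.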

Now that the hooks are in place, we can configure PoemTitle inside tex4ht-ooffice.tex. There was no Memoir subsection in tex4ht-ooffice.tex, so I added the following code to style PoemTitle.

    \subsection{Memoir}

    \<configure ooffice memoir\><<<
    |<memoir poemtitle|>
    >>>

    \<memoir poemtitle\><<<
    \NewConfigureOO{PoemTitle}
    \ConfigureOO{PoemTitle}{%
    <style:style style:name="poemtitle"
    style:family="text"
    style:class="chapter" >\Hnewline
    <style:text-properties
    fo:font-size="15pt"
    fo:font-weight="normal"
    style:font-size-asian="18pt"
    style:font-weight-asian="bold"
    style:font-size-complex="18pt"
    style:font-weight-complex="bold"
    fo:text-align="center"
    style:justify-single-word="false"/>
    </style:style> \Hnewline
    }
    \Configure{PoemTitle}{%
    \ifvmode\IgnorePar\fi\EndP

    \HCode{<text:span text:style-name="poemtitle">}\NoFonts}
    {\EndNoFonts\HCode{</text:span>\Hnewline}\\}

    >>>


"\ifvmode\IgnorePar\fi\EndP" closes the preceding paragraph, otherwise a tag mismatch happens; "\NoFonts" disables further styling of the title; "\\" starts a new \verseline. The rest of the styling should be self-explanatory after reading the previous subsection.

## Fixing Errors with Tables in Memoir

TeX4ht handles tables correctly in the article and book classes, but in memoir it gives an error. The following code needs to be added to the beginning of memoir.4ht to fix this error.

    % Tables Handling
    \input array.4ht
    \input dcolumn.4ht
    \input tabularx.4ht
    \input booktabs.4ht
    \let\columnlines\empty


If we don't write \let\columnlines\empty, we will get an error while handling math. We add this code to memoir.4ht because both memoir and tex4ht redefine a lot of standard macros. I read somewhere online that memoir also changes the standard table macros in a way which is incompatible with tex4ht, which is the main reason for loading these files from memoir.4ht. Doing this exposes a bug in tex4ht that causes errors while handling eqnarray and similar environments. Adding \let\columnlines\empty resolves this error, and conversion of \eqnarray and \array works correctly from then on.

## Configuring a PDF image converter

TeX4ht doesn't support graphics in pdf format, so we have to use a PDF to PNG converter and then instruct TeX4ht to use the generated PNG file instead of the PDF file. I added the following code to the \subsection{Graphics} part of tex4ht-ooffice.tex:

    \Configure{graphics*}
    {pdf}
    {\Needs{"convert \csname Gin@base\endcsname.pdf -scale
    \expandafter\dim\the\csname Gin@req@width\endcsname\space
    "\csname Gin@base\endcsname-\expandafter\the\csname Gin@req@width\endcsname.png"
    "}%
    \Picture[pict]{\csname Gin@base\endcsname-\expandafter\the\csname Gin@req@width\endcsname.png}%
    \special{t4ht+@File: \csname Gin@base\endcsname-\expandafter\the\csname Gin@req@width\endcsname.png}
    }
    {
    \def\dim{%
    \catcode`\p=12
    \catcode`\t=12
    \gdef\dim}
    \dim#1pt{#1}
    }


Gin@req@width is a macro from the graphicx package; it directly gives us the width of the final image after the scale given by the user has been applied. \Needs{} instructs TeX4ht to write its argument into the .lg file, which is read in the last step of TeX4ht, i.e. when t4ht is run. So we are converting the given pdf to a png named originalname-widthOfTheImage.png. We then need to tell tex4ht to use this newly generated png file, which I do with \Picture[pict]{}, giving it the name of the png image I want to be used.
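
The naming scheme and the command written to the .lg file can be mirrored in a few lines of Python. This is only a sketch of what the TeX code above assembles: `build_convert_command` is a hypothetical helper, and the width string is assumed to look the way TeX prints a dimension, for example "227.62204pt".

```python
def build_convert_command(base, width, ext="pdf"):
    """Return the ImageMagick command and the name of the generated png."""
    # width is the text TeX prints for \the\Gin@req@width, e.g. "227.62204pt"
    if not width.endswith("pt"):
        raise ValueError("expected a TeX dimension ending in 'pt'")
    scale = width[:-2]               # same job as the \dim helper: drop "pt"
    target = f"{base}-{width}.png"   # originalname-widthOfTheImage.png
    command = ["convert", f"{base}.{ext}", "-scale", scale, target]
    return command, target


command, target = build_convert_command("figure1", "227.62204pt")
print(" ".join(command))
print(target)
```

Because the target name contains the requested width, the same source image included at two different scales yields two separate png files.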

## Fixing the issue with incorrect sizes of images

I used an approach similar to that of the previous subsection to fix this issue. Instead of converting pdf to png, I converted the original png image to a new png image whose dimensions follow the specified scale. So the original filename.png is converted to filename-widthOfTheImage.png, which is in turn used by TeX4ht.

I added the following code below the \Configure{graphics*}{pdf}{} section inside tex4ht-ooffice.tex:

    \Configure{graphics*}
    {png}
    {\Needs{"convert \csname Gin@base\endcsname.png -scale
    \expandafter\dim\the\csname Gin@req@width\endcsname\space
    "\csname Gin@base\endcsname-\expandafter\the\csname Gin@req@width\endcsname.png"
    "}%
    \Picture[pict]{\csname Gin@base\endcsname-\expandafter\the\csname Gin@req@width\endcsname.png}}

    {
    \def\dim{%
    \catcode`\p=12
    \catcode`\t=12
    \gdef\dim}
    \dim#1pt{#1}
    }


This code is self-explanatory after reading the previous subsection on PDF conversion. Similar code needs to be added to tex4ht-ooffice.tex to fix sizing issues with other formats such as jpg, jpeg and gif.

## Issues with Math

In math there were two main issues. One was that \array, equations and tables with expressions got squashed into a single line. The other is that environments like eqnarray and align are not displayed correctly in the resultant ODT file. The first problem was fixed when we added \let\columnlines\empty to the memoir.4ht file; as I mentioned earlier, this was an accepted bug in TeX4ht. For the second issue, when I rendered the generated MathML code separately, the equations were displayed correctly. That means correct MathML code is being generated, which in turn means TeX4ht is doing nothing wrong. The source of the issue is that support for MathML is limited in Libre Office and Open Office.

For math I had to carry the LaTeX expression as it is into the content.xml of the resultant ODT file. I stored the math expression inside the annotation tag. The first task was to identify the math in the tex file unambiguously. LyX made this task simpler, as it keeps each formula inside a Formula inset. For example, the following is the LyX syntax for an inline formula:

    \begin_inset Formula $m$
    \end_inset


The following is the syntax for representing other kinds of math:

    \begin_inset Formula
    \begin{align}
    zw & =(3+2i)(2-i)\notag\\
     & =6-3i+4i-2i^{2}\notag\\
     & =8+i\notag
    \end{align}

    \end_inset


With this LyX syntax it is very easy to identify whether a formula is inline or not. In simple words, if a "\n" comes directly after \begin_inset Formula, so that the formula starts on its own line, it is not an inline formula. Now the main task is to put this formula appropriately into the tex file and make TeX4ht understand that it needs to go into the annotation tag. So I made a new package named roundtrip and defined two commands, one for handling inline math and one for other math expressions.
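
The inline test can be sketched in Python as follows. This is not the project's own script; `find_formulas` and the regular expression are illustrative names, and the sketch assumes that Formula insets are not nested and that an inline formula starts on the same line as \begin_inset Formula.

```python
import re

# Everything between "\begin_inset Formula" and the next "\end_inset"
INSET_RE = re.compile(r"\\begin_inset Formula(.*?)\\end_inset", re.DOTALL)


def find_formulas(lyx_text):
    """Return (formula, is_inline) pairs for every Formula inset, in order."""
    results = []
    for match in INSET_RE.finditer(lyx_text):
        body = match.group(1)
        # Text on the same line as "\begin_inset Formula" means inline math
        first_line = body.split("\n", 1)[0]
        is_inline = first_line.strip() != ""
        results.append((body.strip(), is_inline))
    return results


sample = (
    "\\begin_inset Formula $m$\n\\end_inset\n\n"
    "\\begin_inset Formula\n\\begin{align}\nx & =1\n\\end{align}\n\n\\end_inset\n"
)
for formula, is_inline in find_formulas(sample):
    print("inline" if is_inline else "display", repr(formula))
```

Returning the formulas in document order matters for any later step that looks them up in the exported tex file one after another.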

The following is the code I wrote inside roundtrip.sty

    \ProvidesPackage{roundtrip}
    \usepackage{fancyvrb}
    \usepackage{environ}
    \usepackage{cprotect}

    % environment name "metadata" taken from the description that follows
    \newenvironment{metadata}
    {\VerbatimEnvironment
    \begin{Verbatim}}
    {\end{Verbatim}}

    \newcommand{\inlinemeta}[1] {\Pre \texttt{#1} \Post}
    \cMakeRobust{\inlinemeta}


The new environment metadata handles multiline math like eqnarray, align etc., and the new command \inlinemeta handles inline math, as the name suggests.

Now the task is to put every math expression inside the tex file into one of these two constructs. So I wrote a Python script which is called after the LyX to LaTeX conversion and before mk4ht is called for the LaTeX to ODT conversion. This script first parses the LyX file and identifies all the math expressions, then opens the tex file, searches for each math expression inside it and replaces it as described below. Let's take an example to understand this better. Take the following math expression inside LyX:

    \begin_inset Formula $m$
    \end_inset


Now we extract $m$ from here, search for it in the tex file and replace it with $m$ \inlinemeta{\verb|$m$|} in the tex file. In the case of a multiline math expression, the following is the LyX syntax:

    \begin_inset Formula
    $\vec{p}$

    \end_inset


We extract this formula from the LyX file, search for it in the tex file and replace it with the formula followed by a verbatim copy inside the metadata environment:

    $\vec{p}$
    \begin{metadata}
    $\vec{p}$
    \end{metadata}
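
The search-and-replace step can be sketched in Python as follows. This is not parselyx.py itself: `wrap_formula` and `inject_metadata` are hypothetical names, and the exact wrappers, \inlinemeta{\verb|...|} for inline math and a metadata environment for everything else, are assumptions based on the definitions in roundtrip.sty. The sketch also assumes the formulas are given in document order and appear verbatim in the tex file.

```python
def wrap_formula(formula, is_inline):
    """Return the formula followed by a verbatim copy of itself."""
    if is_inline:
        # Assumes the formula contains no "|", the \verb delimiter used here
        return f"{formula} \\inlinemeta{{\\verb|{formula}|}}"
    return f"{formula}\n\\begin{{metadata}}\n{formula}\n\\end{{metadata}}"


def inject_metadata(tex_source, formulas):
    """Wrap each (formula, is_inline) pair at its next occurrence."""
    pieces = []
    pos = 0
    for formula, is_inline in formulas:
        idx = tex_source.find(formula, pos)
        if idx == -1:
            continue  # not found verbatim: leave this formula untouched
        pieces.append(tex_source[pos:idx])
        pieces.append(wrap_formula(formula, is_inline))
        pos = idx + len(formula)  # never rescan text that was just wrapped
    pieces.append(tex_source[pos:])
    return "".join(pieces)


print(inject_metadata("Let $m$ be the mass.", [("$m$", True)]))
```

Moving a cursor through the file, instead of replacing the first match each time, keeps a formula that occurs twice from being wrapped inside its own metadata copy.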


The Python script mentioned above does all of this and is called before mk4ht. The final task is to configure these commands in roundtrip.4ht, which is picked up automatically because we are using the roundtrip package. The following is the code in roundtrip.4ht:

    \:CheckOption{ooffice}\if:Option

    % environment name "metadata" taken from roundtrip.sty
    \ConfigureEnv{metadata}{%
    \HCode{<annotation encoding="\jobname-m\math:obj">}}
    {\ifvmode \IgnorePar\fi \EndP\HCode{</annotation> \Hnewline}\par}{}{}
    \NewConfigure{inlinemeta}[2]{%
    \newcommand\Pre{#1}%
    \newcommand\Post{#2}}
    \Configure{inlinemeta}{  \Configure{texttt}{\NoFonts}{\EndNoFonts}
    \Configure{verb}{}{}
    \Configure{obeylines}{}{}{}
    \HCode{<annotation encoding="\jobname-m\math:obj">}}
    {\HCode{</annotation> \Hnewline}}

    \fi


Here \jobname-m\math:obj is the key that links this annotation tag to the formula it belongs to in the content.xml file.
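
On the import side, the annotations can be read back with a short Python function. This is a sketch, not the odt2lyx script: `extract_annotations` is a hypothetical name, the sample XML is made up for illustration, and elements are matched by local name because the namespace under which the annotation tag ends up in content.xml is not stated here.

```python
import xml.etree.ElementTree as ET


def extract_annotations(xml_text):
    """Map each annotation's encoding key to the LaTeX source it carries."""
    root = ET.fromstring(xml_text)
    found = {}
    for elem in root.iter():
        # Drop a "{namespace}" prefix, if any, to get the local tag name
        local_name = elem.tag.rsplit("}", 1)[-1]
        if local_name == "annotation" and "encoding" in elem.attrib:
            found[elem.attrib["encoding"]] = (elem.text or "").strip()
    return found


# Made-up fragment; a real content.xml is far larger
sample_xml = (
    '<body xmlns:m="http://www.w3.org/1998/Math/MathML">'
    '<annotation encoding="paper-m1">$m$</annotation>'
    '<m:math><m:annotation encoding="paper-m2">x^2</m:annotation></m:math>'
    '</body>'
)
print(extract_annotations(sample_xml))
```

The keys follow the \jobname-m\math:obj pattern, so each recovered LaTeX string can be matched to its formula object.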

# Future Work

## LyX to ODT

• The Abstract environment is not yet converted properly by TeX4ht. I feel it's only a styling issue, as it is present in content.xml but absent when we open the ODT file with either Libre Office or Open Office.
• In some cases, such as a math super/subscript attached to a regular text expression, the align environment, the eqnarray environment and some cases of composite fractions, correct MathML code is produced but it is not displayed correctly in either Libre Office or Open Office.
• For images to be converted correctly, TeX4ht needs them in the same directory as the tex source file. The long-term solution is to "flatten" the LaTeX document structure into a single, temporary directory while exporting to ODT. This problem occurs when running mk4ht from the terminal. Read the next point for the bigger problem.
• I added an ODT export option to LyX and linked it to the Python script I wrote, which first converts LyX to LaTeX and then calls mk4ht for the LaTeX to ODT conversion. The script works when called directly from the terminal but doesn't work when called from LyX. When I looked at the files in the temp directory, all the image files with the correct sizing were present, but the .lg file linked to other, non-existent images.
• Some edited files like ooffice.4ht, memoir.4ht etc. are yet to be added to the TeX4ht source code. The other option is to place these files in the same temp directory while the LyX file is being converted.
• Improving parselyx.py, the script written to inject metadata into the tex file. This script needs to be tested on real-life documents and made more robust.

## ODT to LyX

• Cross-references and Bibliography are not yet configured for ODT import. This doesn't seem possible given the XML that TeX4ht currently generates. Many changes have to be made to the way these are styled in TeX4ht before import can work for these environments.
• The Poemtitle environment has a minor styling issue after LyX to ODT conversion.
• Metadata stored inside the ODT file in the form of annotations is lost once the ODT is modified and saved. As of now there is no way to stop this from happening, so in such cases I bring the math back in StarMath notation. There are two ways of tackling this problem. One is to work with the Libre Office/Open Office developers to find a way to keep these annotations even after the ODT file is modified. The other is to build a StarMath to LaTeX converter, which will be difficult to maintain.
• An interface needs to be built to access various styles. For example, I mapped ODT to memoir, but this should be extended to KomaScript etc. as well, and an interface should be provided to select the appropriate style.
• Image quality is affected by the roundtrip, because to fix the image sizing issue with TeX4ht I convert the original image to another image with the corresponding scale. (Have a look at the image sizing subsection in General Documentation.)
• Subfloat only works semantically because of a lack of information in content.xml. Suppose there are two subfloats inside a float in the main LyX file; after the roundtrip there will be two images with two captions inside a single float.
• Default settings are used in tables, i.e. the width of the columns is fixed to 1in and the topline/bottomline for each row is set to a fixed value.
• Because of some formatting mistakes made by TeX4ht, there are formatting errors such as extra spaces in the resultant LyX file. As the previous two points also indicate, the focus has been on building a semantic conversion tool. In some cases the issue is caused by a lack of information in content.xml. Changes can be made in the TeX4ht styling, but this will not resolve the issue, because once the file is modified and saved, Open Office or Libre Office changes the whole structure of all the XML files. This structure has to be looked into and the conversion needs to be configured further.