Styling Microsoft Word output
Summary
Word output in the document toolset is generated through Word HTML, the variant of HTML that you get when you save a Word document as HTML. (That is why documents are saved in .doc, not .docx.) This has the advantage over OOXML, the native markup of DOCX, of using a well-known markup language, with a low barrier to entry: if you want to work out how to do something in Word HTML, do it in Word, save the document as HTML, and open up the HTML in a text editor. (For more on the choice of using Word HTML, see https://github.com/metanorma/html2doc/wiki/Why-not-docx%3F.)
However, Word HTML is not quite the HTML you are used to: it is a restricted, syntactically idiosyncratic variant of HTML 4, with a non-standard and weakened form of CSS. Doing any styling in Word HTML involves lots of trial and error, and paying close attention to how Word HTML does things in its CSS. We have documented a few of the clearer gotchas in https://github.com/metanorma/html2doc/blob/main/README.adoc.
It’s still better than learning OOXML.
The process for generating Word output is fairly similar to that for generating HTML, since both processes are generating a form of HTML; as we already noted, the two processes share a substantial amount of code. The main differences are in the handling of page-media features that CSS has lagged in (footnotes, pagination, headers and footers), and in the styling of lists, for which Word HTML uses custom (and undocumented) CSS classes prefixed with @, specifying inter alia the numbering for nine levels of nesting of the same list.
- Styling information is stored in the …/lib/isodoc/html folder of the gem, and applies to both Word and HTML content. For Word content, the relevant files are word_…titlepage.html (title page HTML template), word…_intro.html (introductory HTML template, typically restricted to Table of Contents), wordstyle.css and {name_of_standard}.css (the Word stylesheets), and header.html (document headers, footers, and endnote/footnote separators, referenced from the stylesheets).
- The title and introductory files may contain VML graphics, the tags for which are prefixed in Word HTML by v:; this applies in particular to text boxes (e.g. <v:textbox>). For any such prefixed tags to be recognized properly by the XML processor in Metanorma, the HTML content of the files must be in XHTML. Unfortunately, the Word output is anything but XHTML. To make it compliant:
  - Any HTML escapes must be replaced by Unicode entities (numeric character references such as &#xa0; in place of &nbsp;)
  - Any standalone tags must be closed; e.g. <br> becomes <br/>
  - Any attributes must be quoted; e.g. <p class=MsoNormal> becomes <p class="MsoNormal">
  - Idiosyncratic Word HTML comments, used for fallback HTML if the HTML is not rendered in Word, are not well-formed XML and must be removed (e.g. <![if !mso]>, <![endif]>)
- The styling files to be loaded in are set in the default_file_locations() method of IsoDoc::…::WordConvert.
- As with HTML generation, additional files (e.g. logos) can be loaded in the initialize() method of IsoDoc::…::WordConvert. The initialize() method also sets the @ styles in the stylesheet to be used for unordered and ordered lists; a single such style is intended to capture the behaviour of all levels of indentation.
- As with HTML output, the HTML templates are populated through Liquid templates: variables in {{ }} correspond to the hash keys for metadata extracted in IsoDoc::…::Metadata, and its superclass IsoDoc::Metadata in the isodoc gem. See Metadata and Predefined text for more information.
- As with HTML, the SCSS stylesheets treat fonts as variables, which are set in the default_fonts() method of IsoDoc::…::WordConvert.
- Document headers and footers are set in the header.html file. This is also an HTML template, which is populated with metadata attributes through Liquid templates. The structure of header.html is determined by Word, and elements of header.html need to be cross-referenced from the Word stylesheet. To inspect Word header.html files, save a Word document as HTML, and look inside the {document_name}.fld folder generated alongside the HTML output.
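The XHTML clean-up rules listed above (entities, standalone tags, attribute quoting, Word conditional comments) can be sketched in Ruby roughly as follows. This is a minimal illustration under assumptions, not code from the toolset: word_html_to_xhtml, NAMED_ENTITIES and VOID_TAGS are hypothetical names, the entity table is deliberately tiny, and regular expressions are only a rough substitute for careful hand-editing of a template.

```ruby
# Hypothetical helper (not part of isodoc or html2doc): a rough pass over a
# Word HTML template fragment to bring it closer to well-formed XHTML.

# Illustrative table of named escapes; a real template may need more entries.
NAMED_ENTITIES = {
  "&nbsp;" => "&#xa0;",
  "&copy;" => "&#xa9;",
  "&ndash;" => "&#x2013;",
}.freeze

# Standalone (void) elements that saved Word HTML tends to leave unclosed.
VOID_TAGS = %w[br hr img meta link].freeze

def word_html_to_xhtml(html)
  # 1. Remove Word's conditional pseudo-comments, which are not well-formed XML.
  out = html.gsub(/<!\[if [^\]]*\]>/, "").gsub(/<!\[endif\]>/, "")

  # 2. Replace named HTML escapes by numeric character references.
  NAMED_ENTITIES.each { |name, numeric| out = out.gsub(name, numeric) }

  # 3. Quote bare attribute values inside each tag:
  #    <p class=MsoNormal> becomes <p class="MsoNormal">.
  out = out.gsub(/<[^<>]+>/) do |tag|
    tag.gsub(/(\s[\w:-]+)=([^\s"'<>]+?)(?=\s|\/?>)/, '\1="\2"')
  end

  # 4. Close standalone tags: <br> becomes <br/>.
  out.gsub(/<(#{VOID_TAGS.join('|')})\b([^<>]*?)\s*\/?>/i, '<\1\2/>')
end

fragment = '<p class=MsoNormal>A&nbsp;B<br><![if !mso]>x<![endif]></p>'
puts word_html_to_xhtml(fragment)
```

The attribute-quoting step is the most fragile one: it can misfire on quoted values that themselves contain whitespace followed by name=value text, so the result is worth checking with an XML parser before the template is used.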
A Word HTML document is populated as follows:
* HTML Head wrapper (in IsoDoc::WordConvert)
** @wordstylesheet CSS stylesheet (generated from SCSS through the generate_css() method of IsoDoc::WordConvert); corresponds to wordstyle.css.
** @standstylesheet CSS stylesheet (generated from SCSS through the generate_css() method of IsoDoc::WordConvert); intended to override any generic CSS in @wordstylesheet. Optional, corresponds to {name_of_standard}.css.
* HTML Body
** @wordcoverpage HTML template (optional, corresponds to word_…titlepage.html). Included in <div class="WordSection1">.
** @htmlintropage HTML template (optional, corresponds to word…_intro.html). Included in <div class="WordSection2">. In the existing gems, WordSection2 is paginated with roman numerals.
** Document proper (converted from Metanorma XML). Included in <div class="WordSection2"> (prefatory material) and <div class="WordSection3"> (main document). In the existing gems, WordSection3 is paginated with Arabic numerals.
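The assembly order described in this list can be pictured with a short Ruby sketch. It is only an illustration of the ordering, written under assumptions: assemble_word_html and its keyword parameters are hypothetical names, not isodoc API, and the real head wrapper carries Word-specific namespaces and settings that are omitted here.

```ruby
# Hypothetical sketch (not isodoc code): puts the pieces of a Word HTML
# document together in the order given in the list above.
def assemble_word_html(wordstylesheet:, body_preface:, body_main:,
                       standstylesheet: nil, coverpage: nil, intropage: nil)
  # The standard-specific CSS follows the generic Word CSS so that it can
  # override it; it is optional, hence the compact.
  css = [wordstylesheet, standstylesheet].compact.join("\n")

  # WordSection2 holds the optional intro template, then the prefatory
  # material of the converted document.
  section2 = [intropage, body_preface].compact.join("\n")

  <<~HTML
    <html>
    <head>
    <style>
    #{css}
    </style>
    </head>
    <body>
    <div class="WordSection1">#{coverpage}</div>
    <div class="WordSection2">#{section2}</div>
    <div class="WordSection3">#{body_main}</div>
    </body>
    </html>
  HTML
end

puts assemble_word_html(
  wordstylesheet: "p.MsoNormal { font-size: 11pt; }",
  standstylesheet: "p.MsoNormal { font-size: 12pt; }",
  intropage: "<p>Table of Contents</p>",
  body_preface: "<p>Foreword</p>",
  body_main: "<p>Scope</p>"
)
```

Keeping the cover page, the prefatory material and the main document in three separate div elements is what lets the stylesheet give each section its own pagination and headers.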