This extension evolved from my earlier macros for automating document editing. The tools are intended for editors, typesetters, and other users who want to reduce the time spent cleaning documents. The extension also provides a validation function for documents intended for publication in HTML. These functions are packaged in a single extension for user convenience.

Cleaning

By default, cleaning runs in simple mode, which requires no additional configuration and goes through the standard cleaning procedures step by step. In advanced mode, users can choose which steps to apply and which to skip.

Before cleaning starts, the cleaning function turns off change tracking, because it prevents the document from being cleaned correctly.

It is also good practice to save the document as an ODT file before cleaning, so the document is saved right before cleaning starts. If you opened the document in a format other than ODT, you will see a save file dialog. After cleaning is completed, the document is saved and reloaded once more for safety.

Simple cleaning

Simple cleaning is intended for experienced users who need to perform an initial cleaning of a document intended for publication. It is a set of cleaning procedures that balances simplicity against the best possible cleaning results. In this mode, the following procedures apply by default:

Font substitutions in styles

These substitutions in styles are applied by default to eliminate copyrighted fonts and fonts with minor flaws.

The latest version of the IPH Astra Serif font can be downloaded via the link.

The deprecated font “IPH Lib Serif” is replaced by “IPH Astra Serif”.

“Liberation Serif” is replaced by “IPH Astra Serif”, which is easier to read.

“PTSerif” is replaced by “IPH Astra Serif”, which includes diacritical marks.

The deprecated font “ArabicD” is replaced by “IPH Astra Serif”.

The frequently used copyrighted font “Palatino Linotype Greek” is replaced by “Tinos”.
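
The substitutions listed above amount to a simple lookup table. The following is a minimal sketch in Python, not the extension's actual code: the names FONT_SUBSTITUTIONS and substitute_font are hypothetical, and only the font names come from the list above.

```python
# Substitution table copied from the list above.
# FONT_SUBSTITUTIONS and substitute_font are hypothetical names used for
# illustration only; they are not taken from the extension.
FONT_SUBSTITUTIONS = {
    "IPH Lib Serif": "IPH Astra Serif",
    "Liberation Serif": "IPH Astra Serif",
    "PTSerif": "IPH Astra Serif",
    "ArabicD": "IPH Astra Serif",
    "Palatino Linotype Greek": "Tinos",
}


def substitute_font(font_name: str) -> str:
    """Return the replacement font, or the same name if no rule applies."""
    return FONT_SUBSTITUTIONS.get(font_name, font_name)
```

A font that is not in the table, such as “Tinos” itself, is returned unchanged.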

Font substitutions for Unicode ranges

For the Latin and Cyrillic alphabets, the “IPH Astra Serif” font is applied.

For Arabic characters, the “Scheherazade” font is applied.

For Greek characters, the “Tinos” font is applied.

For math operators, the “DejaVu Sans” font is applied.

For Chinese, Japanese, and Korean characters, the “Noto Serif CJK JP” and “Noto Serif CJK SC” fonts are applied.
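
One way to picture the range-based rules above is a table of code point ranges mapped to fonts. The sketch below is an illustration under stated assumptions, not the extension's implementation: the exact range boundaries are assumed (standard Unicode blocks), the CJK range covers only the CJK Unified Ideographs block, and the names RANGE_FONTS and font_candidates are hypothetical. How the extension chooses between the two Noto fonts is not described here, so both are returned as candidates.

```python
# Each entry: (first code point, last code point, candidate fonts).
# The range boundaries are assumptions based on standard Unicode blocks;
# the source only names the scripts, not the exact ranges.
RANGE_FONTS = [
    (0x0000, 0x024F, ("IPH Astra Serif",)),   # Basic Latin .. Latin Extended-B
    (0x0370, 0x03FF, ("Tinos",)),             # Greek and Coptic
    (0x0400, 0x052F, ("IPH Astra Serif",)),   # Cyrillic, Cyrillic Supplement
    (0x0600, 0x06FF, ("Scheherazade",)),      # Arabic
    (0x2200, 0x22FF, ("DejaVu Sans",)),       # Mathematical Operators
    (0x4E00, 0x9FFF, ("Noto Serif CJK JP", "Noto Serif CJK SC")),  # CJK ideographs
]


def font_candidates(char: str) -> tuple:
    """Return the candidate fonts for one character, or () if no rule applies."""
    code_point = ord(char)
    for first, last, fonts in RANGE_FONTS:
        if first <= code_point <= last:
            return fonts
    return ()
```

Characters outside every listed range return an empty tuple, meaning no substitution rule applies in this sketch.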

Direct formatting cleaning

All direct formatting is removed, except the following: bold, italic, underline, strikethrough, superscript, subscript, and character spacing with values of 0.5, 1, 1.5, or 2.

White background in text removal

A white background is invisible in Writer but can be visible in HTML, so it is replaced with a transparent background.

Unused styles removal

Unused styles have no value and get in the way when scrolling through the styles tab, so it is usually a good choice to remove them.

Hyperlinks removal

The document should contain neither hidden links nor links pasted by mistake from web pages. All hyperlinks are removed at this stage.

Bookmarks removal

All bookmarks are removed at this stage.

Table configuration

Tables should have a relative width so that they adapt to different screens. At this stage, table width settings are set to relative.

Image anchors

As there is only one real page in HTML, images shouldn’t be anchored “to page”. At this stage, all images anchored “to page” are re-anchored “to paragraph”.

Typing mistake fixes

Tab characters can’t be converted to HTML, so they are removed.

Runs of multiple spaces are replaced by a single space.

All leading spaces in paragraphs are removed.

All spaces at the end of paragraphs are removed.

All empty paragraphs are removed.

All spaces before punctuation marks are removed.

Spaces after opening brackets are removed.

A space is added between text and an opening angle bracket.

A space is added between a closing angle bracket and text.

In text, the hyphen-minus, figure dash, and em dash are replaced with an en dash.

In text, spaces are added before and after an en dash.

Spaces between numbers and dashes are removed, and the dash is set to a figure dash.

The space between “N.” and “Y.” is also removed: N. Y. → N.Y.
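
Several of the spacing rules above can be expressed as regular expressions. The sketch below is illustrative only and covers a subset of the rules (tabs, repeated spaces, leading and trailing spaces, spaces before punctuation, spaces after opening brackets, and empty paragraphs). The function names fix_paragraph and fix_paragraphs, the set of punctuation marks, and the set of brackets are assumptions; the dash and angle bracket rules are left out.

```python
import re


def fix_paragraph(text: str) -> str:
    """Apply a subset of the spacing rules to one paragraph (hypothetical helper)."""
    text = text.replace("\t", "")                 # tab characters are removed
    text = re.sub(r" {2,}", " ", text)            # runs of spaces become one space
    text = text.strip(" ")                        # leading and trailing spaces
    text = re.sub(r" +([.,;:!?])", r"\1", text)   # spaces before punctuation (assumed set)
    text = re.sub(r"([(\[{]) +", r"\1", text)     # spaces after opening brackets
    return text


def fix_paragraphs(paragraphs):
    """Fix every paragraph and drop the ones that end up empty."""
    fixed = (fix_paragraph(p) for p in paragraphs)
    return [p for p in fixed if p]
```

The order of the steps matters in this sketch: tabs are removed first so that the spaces around them can then be collapsed and trimmed.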

The following rules apply to Russian:

The space between initials is removed, as shown below:

А.[possible space]А. Иванов → А.А. Иванов

Иванов А.[possible space]А. → Иванов А.А.

Spaces are also removed in the following abbreviations:

и т. д. → и т.д.

и т. п. → и т.п.

т. к. → т.к.

т. е. → т.е.

т. н. → т.н.

The letters и/И followed by a «combining breve» are replaced with й/Й.

The letters е/Е followed by a «combining diaeresis» are replaced with ё/Ё.
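
The Russian rules above can also be sketched with regular expressions and plain string replacement. This is a minimal illustration, not the extension's code: the function name fix_russian and the exact patterns are assumptions. The combining breve is U+0306 and the combining diaeresis is U+0308.

```python
import re

# Base letter + combining mark -> precomposed letter.
COMPOSED = {
    "\u0438\u0306": "\u0439",  # и + combining breve -> й
    "\u0418\u0306": "\u0419",  # И + combining breve -> Й
    "\u0435\u0308": "\u0451",  # е + combining diaeresis -> ё
    "\u0415\u0308": "\u0401",  # Е + combining diaeresis -> Ё
}


def fix_russian(text: str) -> str:
    """Apply the Russian-specific fixes to a piece of text (hypothetical helper)."""
    for decomposed, composed in COMPOSED.items():
        text = text.replace(decomposed, composed)
    # Remove spaces between two initials: "А. А. Иванов" -> "А.А. Иванов".
    text = re.sub(r"\b([А-ЯЁ]\.) +(?=[А-ЯЁ]\.)", r"\1", text)
    # Remove spaces inside "т. д.", "т. п.", "т. к.", "т. е.", "т. н.".
    text = re.sub(r"\bт\. +([дпкен]\.)", r"т.\1", text)
    return text
```

For example, fix_russian("Иванов А. А.") gives "Иванов А.А.". The word boundary at the start of each pattern keeps the rules from firing inside longer words.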

Manual page break at document start removal

A manual page break placed at the start of the document is invisible and usually useless. On the other hand, it can cause problems at the layout stage. That’s why removing it is recommended in most cases.

Custom page styles removal

All custom page styles are removed. Page styles should be defined

Loading styles from template

It is good practice for all documents to have the same look, as our eyes get used to text formatting and fonts. This stage loads predefined styles from a template document. The predefined styles replace the document’s initial style definitions. This relies on style naming conventions.

If the input document has custom styles that are not defined in the template, they are not changed. In that case, styles should be assigned by hand after cleaning.

Basic macro removal

Usually no macros are needed in articles or books, as input documents should by default contain only text and images. This stage removes macros from the document.

Advanced mode

Advanced mode is intended for professionals and advanced users and lets you select which cleaning stages to apply and which to skip. In this mode you also have access to the additional cleaning stages described below.

Manual page breaks removal

If no manual page breaks are allowed in the document, they can all be removed with this function.

Validation

Is the document eligible for HTML publishing? That depends on various aspects, such as character codes and table and image settings. Much of this is not visible to the editor until the HTML export is done and the document is tested on various devices. This validation function was created to eliminate most of the problems that occur with exported documents. It saves a lot of testing time and makes the HTML layout process much more stable and reliable. The checks currently performed by this function are described below.

Check for symbols in text

Symbols are checked for membership in the Unicode “private use area”. Symbols from the “private use area” are not recommended for HTML export, as their correct display depends on the availability of the original font. It is better to replace symbols from that area with standardized Unicode symbols.
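
A check of this kind can be sketched with Python's standard unicodedata module, where private use characters have the general category "Co". This is only an illustration of the idea, not the extension's code, and the function name find_private_use is hypothetical.

```python
import unicodedata


def find_private_use(text: str):
    """Return (index, code point) pairs for private use characters in text."""
    return [
        (index, f"U+{ord(char):04X}")
        for index, char in enumerate(text)
        if unicodedata.category(char) == "Co"  # "Co" = Other, private use
    ]
```

An empty result means the text contains no private use characters. The same check could in principle be applied to footnote symbols and numbering markers.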

Footnote symbols check

Footnote symbols are also checked for membership in the “private use area”. As such symbols are not recommended for HTML export, it is better to replace them with standardized symbols.

Numbering styles check

Numbering markers are also checked for membership in the “private use area”. As such symbols are not recommended for HTML export, it is better to replace them with standardized symbols. In advanced mode, additional information is provided.

Check for drawings and embedded objects

Currently, conversion to HTML supports neither drawings made in Writer nor any embedded objects except formulas. Supported image formats are JPEG, PNG, TIF, and SVG.
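
As a rough illustration of the format rule above, a file name can be tested against the supported formats by its extension. This sketch is an assumption about how such a check might look: the extension spellings (.jpg, .jpeg, .tif, .tiff) and the names SUPPORTED_EXTENSIONS and is_supported_image are hypothetical, and the real validation may inspect the image data rather than the file name.

```python
from pathlib import PurePath

# File extensions assumed to correspond to JPEG, PNG, TIF, and SVG.
SUPPORTED_EXTENSIONS = {".jpg", ".jpeg", ".png", ".tif", ".tiff", ".svg"}


def is_supported_image(filename: str) -> bool:
    """Return True if the file extension matches a supported image format."""
    return PurePath(filename).suffix.lower() in SUPPORTED_EXTENSIONS
```

The comparison is case-insensitive, so “figure.PNG” and “figure.png” are treated the same.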

Extension installation

This extension can be downloaded from the LibreOffice Extensions website: https://extensions.libreoffice.org/extensions/clean-and-validate-for-publishing-with-pagination

To enable advanced mode, the ePublishing extension must be installed. It is also available on the LibreOffice Extensions website: https://extensions.libreoffice.org/extensions/epublishing

Advanced mode can be enabled via the menu ePublishing → «Configure cleaning».