This extension evolved from my earlier macros for automating document editing. The tools are intended for editors, layout designers and other users who want to reduce the time spent cleaning documents. The extension also provides a validation function for documents intended for publication in HTML. These functions are packaged in a single extension for convenience.
Cleaning
By default, cleaning runs in simple mode, which requires no additional configuration and goes through the standard cleaning procedures step by step. In advanced mode, users can choose which steps to apply and which to skip.
Before cleaning, the cleaning function turns off change tracking, because it prevents the document from being cleaned correctly.
It is also good practice to save the document as an ODT file before cleaning. That is why the document is saved immediately before cleaning. If you opened the document in a format other than ODT, you will see a save file dialogue. After cleaning is completed, the document is saved and reloaded once again for safety reasons.
Simple cleaning
Simple cleaning is intended for experienced users who need to perform an initial cleaning of a document intended for publication. It is a set of cleaning procedures that balances simplicity against the best possible cleaning results.
After cleaning is complete, the document properties store information about the date of the cleaning, the add-on version, the author's name, and the LibreOffice version. In addition, versions of the document retain the current state of the document.
In this mode, the following procedures are applied by default:
Font substitutions in styles
These substitutions in styles are applied by default to eliminate copyrighted fonts and fonts with minor flaws.
The latest IPH Astra Serif font can be downloaded by following the link.
The deprecated font “IPH Lib Serif” is replaced by “IPH Astra Serif”.
“Liberation Serif” is replaced by “IPH Astra Serif”, which is easier to read.
PTSerif is replaced by “IPH Astra Serif”, which supports diacritical marks.
The deprecated font ArabicD is replaced by “IPH Astra Serif”.
The frequently used copyrighted font “Palatino Linotype Greek” is replaced by “Tinos”.
Font substitutions for Unicode ranges
For the Latin and Cyrillic alphabets, the “IPH Astra Serif” font is applied
For Arabic characters, the “Scheherazade” font is applied
For Greek characters, the “Tinos” font is applied
For math operators, the “DejaVu Sans” font is applied
For Chinese, Japanese and Korean characters, the “Noto Serif CJK JP” and “Noto Serif CJK SC” fonts are applied
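The per-range substitution above can be pictured as a lookup from code point to font name. The following is only a minimal sketch: the table name `FONT_RANGES`, the function `font_for`, and the exact range boundaries are assumptions made for illustration, not the extension's actual table. In particular, how the two Noto fonts are split between scripts is not stated in this document, so the split below is a guess.

```python
# Illustrative table: (first code point, last code point, font name).
# The boundaries follow standard Unicode blocks; the extension may use others.
FONT_RANGES = [
    (0x0000, 0x024F, "IPH Astra Serif"),    # Basic Latin .. Latin Extended-B
    (0x0370, 0x03FF, "Tinos"),              # Greek and Coptic
    (0x0400, 0x052F, "IPH Astra Serif"),    # Cyrillic, Cyrillic Supplement
    (0x0600, 0x06FF, "Scheherazade"),       # Arabic
    (0x1F00, 0x1FFF, "Tinos"),              # Greek Extended
    (0x2200, 0x22FF, "DejaVu Sans"),        # Mathematical Operators
    (0x3040, 0x30FF, "Noto Serif CJK JP"),  # Hiragana, Katakana (assumed split)
    (0x4E00, 0x9FFF, "Noto Serif CJK SC"),  # CJK Unified Ideographs (assumed split)
]


def font_for(ch):
    """Return the font assigned to a character, or None if no range matches."""
    code_point = ord(ch)
    for first, last, font in FONT_RANGES:
        if first <= code_point <= last:
            return font
    return None
```

A character outside every listed range returns `None`, which a caller could treat as "leave the font unchanged".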
Direct formatting cleaning
All direct formatting is removed, except the following: bold, italic, underline, strikethrough, superscript, subscript, and character spacing with values of 0.5, 1, 1.5 and 2.
White text background removal
A white background is invisible in Writer but can be visible in HTML, so it is replaced with a transparent background.
Unused styles removal
Unused styles have no value and clutter the style list when scrolling through the styles tab, so it is usually a good choice to remove them.
Hyperlinks removal
The document should contain neither hidden links nor links pasted by mistake from web pages. All hyperlinks are removed at this stage.
Bookmarks removal
All bookmarks are removed at this stage.
Table configuration
Tables should have a relative width so that they adapt to different screens. At this stage, table width settings are set to relative.
Image anchors
Because HTML has only one real page, images shouldn’t be anchored “to page”. For that reason, all images anchored “to page” are re-anchored “to paragraph” at this stage.
Typing mistake fixes
Tabs can’t be converted to HTML, so they are removed.
Runs of multiple spaces are replaced by a single space.
All leading spaces in paragraphs are removed.
All spaces at the end of paragraphs are removed.
All empty paragraphs are removed.
All spaces before punctuation marks are removed.
Spaces after opening brackets are removed.
A space is added between text and an opening angle bracket.
A space is added between a closing angle bracket and text.
In text, the hyphen-minus, figure dash and em dash are replaced with an en dash.
In text, spaces are added before and after an en dash.
(Update: disabled because it causes unwanted replacements in links and DOIs) Spaces between numbers and dashes are removed, and the dash is set to a figure dash.
The space between N. and Y. is also removed: N. Y. → N.Y.
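Several of the whitespace rules above can be expressed as regular expressions. The sketch below is an illustration only: it works on a plain list of paragraph strings rather than on a Writer document, the function name `fix_typing` is hypothetical, and the punctuation and bracket sets are assumptions. Tabs are replaced by a space here so that neighbouring words do not merge, which is also an assumption, since the text only says that tabs are removed. The dash and angle bracket rules are left out.

```python
import re


def fix_typing(paragraphs):
    """Apply a subset of the whitespace rules to a list of paragraph strings."""
    cleaned = []
    for text in paragraphs:
        text = text.replace("\t", " ")                # tabs (assumed: become a space)
        text = re.sub(r" {2,}", " ", text)            # collapse runs of spaces
        text = text.strip(" ")                        # leading and trailing spaces
        text = re.sub(r" +([,.;:!?])", r"\1", text)   # spaces before punctuation
        text = re.sub(r"([(\[{]) +", r"\1", text)     # spaces after opening brackets
        if text:                                      # drop empty paragraphs
            cleaned.append(text)
    return cleaned
```

For example, `fix_typing(["  Hello ,  world !\t", "", "( text)"])` yields two paragraphs, with the empty one dropped.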
The following rules apply to Russian:
The space between initials is removed, as shown below:
А.[possible space]А. Иванов → А.А. Иванов
Иванов А.[possible space]А. → Иванов А.А.
Spaces are also removed in the following abbreviations:
и т. д. → и т.д.
и т. п. → и т.п.
т. к. → т.к.
т. е. → т.е.
т. н. → т.н.
The letters и/И followed by a «combining breve» are replaced with й/Й
The letters е/Е followed by a «combining diaeresis» are replaced with ё/Ё
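The Russian rules above can be sketched in a few lines. This is a minimal illustration on a plain string, not the extension's code: the function name `fix_russian` and the exact patterns are assumptions, and the initials pattern handles only two initials separated by whitespace.

```python
import re

# Spaced abbreviations listed above; the fixed form drops the inner space.
ABBREVIATIONS = ["и т. д.", "и т. п.", "т. к.", "т. е.", "т. н."]

# Base letter + combining mark -> precomposed letter.
COMPOSED = {
    "\u0438\u0306": "\u0439",  # и + combining breve -> й
    "\u0418\u0306": "\u0419",  # И + combining breve -> Й
    "\u0435\u0308": "\u0451",  # е + combining diaeresis -> ё
    "\u0415\u0308": "\u0401",  # Е + combining diaeresis -> Ё
}

# Two uppercase Cyrillic initials (А-Я or Ё) separated by whitespace.
INITIALS = re.compile(r"\b([\u0410-\u042F\u0401])\.\s+([\u0410-\u042F\u0401])\.")


def fix_russian(text):
    """Apply the Russian-specific replacements to one string."""
    for decomposed, composed in COMPOSED.items():
        text = text.replace(decomposed, composed)
    for spaced in ABBREVIATIONS:
        compact = spaced.replace(". ", ".")
        text = re.sub(r"\b" + re.escape(spaced), compact, text)
    return INITIALS.sub(r"\1.\2.", text)
```

An explicit replacement table is used instead of full Unicode normalization so that only the two letter pairs named above are touched.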
Manual page break at document start removal
A manual page break placed at the start of the document is invisible and usually useless. However, it can cause problems at the layout stage, which is why its removal is recommended in most cases.
Custom page styles removal
All custom page styles are removed. Page styles should be defined
Loading styles from template
It is good practice for all documents to share the same look, as our eyes get used to text formatting and fonts. This stage loads predefined styles from a template document. The predefined styles replace the document’s original style definitions. This relies on style naming conventions.
If the input document has custom styles that are not defined in the template, they won’t be changed. In that case, styles should be assigned by hand after cleaning.
Basic macro removal
Articles and books usually need no macros, as input documents should by default contain only text and images. This stage removes Basic macros from the document.
Advanced mode
Advanced mode is intended for professionals and advanced users and lets them select which cleaning stages to apply or skip.
To enable advanced mode, the ePublishing extension should be installed; its latest version is available here or at the LibreOffice Extensions website https://extensions.libreoffice.org/extensions/epublishing. When installing from the LibreOffice Extensions website, be sure to update the extension, since the versions published there are often not the latest. You can update an extension through the menu Tools → Manage extensions → Check for updates.
Advanced mode can be enabled via the menu ePublishing → «Configure cleaning».
In this mode you also have access to the additional cleaning stages described below.
Manual page breaks removal
If no manual page breaks are allowed in the document, they can all be removed with this function.
Resetting chapter numbering settings
In the chapter numbering settings, you can set the text that will be displayed at the beginning and at the end of the headings. This procedure removes this text from the chapter numbering settings and sets the character style of that text to None.
Validation
Is a document eligible for HTML publishing? It depends on various aspects, such as character codes and table and image settings. Much of this is not visible to the editor until the HTML export is done and the document is tested on various devices. This validation function was made to eliminate most of the problems that occur with exported documents. It saves a lot of testing time and makes the HTML layout process much more stable and reliable. The checks currently performed by this function are described below.
Check for symbols in text
Symbols are checked for membership in the Unicode “private use area”. Symbols from the “private use area” are not recommended for HTML export, as their correct display depends on the availability of the original font. It is better to replace symbols from that area with standardized Unicode symbols.
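A private use area check of this kind can be sketched with Python's standard `unicodedata` module, which reports the general category `Co` for private-use code points. The function name `find_private_use` is hypothetical, and the sketch scans a plain string rather than a Writer document.

```python
import unicodedata


def find_private_use(text):
    """Return (index, code point label) for every private-use character in text."""
    return [
        (index, f"U+{ord(ch):04X}")
        for index, ch in enumerate(text)
        if unicodedata.category(ch) == "Co"  # "Co" = Other, private use
    ]
```

Each reported position points at a character that should be replaced with a standardized Unicode symbol before export.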
Footnote symbols check
Footnote symbols are also checked for membership in the “private use area”. As such symbols are not recommended for HTML export, it is better to replace them with standardized symbols.
Numbering styles check
Numbering markers are also checked for membership in the “private use area”. As such symbols are not recommended for HTML export, it is better to replace them with standardized symbols. In advanced mode, additional information is provided.
Check for drawings and embedded objects
Currently, conversion to HTML supports neither drawings made in Writer nor any embedded objects except formulas. Supported image formats are JPEG, PNG, TIF and SVG.
Outline validation
In most cases it is not appropriate to use headings in footnotes, endnotes, headers and footers, or tables, as this can make it difficult to split the document by headings later. Because LibreOffice Writer technically allows this kind of formatting, this stage verifies that it is not present in the current document.
Report about font symbols in PDF
When exporting to PDF, it can be important to track the names of the fonts that end up in the PDF. Further publication of such a document may entail claims from the copyright holders of the fonts if the publisher does not have a license to use them.
This function creates a list of the symbols of a font present in the PDF document, along with the page of their first occurrence. This information can significantly help to get rid of undesirable fonts in the PDF by replacing the font names for those symbols in the source document.
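The shape of such a report can be sketched as follows. The input format is an assumption made for illustration: a sequence of hypothetical `(page, font_name, character)` records in document order, which is not how the extension reads the PDF. The function name `first_occurrences` is also hypothetical.

```python
def first_occurrences(glyphs, font_name):
    """Map each character set in font_name to the page where it first appears.

    glyphs: iterable of (page, font_name, character) records in document order
    (an assumed input format, for illustration only).
    """
    report = {}
    for page, font, ch in glyphs:
        if font == font_name and ch not in report:
            report[ch] = page
    return report
```

Because the records are assumed to be in document order, the first page stored for a character is also the earliest one.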
To use this function, open the PDF document with LibreOffice Draw. Then click the toolbar icon shown in the heading of this section. The function will analyse the document and let you choose a font name to report on. At the end of the process, you will see a new document with the report.
Extension installation
You can download the latest release of this extension here (cleanAndValidate.oxt).
This extension can also be downloaded from the LibreOffice Extensions website. When installing from the LibreOffice Extensions website, be sure to update the extension, since the versions published there are often not the latest. You can update an extension through the menu Tools → Manage extensions → Check for updates.