Summary
This component helps you enhance your content with automatically generated summaries of DOCX files. It is combined with the download component, so the rendered output is a summary of the DOCX file plus a download button for the original file. The summary of a given DOCX file is saved in the same folder as the source DOCX. Where possible, the component also generates a simple ToC of the summary to make large summaries easy to navigate.
We cannot guarantee the quality of the summaries generated by this component, because it depends on several factors (such as the model used, the language, the spelling and grammar of the text, and the structure of the input). For this reason we advise reviewing the summary and correcting it as needed before deploying the result to your production site. After applying manual corrections, you must build and deploy the site again. The summary will not be generated again; the previously corrected text will be used to render the content.
We use locally downloaded models from Hugging Face. Your data will not be sent to Hugging Face to be used for training the models. This also means that the models are not updated to the latest version automatically; when needed, update them manually by removing the specific cached model from ~/.cache/huggingface/hub/* on macOS/Linux or from C:\Users\<YourUsername>\.cache\huggingface\hub on Windows. We recommend checking Hugging Face regularly for updates to the models used.
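The manual clean-up described above can also be scripted. The following is a minimal sketch, not part of the component: the helper names `cached_model_dir` and `remove_cached_model` are hypothetical, and it assumes the default cache location (which can be changed through environment variables such as `HF_HOME`) and the `models--<org>--<name>` folder naming used by the Hugging Face hub cache.

```python
from pathlib import Path
from typing import Optional
import shutil


def cached_model_dir(model_id: str, hub_cache: Path) -> Path:
    """Return the folder the hub cache is assumed to use for a model id."""
    # Assumption: each model lives under "models--<org>--<name>".
    return hub_cache / ("models--" + model_id.replace("/", "--"))


def remove_cached_model(model_id: str, hub_cache: Optional[Path] = None) -> bool:
    """Delete the cached copy of a model; return True if a folder was removed."""
    if hub_cache is None:
        # Default location on macOS/Linux and Windows alike (relative to home).
        hub_cache = Path.home() / ".cache" / "huggingface" / "hub"
    target = cached_model_dir(model_id, hub_cache)
    if not target.is_dir():
        return False  # nothing cached for this model
    shutil.rmtree(target)
    return True
```

Calling `remove_cached_model("facebook/bart-large-cnn")` would then remove the cached copy of that model, so that a fresh copy is downloaded the next time the model is needed.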
Usage
Generating DOCX summaries can be time- and resource-consuming. For this reason we strongly recommend building your site incrementally when using this component in many documents. Remember that once a summary has been generated and exists in the site, it will not be generated again at subsequent builds, even if manual corrections are applied to it between builds. So, if you plan to use this component in several documents, add it gradually, document by document. Do not add it to all the documents in one step. Add it to one document, build the site, and then move on to the next document.
Do not attempt to build your site directly on GitHub Pages when using this component. By default, the deployment action does not allow this, but it can easily be modified to do so. The build time can be very long and will consume your build minutes. Always build locally and deploy afterwards.
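To estimate how much summarisation work the next build may trigger, you can list the DOCX files that do not yet have a saved summary. This is only a rough sketch: the helper name `pending_docx` is hypothetical, and it looks at every DOCX file under a folder, whereas the component only summarises files referenced by an include, so the list may be longer than what the build actually processes.

```python
from pathlib import Path
from typing import List

# Naming convention of saved summaries: <docx_file_name>__word_summary.txt
SUMMARY_SUFFIX = "__word_summary.txt"


def pending_docx(root: Path) -> List[Path]:
    """Return DOCX files under root that have no summary file next to them."""
    pending = []
    for docx in sorted(root.rglob("*.docx")):
        summary = docx.with_name(docx.stem + SUMMARY_SUFFIX)
        if not summary.exists():
            pending.append(docx)
    return pending
```

If the list contains more than one or two files, consider adding the component to the remaining documents one at a time.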
DOCX vs. PDF
The quality of a DOCX summary depends very much on the structure of the document. If the docx contains well-defined sections (with Heading 1 titles), there is a good chance of obtaining a quality summary, including a relevant ToC for the summary. On the other hand, DOCX summaries may be shorter than PDF summaries, and thus contain less information.
For DOCX summaries it is not possible to generate a picture of the first page, because Word builds pages at runtime based on various parameters. We wanted to be platform independent, and generating a picture of the docx first page would have required Word and some system libraries to be installed on the computer where the site is built.
Example
We will use the same files in docx and pdf formats to allow a simple comparison of the quality of the generated summaries.
The next examples are based on the following files:
📁 docx-summary/
├── 📄 docx-summary.md
├── 📄 pe.docx
├── 📄 pe__word_summary.txt
├── 📄 pr.docx
├── 📄 pr__word_summary.txt
├── 📄 pt.docx
└── 📄 pt__word_summary.txt
The model used for this summarisation is:
word_sum_model: "facebook/bart-large-cnn"
In the first example we show a generated summary without any manual correction, to make the limitations of the model clear (it is one of the best rated open source, multilanguage summarisation models for medium and large texts). However, the model can easily be replaced in the _data/buildConfig.yml configuration file. For a multilanguage documentation site, it is even possible to set a different model for each language. Be aware that not all open source models work in this summarisation context. If an incompatible model is set, the error message raised at build time gives a list of compatible models to choose from on Hugging Face.
Note that the DOCX summaries are saved in the same folder as the original DOCX file and are named <docx_file_name>__word_summary.txt. When building the site, as long as a file matching this naming convention is found in the folder, the summary for <docx_file_name>.docx will not be generated again. Feel free to correct the summary file manually to remove irrelevant paragraphs and/or model hallucinations. Be aware that the site must be built again after each manual correction. To force the summary to be re-generated, delete <docx_file_name>__word_summary.txt and build the site again.
Keep in mind that re-generating the summary discards any previously applied manual corrections.
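Because re-generation discards manual corrections, it can be safer to set the corrected summary aside instead of deleting it outright. The sketch below is one possible approach and not a feature of the component: the helper name `force_regeneration` and the `.bak` extension are assumptions chosen so that the renamed file no longer matches the naming convention.

```python
from pathlib import Path
from typing import Optional


def force_regeneration(docx_path: Path) -> Optional[Path]:
    """Rename the saved summary of a DOCX so the next build re-generates it.

    Returns the backup path, or None when no summary exists yet.
    """
    summary = docx_path.with_name(docx_path.stem + "__word_summary.txt")
    if not summary.exists():
        return None
    # Hypothetical backup name, e.g. "pr__word_summary.txt.bak".
    backup = summary.with_name(summary.name + ".bak")
    # Path.replace overwrites an older backup if one is already present.
    summary.replace(backup)
    return backup
```

After the next build you can compare the new summary with the backup and carry over the corrections you still need. You may want to keep such backup files out of the published site.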
DOCX 1
The next example shows a generated summary without any subsequent manual correction. Observe that the quality of the summary is much better than that of the summary generated for the pdf version of the same document. Except for a small correction in the last paragraph (the model sometimes misinterprets abbreviations), the docx summary does not require heavy manual corrections.
{% include elements/docx-summary.html
file="_experiments/docx-summary/pr.docx"
btnType="danger"
btnOutline="false"
btnText="Download"
sBorder="true"
sh="300px"
%}
START DOCX SUMMARY
This Progress Report was drafted in accordance with the provisions of the ToR (§7.1-Reporting Requirements) and with the Revised Organisation and Methodology (ANNEX 6 to the Inception Report). The reporting period is October – December (incl.) 2024, containing Quarter 8 of the implementation.
Within the reporting period the scope of the work was extended with the activities that will allow us to assist the beneficiary for achieving the operational readiness.
This Progress Report was drafted in accordance with the provisions of the ToR (§7.1-Reporting Requirements) and with the Revised Organisation and Methodology. The project status presented in this report will be the new reporting baseline as of 1st January 2025.
The reporting period for this Progress Report is 01 October 2024 – 31 December 2024. The new end date of the contract is 16 July 2026. Sections 5, 6 and 7 will present the full scope together with the overall planning and log frame.
The team mobilisation was performed after the contract signature. The focus was to appoint the roles that will enter implementation in the on-going phases of the project. For the other roles, we created a pool of experts to be presented to the Beneficiary. Steering Committee Meeting took place in Birnin Zana on 02 October 2024.
The Beneficiary was informed about the risks of switching later to an official Google developer account belonging to a public administration entity from Wakanda. We informed the Steering Committee that the other comments and findings from the first SAT session (all of them being no blocking issues, cosmetic issues, UX issues) were incorporated in the final version of the eID system.
eID API is to be used by third parties for integrating eID features in external systems. The Government must know at any time (and, when the case, must be able to immediately react and restrict/block any potential wrong usage of APIs) who, how, why and using what means has access to API. Final Site Acceptance Testing session was carried out in Birnin Zana, on 6th and 7th of November 2024.
Technical difficulties described in the Installation and Configuration Report drafted for the previous SAT session (July 2024) were solved. The conclusion is that the eID system is fully operable from the technical point of view. A dedicated API workshop and testing session was organised in Birnin Zana on 4th and 5th of December 2024.
The conclusion is that the eID system provides all needed API for integrating eID features into third party systems. Wakanda received through this project a fully featured electronic wallet together with the possibility to use it for the use cases related to eID.
It was clear that a well-defined formal and technical API control and monitoring framework must be put in place to provide, at any time, the information about who, why, how and when uses the APIs. eID features (authentication, signature, seal) are all supported by the APIs and allow the implementation of a wide range of use cases.
The Full plan for achieving Operational Readiness (IMP.OR.D10) identifies the gaps between the situation at the end of the technical implementation (technical readiness) and the TO-BE situation now when the operational readiness will be achieved. The eID system is hosted by the Agency for Informational Society (WISA) which activates under the Ministry of Interior (WMI).
The main internal users are the Registration Officers who belongs to the Civil Registration Agency (WKCA) for issuing the digital certificates for citizens. There is no single body to coordinate all these stakeholders and to properly maintain the eIDs system. The needed resources (human and material) necessary for providing the support services to the end-users (citizens and RAO) are not identified. The necessary procedures and tools are not in place.
WKBR employees, issuing digital certificates (eSeal) is a new task and their offices across the country are not equipped to provide such services.
Even from the Steering Committee Meeting held in Birnin Zana we raised the issue of eID “ownership” which means the formal appointment of the eID operator. It is illustrated in the above-mentioned table that we will carry out some design activities in the first quarter of 2025. In any case, the risk impact on achieving the Operational Readiness is H (high), even blocking for some activities.
During the reporting period the scope of the work was extended. This generated modifications of the Work Breakdown Structure in the sense of adding phases/implementation chapters/deliverables. The new WBS is presented in this section. For the updated Work Break down Structure, we used the following colour convention:
For the different log frame components (objectives, outputs activities), we use the following colour convention to mark the vertical bars for identifying the mentioned components. For the updated status of indicators/means of verification on the Log Frame we used the following colours to highlight the progress. At the date of this report, the progress on the log frame is shown in the table below:Table 3: Log frame progress.
The summary of the workplan is: Technical readiness was achieved in December 2024. The contract was extended until July 2026. L2 and L3 support will be provided to the Beneficiary by the end of the contract. Operational readiness will be achieved in 2025.
The eID operator will be able to operate the system under all aspects.
The next reporting period is Q9 (January, February, March 2025) Next reporting period will include also the period necessary for drafting the next Progress Report (MC.PR-9) The expected new deliverables will be:.- Design of the Support Organisation (IMP.TS.D11-1)- Full design of the Tools for Technical Support (as part of IMP.
TS. D11-2)- Implementation of the tools needed by the eID operator to provide technical support to the end-users (citizens and RAO)
END DOCX SUMMARY
DOCX 2
The quality of the summary depends very much on how the original document is structured and written. The next example shows another generated summary, again without any correction, for a document with a different structure that yields better quality text. Note that, for this docx structure, the quality of the docx summary is similar to that of the pdf summary (small corrections, such as removing the numbering of section titles, may still be needed here).
{% include elements/docx-summary.html
file="_experiments/docx-summary/pt.docx"
btnType="danger"
btnOutline="false"
btnText="Download"
sBorder="true"
sh="300px"
%}
START DOCX SUMMARY
This article is not a scientific paper, but a template file and guidelines for helping authors prepare their scientific papers. As illustrated in this article, the structure of manuscripts is: Paper header, title, authors, affiliations, abstract, type of paper and keywords, main text, acknowledgements, references, appendix and biographies.
The RonPub Journal Paper Template can be used to help you with your writing. For more information on the RonPub journal, visit: http://www.ronpub.org/.
Paper header, title, authors, affiliations, abstract, type of paper and keywords, main text, acknowledgements, references, appendix and biographies. Main text is a large as well as major part of a paper. A separate section is needed to describe the structure and content guidelines for the main text. Subsections should be divided into clearly defined and numbered sections.
The main text consists of multiple sections, and typically includes: Introduction, related work, own contribution, discussion and conclusion. The formatting and styling has been setup for sections, subsections and subsubsections in this template document. For further subdivided subsections is the same as one for subsubsection.
This template has been tailored for output on the A4 paper size (8.3in x 11.7in/210mm x 297mm) The columns on the last page should be as close as possible to equal length. Authors should follow the formatting and styling which have been set up in this template.
Citation of a reference in the text should be identified by its number in square brackets. The actual authors can be referred to, but the reference numbers must always be given. Necessary footnotes should be numbered using Arabic numerals. As a minimum for web references and online documents, the full URL should be given and the date when the reference was last accessed.
This template document prescribes the format, style and structure of scientific papers. All manuscripts for RonPub journals should comply with this template. Appendices are optional. They should be placed after the references and before the author biographies.
If there is more than one appendix, they should be identified as A, B, etc. Author Biographies. A biography for each author should be supplied here. Each author please provide a photograph in her or his biography.
The author photograph should have a width of 3 cm. The biography should not be less than 70 words. For more information, visit the Author Biographies page.
END DOCX SUMMARY
DOCX 3
The next example shows another generated summary, without any correction, for a document with yet another structure. Note that, for this docx structure, the summary requires some manual corrections, but the quality is still better than that of the pdf summary, and the sections were identified automatically, whereas in the pdf summary the section titles were added or modified manually.
{% include elements/docx-summary.html
file="_experiments/docx-summary/pe.docx"
btnType="danger"
btnOutline="false"
btnText="Download"
sBorder="true"
sh="300px"
%}
START DOCX SUMMARY
Paper setup must be in A4 size with Margin: Top 1.78 cm, Bottom 1.74 cm, Left 1.65 cm, Right 1.66 cm, Gutter 0.63 cm.
Paper must be one Columns after Authors Name. Whole paper must be with: Font Name Times New Roman, Font Size 10, Line Spacing 1.05, indentation 0.36 cm first line EXCEPT Abstract, Keywords, Paper Title, References, Author Profile and Manuscript Details.
Highlight a section that you want to designate with a certain style, then select the appropriate name on the style menu. The style will adjust your fonts and line spacing. Do not change the font sizes or line spacing to squeeze more text into a limited number of pages. Use italics for emphasis; do not underline.
When you submit your final version, after your paper has been accepted, prepare it in two-column format. The authors of the accepted manuscripts will be given a copyright form and the form should accompany your final submission. To insert images in Word, position the cursor at the insertion point and either use Insert | Picture | From File or copy the image to the Windows clipboard.
If you are using Word, use either the Microsoft Equation Editor or the MathType add-on (http://www.mathtype.com) “Float over text” should not be selected. For more information on how to write equations in your paper, visit: www.pennlive.com/how-to-write-equations.
Use either SI (MKS) or CGS as primary units. English units may be used as secondary units (in parentheses) Avoid combining SI and CGS units, such as current in amperes and magnetic field in oersteds. If you must use mixed units, clearly state the units for each quantity in an equation.
Use the abbreviation “Fig.” even at the beginning of a sentence. Place figure captions below the figures; place table titles above the tables. Do not use color unless it is necessary for the proper interpretation of your figures.
Figure axis labels are often a source of confusion. Use APA reference style. The in-text citation can take two forms: parenthetical and narrative. Papers that have not been published should be cited as “unpublished” Papers that are submitted for publication should be citing as ‘submitted for publication’ Please give affiliations and addresses for private communications.
Number equations consecutively with equation numbers in parentheses flush with the right margin, as in (1) Do not use abbreviations in the title unless they are unavoidable. Hyphenate complex modifiers: “zero-field-cooled magnetization.” Indicate sample dimensions as “0.1 cm 0.2 cm,” not ‘0.10.2cm2’ A parenthetical statement at the end of a sentence is punctuated outside of the closing parenthesis.
In American English, periods and commas are within quotation marks. Avoid contractions; for example, write “do not” instead of “don’t” If you wish, you may write in the first person singular or plural.
The word “data” is plural, not singular. The term for residual magnetization is “remanence” Do not confuse “imply” and “infer” There is no period after the ‘et’ in the Latin abbreviation “et al.” An excellent style manual and source of information for science writers is [9]. The abbreviation “i.e.,” means ‘that is,’ and the abbreviation "e.g.,’ means “for example” (these abbreviations are not italicized).
The submitting author is responsible for obtaining the agreement of all coauthors and any consent required from sponsors before submitting a paper. It is the obligation of the authors to cite relevant prior work. Authors of rejected papers may revise and resubmit them to the journal again. For confidential support call the Samaritans on 08457 90 90 90, visit a local Samaritans branch or see www.samaritans.org.
Technical papers submitted for publication must advance the state of knowledge and must cite relevant prior work. The length of a submitted paper should be commensurate with the importance, or appropriate to the complexity, of the work. Authors must convince both peer reviewers and the editors of the scientific and technical merit of a paper.
The preferred spelling of the word ‘acknowledgment’ in American English is without an “e” after the “g” Use the singular heading even if you have many acknowledgments. Authors declare that they do not have any conflict of interest. (2002), Auditing in Australia: an integrated approach. 5th ed.
Frenchs Forest: Pearson Education Australia. Simons, N.E., Menzies, B., Matthews, M. A.
(2001). Short Course in Soil and Rock Slope Engineering. London: Thomas Telford Publishing. Dillard, J. P. (2020).
Currents in the study of persuasion. In M. Oliver, A. Raney, & J. Bryant (Eds), Media effects: Advances in theory and research.
Routledge. Forneau, E., Bovet, D. (1933).
Recherches sur l'action sympathicolytique d'un nouveau dérivé du dioxane. Arch Int Pharmacodyn 46:178–191 French. Wikipedia (2021, May 28). Introduction to general relativity.
National Institute of Mental Health (2018, July). Anxiety disorders.
END DOCX SUMMARY
There is no general rule for predicting whether a given DOCX will produce a better quality summary. The best way to use this experiment is to start with the auto-generated summary, review it with care and apply manual corrections where needed.
Corrections
As described, generated summaries can be corrected manually to remove irrelevant paragraphs or model hallucinations. The summarisation algorithm is set to generate rather detailed summaries, so that you can choose the relevant parts to render in the document. Docx summaries are usually more structured than pdf ones, because structure extraction from a pdf is looser than from a docx, where it can be hooked to the document headings. On the other hand, pdf summaries may contain more relevant details.
When working with docx, it is good practice to compare the summary generated for the original docx with the summary generated for the same document converted to pdf, and to choose the one that better suits your purpose.
Limitations
Tables, images and sections such as Table of Contents, List of Tables and List of Figures are usually excluded from summarisation. As far as possible, annotations, comments, references, bibliography, citations and similar sections are also excluded.
The models may not always detect such sections accurately, which is another reason why we strongly recommend checking the generated summaries and applying manual corrections as needed. Summarisation does up to 90% of the job, but the remaining part can make the difference.
Parameters
file: path to the DOCX file, given as a path relative to the root of the doc-contents folder.
btnType: type of the download button; default value is primary. See Downloads.
btnOutline: type of the outline of the download button; default value is false. See Downloads.
btnText: text on the download button; default value is Download. See Downloads. Note that this label is not translated automatically. Since it is set as a parameter, it must be adapted manually to the site language.
sBorder: whether the DOCX summary block has a thin left border; default value is false.
sh: fixed height of the DOCX summary block (for example 300px); default value is auto, meaning no fixed height.
Since summaries can sometimes be quite long, we recommend setting a fixed height to improve the readability of the document and the reading experience.
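When the component has to be added to several documents, the include tag can be generated rather than typed by hand. The following is an illustrative sketch only: the helper name `docx_summary_include` is hypothetical, the defaults mirror the ones listed above, and the output follows the layout of the examples on this page.

```python
def docx_summary_include(
    file: str,
    btnType: str = "primary",
    btnOutline: str = "false",
    btnText: str = "Download",
    sBorder: str = "false",
    sh: str = "auto",
) -> str:
    """Build the Liquid include tag for the DOCX summary component."""
    params = {
        "file": file,
        "btnType": btnType,
        "btnOutline": btnOutline,
        "btnText": btnText,
        "sBorder": sBorder,
        "sh": sh,
    }
    lines = ["{% include elements/docx-summary.html"]
    # One parameter per line, in the same order as the documented list.
    lines += [f'  {name}="{value}"' for name, value in params.items()]
    lines.append("%}")
    return "\n".join(lines)
```

For example, `docx_summary_include("_experiments/docx-summary/pr.docx", btnType="danger", sBorder="true", sh="300px")` returns a tag with the same parameters as the first example on this page, including a fixed height of 300px.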