DOCX Summary
Last update: 17-Aug-2025

Summary

This component enhances your content with auto-generated summaries of DOCX files. It is combined with the download component, so the rendered output is a summary of the DOCX file plus a button to download the original file. The summary of a given DOCX file is saved in the same folder as the source DOCX. Where possible, the component also generates a simple ToC for the summary to make long summaries easier to navigate.

We cannot guarantee the quality of the summary generated by this component, because it depends on several factors (such as the model used, the language, the spelling and grammar of the text, and the structure of the input). For this reason we advise reviewing the summary and correcting it as needed before deploying the result to your production site. After applying manual corrections, you need to build and deploy the site again. The summary will not be generated again; the previously corrected text will be used to render the content.

Usage

Generating DOCX summaries can consume considerable time and resources. For this reason we strongly recommend building your site incrementally when using this component in many documents. Remember that once a summary has been generated and exists in the site, it will not be generated again at subsequent builds, even if manual corrections are applied to it between builds. So, if you plan to use this component in multiple documents, add it gradually, document by document. Do not include it in all the documents at once: add it to one document, build the site, and then move on to the next document.
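To plan an incremental rollout, it can help to see which DOCX files do not have a saved summary yet. The following is a minimal sketch, not part of the component itself. It assumes the `<docx_file_name>__word_summary.txt` naming convention described in the Example section; the function name `pending_docx` is hypothetical.

```python
from pathlib import Path

# Suffix of the saved summary file, per the naming convention on this page.
SUMMARY_SUFFIX = "__word_summary.txt"


def pending_docx(root):
    """Return the DOCX files under `root` that have no saved summary beside them."""
    pending = []
    for docx in sorted(Path(root).rglob("*.docx")):
        summary = docx.with_name(docx.stem + SUMMARY_SUFFIX)
        if not summary.exists():
            pending.append(docx)
    return pending
```

Calling `pending_docx("doc-contents")` before a build would list the documents whose summaries are still to be generated, which gives a rough idea of how long that build may take.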

DOCX vs. PDF

The quality of a DOCX summary depends heavily on the structure of the document. If the DOCX contains well-defined sections (with Heading 1 titles), there is a good chance of obtaining a quality summary, including a relevant ToC for the summary. On the other hand, DOCX summaries may be shorter than PDF summaries and thus contain less information.
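Before adding the component to a document, you may want to check whether the DOCX actually contains Heading 1 sections. The sketch below is one possible way to do that with the Python standard library only; it is not how the component itself reads the file. It assumes the paragraph style ID is `Heading1`, which is the usual ID in Word's default template but may differ in customised or localised templates. The function name `heading1_titles` is hypothetical.

```python
import xml.etree.ElementTree as ET
import zipfile

# WordprocessingML namespace used inside word/document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def heading1_titles(docx_file):
    """Return the text of every paragraph whose style ID is 'Heading1'.

    `docx_file` can be a path or a binary file-like object.
    """
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    titles = []
    for para in root.iter(f"{{{W_NS}}}p"):
        style = para.find(f"{{{W_NS}}}pPr/{{{W_NS}}}pStyle")
        # Assumption: the default style ID 'Heading1' marks a top-level section.
        if style is not None and style.get(f"{{{W_NS}}}val") == "Heading1":
            text = "".join(t.text or "" for t in para.iter(f"{{{W_NS}}}t"))
            titles.append(text)
    return titles
```

An empty result suggests the document has no top-level headings, in which case the generated summary and its ToC are likely to need more manual correction.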

For DOCX summaries it is not possible to generate a picture of the first page, because Word builds pages at runtime based on various parameters. We wanted the component to be platform independent, and generating a picture of the first page of a DOCX would have required Word and some system libraries to be installed on the computer where the site is built.

Example

The next examples are based on the following files:

📁 docx-summary/
├── 📄 docx-summary.md
├── 📄 pe.docx
├── 📄 pe__word_summary.txt
├── 📄 pr.docx
├── 📄 pr__word_summary.txt
├── 📄 pt.docx
└── 📄 pt__word_summary.txt

The model used for this summarisation is:

word_sum_model: "facebook/bart-large-cnn"

The first example shows a generated summary without any manual correction, to illustrate the limitations of the model used (which is one of the best-rated open-source, multilanguage summarisation models for medium and large texts). However, the model can easily be replaced in the _data/buildConfig.yml configuration file. For a multilanguage documentation site, it is even possible to set a different model for each language. Be aware that not all open-source models work in this summarisation context. If an incompatible model is set, the error message raised at build time lists compatible models that can be chosen from Huggingface.

Note that DOCX summaries are saved in the same folder as the original DOCX file and are named <docx_file_name>__word_summary.txt. When building the site, as long as a file following this naming convention is found in the folder, the summary for <docx_file_name>.docx will not be generated again. Feel free to apply any manual correction to the summary file to remove irrelevant paragraphs and/or model hallucinations. Be aware that the site must be built again after each manual correction. To force the summary to be regenerated, delete <docx_file_name>__word_summary.txt and build the site again.
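The naming convention and the forced regeneration described above can be sketched as a small helper script. This is an illustration only and not part of the component; the function names `summary_path` and `force_regeneration` are hypothetical.

```python
from pathlib import Path


def summary_path(docx_path):
    """Return the path where the summary of `docx_path` is expected to be saved."""
    docx = Path(docx_path)
    return docx.with_name(f"{docx.stem}__word_summary.txt")


def force_regeneration(docx_path):
    """Delete the saved summary so that the next site build generates it again.

    Returns True if a summary file was removed, False if none existed.
    Any manual corrections in the deleted file are lost.
    """
    summary = summary_path(docx_path)
    if summary.exists():
        summary.unlink()
        return True
    return False
```

For example, `summary_path("docx-summary/pr.docx")` points to `docx-summary/pr__word_summary.txt`. Because deleting the file also discards manual corrections, consider keeping a copy of a corrected summary before forcing regeneration.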

DOCX 1

The next example shows a generated summary without any subsequent manual correction. Observe that the quality of the summary is much better than that of the summary generated for the PDF version of the same document. Except for a small correction in the last paragraph (the model sometimes misinterprets abbreviations), the DOCX summary does not require heavy manual corrections.

{% include elements/docx-summary.html 
    file="_experiments/docx-summary/pr.docx"
    btnType="danger"
    btnOutline="false"
    btnText="Download"
    sBorder="true"
    sh="300px" 
%}

START DOCX SUMMARY

pr.docx

END DOCX SUMMARY

PDF 1

DOCX 2

The quality of the summary depends heavily on how the original document is structured and written. The next example shows another generated summary, without any additional correction, for a document that has a different structure and produces better-quality text. Note that, for this DOCX structure, the quality of the DOCX summary is similar to that of the PDF summary (small corrections, such as removing the numbering of section titles, may still be needed).

{% include elements/docx-summary.html 
    file="_experiments/docx-summary/pt.docx"
    btnType="danger"
    btnOutline="false"
    btnText="Download"
    sBorder="true"
    sh="300px" 
%}

START DOCX SUMMARY

pt.docx

END DOCX SUMMARY

PDF 2

DOCX 3

The next example shows another generated summary, without any additional correction, for a document with yet another structure. Note that, for this DOCX structure, the summary requires some manual corrections, but its quality is still better than that of the PDF summary, and the sections were identified automatically, whereas in the PDF summary the section titles were added or modified manually.

{% include elements/docx-summary.html 
    file="_experiments/docx-summary/pe.docx"
    btnType="danger"
    btnOutline="false"
    btnText="Download"
    sBorder="true"
    sh="300px" 
%}

START DOCX SUMMARY

pe.docx

END DOCX SUMMARY

PDF 3

Corrections

As described, corrections can be applied manually to the generated summaries to remove irrelevant paragraphs or model hallucinations. The summarisation algorithm is set to generate rather detailed summaries, so that you can choose the relevant parts to render in the document. DOCX summaries are usually more structured than PDF ones, because structure extraction from a PDF is looser than from a DOCX, where it can be hooked to the document's headings. On the other hand, PDF summaries may contain more relevant details.

Limitations

Tables, images and sections such as Table of Contents, List of Tables and List of Figures are usually excluded from summarisation. To the best extent possible, annotations, comments, references, bibliography, citations and similar sections are also excluded.

Parameters

  • file: path to the DOCX file, given relative to the root of the doc-contents folder
  • btnType: type of the download button, default value is primary. See Downloads.
  • btnOutline: type of the outline of the download button, default value is false. See Downloads.
  • btnText: text on the download button, default value is Download. See Downloads. Note that this label is not translated automatically. Since it is set as a parameter, it must be adapted manually to the site language.
  • sBorder: whether the DOCX summary block has a thin left border, default value is false
  • sh: fixed height of the DOCX summary block (for example 300px), default value is auto (no fixed height).
