KSII Transactions on Internet and Information Systems
Monthly Online Journal (eISSN: 1976-7277)

A Machine-Learning Based Approach for Extracting Logical Structure of a Styled Document


Abstract

A styled document is a document that uses diverse decorative features such as different fonts, colors, tables and images, and is generally authored in a word processor (e.g., MS-WORD, Open Office). Compared to a plain-text document, a styled document enables a human to easily recognize its logical structure, such as sections, subsections and contents. However, it is difficult for a computer to recognize this structure if the writer does not explicitly specify the type of each element by using the styling functions of the word processor. This is one of the obstacles to enhancing document version management systems, because they currently manage a document with the file as the unit of management, not the individual document elements. This paper proposes a machine-learning based approach to analyzing the logical structure of a styled document composed of sections, subsections and contents. We first suggest a feature vector for characterizing the elements of a styled document. It is composed of eight features, such as font size, indentation and period, each of which is frequently found in styled documents. We then train machine learning classifiers such as Random Forest and Support Vector Machine using the suggested feature vector. The trained classifiers are used to automatically identify the logical structure of a styled document. In our experiment, the approach achieved 92.78% precision and 94.02% recall in analyzing the logical structure of 50 styled documents.
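
To illustrate the idea of a feature vector for document elements, the following is a minimal Python sketch, not the authors' implementation. It covers only the three features the abstract names (font size, indentation and period); the remaining five are not listed here and are not guessed at. The `Element` type, its field names, the units, and the reading of "period" as "the text ends with a period" are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Element:
    """One paragraph-level element of a styled document (hypothetical type)."""
    text: str
    font_size: float    # assumed unit: points
    indentation: float  # assumed unit: points of left indent


def feature_vector(element: Element) -> list[float]:
    """Map an element to a numeric vector of three illustrative features."""
    # Assumption: the "period" feature marks whether the text ends with ".".
    ends_with_period = 1.0 if element.text.rstrip().endswith(".") else 0.0
    return [element.font_size, element.indentation, ends_with_period]


# Made-up sample elements: a heading-like line and a body-like sentence.
heading = Element("1. Introduction", font_size=14.0, indentation=0.0)
body = Element("A styled document contains fonts, colors and tables.",
               font_size=10.0, indentation=18.0)

print(feature_vector(heading))  # [14.0, 0.0, 0.0]
print(feature_vector(body))     # [10.0, 18.0, 1.0]
```

Vectors of this kind, paired with a label such as section, subsection or content, would be the training input for classifiers like the Random Forest and Support Vector Machine mentioned in the abstract.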


Cite this article

[IEEE Style]
T. Kim, S. Kim, S. Choi, J. Kim, J. Choi, J. Ko, J. Lee and Y. Cho, "A Machine-Learning Based Approach for Extracting Logical Structure of a Styled Document," KSII Transactions on Internet and Information Systems, vol. 11, no. 2, pp. 1043-1056, 2017. DOI: 10.3837/tiis.2017.02.023.

[ACM Style]
Tae-young Kim, Suntae Kim, Sangchul Choi, Jeong-Ah Kim, Jae-Young Choi, Jong-Won Ko, Jee-Huong Lee, and Youngwha Cho. 2017. A Machine-Learning Based Approach for Extracting Logical Structure of a Styled Document. KSII Transactions on Internet and Information Systems, 11, 2, (2017), 1043-1056. DOI: 10.3837/tiis.2017.02.023.

[BibTeX Style]
@article{tiis:21370,
  title="A Machine-Learning Based Approach for Extracting Logical Structure of a Styled Document",
  author="Tae-young Kim and Suntae Kim and Sangchul Choi and Jeong-Ah Kim and Jae-Young Choi and Jong-Won Ko and Jee-Huong Lee and Youngwha Cho",
  journal="KSII Transactions on Internet and Information Systems",
  DOI={10.3837/tiis.2017.02.023},
  volume={11},
  number={2},
  year="2017",
  month={February},
  pages={1043-1056}
}