Paper title: |
Hybrid Construction Document Classification Model using Machine Learning (ML) and Text Segmentation Methodology |
Authors: |
T Mahfouz |
Summary: |
The dynamic nature of the construction industry generates an enormous number of documents in unstructured formats, such as technical specifications, meeting minutes, daily reports, claims, and construction litigation cases. With the increasing sophistication and pace of the industry, the efficient use of these documents has become essential. This paper proposes a hybrid automated construction document classifier that combines Machine Learning (ML) and Text Segmentation. The research builds on a previous study by the author that used Support Vector Machines (SVM) to automate construction document classification. The current paper presents the enhanced results obtained by adding a pre-processing step that segments the text of construction documents. Lengthy construction documents such as claims typically address different topics, or different aspects of the same topic, within one document, which decreases the accuracy of SVM classifiers. The pre-processing step therefore aims to identify the passages that relate to different topics within the same document. The adopted research methodology (1) gathered and utilized a corpus of 500 Differing Site Conditions (DSC) cases from the Federal Court of New York; (2) developed a tokenizing and parsing algorithm for these documents in C++; (3) implemented text segmentation adapted from Hearst’s TextTiling algorithm; (4) developed SVM automated classification models; and (5) compared the outputs with results attained in previous work. The outcomes of this research are expected to enhance automated decision support tools developed for the construction industry. |
Type: |
conference paper |
Year of publication: |
2011 |
Keywords: |
Document Classification, Text Segmentation, Machine Learning (ML), TextTiling, Support Vector Machines (SVM) |
Series: |
w78:2011 |
ISSN: |
2706-6568 |
Download paper: |
/pdfs/w78-2011-Paper-33.pdf |
Citation: |
T Mahfouz (2011).
Hybrid Construction Document Classification Model using Machine Learning (ML) and Text Segmentation Methodology. Proceedings of the 28th International Conference of CIB W78, Sophia Antipolis, France, 26-28 October (ISSN: 2706-6568),
http://itc.scix.net/paper/w78-2011-Paper-33
|