Digital library of construction informatics
and information technology in civil engineering and construction


Paper w78-2011-Paper-33:
Hybrid Construction Document Classification Model using Machine Learning (ML) and Text Segmentation Methodology

Facilitated by the SciX project

T Mahfouz

Abstract: The dynamic nature of the construction industry generates an enormous number of unstructured documents, such as technical specifications, meeting minutes, daily reports, claims, and construction litigation cases. With the increasing sophistication and pace of the industry, efficient use of these documents has become a necessity. This paper proposes a hybrid automated construction document classifier that combines Machine Learning (ML) and text segmentation. The research builds on a previous study by the author that used Support Vector Machines (SVM) to automate construction document classification. The current paper presents the improved results obtained by adding a pre-processing step that segments the text of construction documents. Lengthy construction documents such as claims typically address several topics, or several aspects of the same topic, within one document, which decreases the accuracy of SVM classifiers. The pre-processing step therefore aims to identify the passages within a document that relate to different topics. The adopted research methodology (1) gathered and used a corpus of 500 Differing Site Conditions (DSC) cases from the Federal Court of New York; (2) developed a tokenizing and parsing algorithm for these documents in C++; (3) implemented text segmentation adapted from Hearst's TextTiling algorithm; (4) developed SVM automated classification models; and (5) compared the outputs with results attained in previous work. The outcomes of this research are expected to enhance automated decision support tools developed for the construction industry.

Keywords: Document Classification, Text Segmentation, Machine Learning (ML), TextTiling, Support Vector Machines (SVM)
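
The segmentation step named in the abstract can be illustrated with a minimal sketch in the spirit of TextTiling. This is not the author's implementation (the paper reports a C++ tokenizer and parser). The function names, the pseudo-sentence width `w`, the block size `k`, the `min_depth` floor, and the cutoff rule are all assumptions chosen for illustration. The sketch compares word-frequency vectors on both sides of each gap and places a boundary where the similarity dips most sharply.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def gap_similarities(tokens, w=20, k=2):
    """Similarity at each gap between pseudo-sentences of w tokens,
    comparing up to k pseudo-sentences on each side of the gap."""
    chunks = [tokens[i:i + w] for i in range(0, len(tokens), w)]
    sims = []
    for gap in range(1, len(chunks)):
        left = Counter(t for c in chunks[max(0, gap - k):gap] for t in c)
        right = Counter(t for c in chunks[gap:gap + k] for t in c)
        sims.append(cosine(left, right))
    return sims


def depth_scores(sims):
    """Depth of each similarity valley relative to the nearest peaks
    found by climbing left and right from the gap."""
    depths = []
    for i, s in enumerate(sims):
        left = i
        while left > 0 and sims[left - 1] >= sims[left]:
            left -= 1
        right = i
        while right < len(sims) - 1 and sims[right + 1] >= sims[right]:
            right += 1
        depths.append((sims[left] - s) + (sims[right] - s))
    return depths


def find_boundaries(tokens, w=20, k=2, min_depth=0.1):
    """Return token offsets of likely topic boundaries.

    Assumed rule: a gap is a boundary when its depth is a local peak and
    exceeds both mean - std/2 and the hypothetical floor min_depth.
    """
    depths = depth_scores(gap_similarities(tokens, w, k))
    if not depths:
        return []
    mean = sum(depths) / len(depths)
    std = math.sqrt(sum((d - mean) ** 2 for d in depths) / len(depths))
    cutoff = max(mean - std / 2, min_depth)
    boundaries = []
    for i, d in enumerate(depths):
        left_ok = i == 0 or d >= depths[i - 1]
        right_ok = i == len(depths) - 1 or d >= depths[i + 1]
        if d > cutoff and left_ok and right_ok:
            boundaries.append((i + 1) * w)  # gap i follows chunk i
    return boundaries
```

Each resulting segment could then be passed to a classifier separately, which is the role the abstract assigns to the SVM models. The small widths suited to toy input would need tuning on real documents.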


Series: w78:2011


hosted by University of Ljubljana



© itc.scix.net 2003