The development of taxonomies and ontologies is a human-intensive process that requires prohibitively large commitments of time and cost. In our previous work, we identified an experimentation framework for semi-automatic taxonomy/hierarchy generation from unstructured text. In the preliminary results presented there, taxonomy/hierarchy quality was lower than we had anticipated. In this paper, we present two variations of our experimentation framework: Latent Semantic Indexing (LSI) for document indexing, and the use of term vectors to prune the labels assigned to nodes in the final taxonomy/hierarchy. Using our previous taxonomy/hierarchy quality results as the baseline, we present results that demonstrate significant improvement in taxonomy/hierarchy label quality resulting from these two variations, and we offer insights into the reasons for that improvement. Finally, we discuss methods for further improving taxonomy/hierarchy quality.
Available at: http://works.bepress.com/amit_sheth/182/
University of Georgia, Athens, Computer Science Department, UGA-CS-TR-04-006
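
For readers unfamiliar with LSI, the following is a minimal illustrative sketch of the general technique, not the implementation used in the report. It builds a toy term-document matrix, finds the top-k left singular vectors by power iteration with deflation (a simple stand-in for a library SVD routine), and compares documents in the reduced space. The terms, documents, function names, and the choice k = 2 are all assumptions made for illustration.

```python
import math

def matvec(matrix, vec):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, vec)) for row in matrix]

def normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def top_left_singular_vectors(term_doc, k, iters=200):
    """Top-k left singular vectors of a terms-by-documents matrix.

    Uses power iteration with deflation on the Gram matrix A * A^T,
    whose eigenvectors are the left singular vectors of A.
    """
    n_terms = len(term_doc)
    gram = [[sum(a * b for a, b in zip(r1, r2)) for r2 in term_doc]
            for r1 in term_doc]
    basis = []
    for _component in range(k):
        vec = normalize([1.0] * n_terms)  # deterministic starting vector
        for _step in range(iters):
            vec = normalize(matvec(gram, vec))
        eigval = sum(v * w for v, w in zip(vec, matvec(gram, vec)))
        basis.append(vec)
        # Deflate: remove the component just found before the next pass.
        gram = [[gram[i][j] - eigval * vec[i] * vec[j]
                 for j in range(n_terms)] for i in range(n_terms)]
    return basis

def project_documents(term_doc, basis):
    """Represent each document (column) by its coordinates in the latent basis."""
    n_terms, n_docs = len(term_doc), len(term_doc[0])
    return [[sum(u[t] * term_doc[t][d] for t in range(n_terms)) for u in basis]
            for d in range(n_docs)]

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

# Toy data (hypothetical): rows are terms, columns are documents.
terms = ["taxonomy", "hierarchy", "protein"]
term_doc = [
    [1, 0, 1, 0, 0],  # "taxonomy" occurs in documents 0 and 2
    [0, 1, 1, 0, 0],  # "hierarchy" occurs in documents 1 and 2
    [0, 0, 0, 1, 1],  # "protein" occurs in documents 3 and 4
]

raw_docs = [[row[d] for row in term_doc] for d in range(len(term_doc[0]))]
latent_docs = project_documents(term_doc, top_left_singular_vectors(term_doc, k=2))

raw_similarity = cosine(raw_docs[0], raw_docs[1])
latent_similarity = cosine(latent_docs[0], latent_docs[1])
unrelated_similarity = cosine(latent_docs[0], latent_docs[3])
```

In this toy example, documents 0 and 1 share no terms, so their raw cosine similarity is zero, yet they become close in the latent space because both terms co-occur in document 2. This is the property that makes LSI attractive for indexing documents before grouping them into a hierarchy.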