Supervised document classification based upon domain-specific term taxonomies
My article "Supervised document classification based upon domain-specific term taxonomies", co-authored with Matteo Cristani, has been accepted for publication in the International Journal of Metadata, Semantics and Ontologies (IJMSO).
Document classification is an active topic in recent terminological research, particularly in technology-oriented work. Several sophisticated techniques have been developed that classify documents by recognising specific linguistic features, such as particular terms or occurrences of phrases. Few real document classification applications, however, use natural language processing techniques that combine statistical analysis with human supervision, where the system fully automates the classification process but the instruction of the taxonomy remains an entirely human-centered activity. In this paper we focus on an application with these features; we then introduce a methodology that makes use of this application. The fundamental argument in favor of a specific methodology is that the analysis leading to the deployment of a term taxonomy can be seen as an ontology construction; we also discuss this aspect as a general motivation.

Some Rights Reserved

1 Comment:
TermExtractor is online! It's a FREE and high-performing tool for terminology extraction.
TermExtractor, my master's thesis, is online at the
address http://lcl2.di.uniroma1.it.
TermExtractor is a FREE and high-performing software
package for Terminology Extraction and a very useful starting-point for Ontology Construction.
The software helps a web community
extract and validate relevant terms in its
domain of interest by submitting an archive of
domain-related documents in any format
(txt, pdf, ps, dvi, tex, doc, rtf, ppt, xls, xml,
html/htm, chm, wpd, and also zip archives).
TermExtractor extracts the terminology consensually
referred to in a specific application domain. The
software takes as input a corpus of domain documents,
parses the documents, and extracts a list of
"syntactically plausible" terms (e.g. compounds,
adjective-nouns, etc.).
Document parsing assigns greater importance
to terms that appear in highlighted text layouts (title,
abstract, bold, italic, underlined, etc.). Two
entropy-based measures, called Domain Relevance and
Domain Consensus, are then used. Domain Consensus
selects only the terms which are consensually referred
to throughout the corpus documents. Domain Relevance
selects only the terms which are relevant to the domain
of interest; it is computed with reference to a set of
contrastive terminologies from different domains.
Finally, extracted terms are further filtered using
Lexical Cohesion, which measures the degree of
association of all the words in a terminological
string.
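
The two filtering measures described above can be illustrated with a
rough Python sketch. It is not TermExtractor's actual code: the function
names, the input format (per-document and per-domain term counts) and
the exact formulas are assumptions made for illustration. Domain
Consensus is sketched as the entropy of a term's distribution over the
documents of the corpus. Domain Relevance is sketched with a plain
frequency ratio against the contrastive domains, which is swapped in
here for simplicity and is not the entropy-based formulation the tool
itself uses.

```python
import math


def domain_consensus(term, doc_counts):
    """Entropy of a term's distribution over the documents of one corpus.

    doc_counts is a list of dicts (one per document) mapping term -> count.
    Assumed formula: a term spread evenly over many documents scores high,
    a term concentrated in a single document scores 0.
    """
    freqs = [counts.get(term, 0) for counts in doc_counts]
    total = sum(freqs)
    if total == 0:
        return 0.0
    entropy = 0.0
    for f in freqs:
        if f > 0:
            p = f / total
            entropy -= p * math.log2(p)
    return entropy


def domain_relevance(term, domain_counts, target):
    """Relative frequency of a term in the target domain, divided by its
    highest relative frequency in any domain (target or contrastive).

    domain_counts maps domain name -> dict of term -> count.
    Assumed simplification: a plain frequency ratio in [0, 1].
    """
    def relative_frequency(domain):
        counts = domain_counts[domain]
        total = sum(counts.values())
        return counts.get(term, 0) / total if total else 0.0

    best = max(relative_frequency(d) for d in domain_counts)
    return relative_frequency(target) / best if best else 0.0
```

Under these assumptions, a term would be kept only when both scores
exceed thresholds chosen for the corpus at hand; the sketch does not fix
any threshold values.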
NEW: TermExtractor now allows a group of users to
validate an extracted terminology. See the news at
http://lcl2.di.uniroma1.it/termextractor/news.jsp
--
Francesco Sclano
home page: http://lcl2.di.uniroma1.it/~sclano
msn: francesco_sclano@yahoo.it
skype: francesco978