Technical terms are important for knowledge mining, especially in the biomedical domain, where a vast number of documents is available. The number of terms (e.g., names of genes, proteins, chemical compounds, drugs, organisms, etc.) in the biomedical literature is increasing at an astounding rate. Existing terminological resources and scientific databases cannot keep up with the growth of neologisms. A domain-independent method for term recognition is therefore very useful for automatically recognizing terms in documents. The TerMine demonstrator integrates C-value multiword term extraction and AcroMine acronym recognition.
C-value is a domain-independent method for automatic term recognition (ATR) that combines linguistic and statistical analyses, with the emphasis placed on the statistical part. The linguistic analysis enumerates all candidate terms in a given text by applying part-of-speech tagging, extracting word sequences made up of adjectives and nouns, and filtering them against a stop-list. The statistical analysis assigns a termhood score to each candidate term using the following four characteristics:
the occurrence frequency of the candidate term
the frequency of the candidate term as part of other longer candidate terms
the number of these longer candidate terms
the length of the candidate term
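As an illustration of how these four characteristics can be combined, the following is a minimal sketch based on the commonly published form of the C-value measure, not TerMine's actual implementation. It assumes that candidate terms have already been extracted and counted, that each candidate is represented as a tuple of words, and that the length weight is the base-2 logarithm of the number of words (so single-word candidates score 0). The function name `c_value` and the example counts are hypothetical.

```python
import math
from collections import defaultdict


def c_value(freq):
    """Score candidate terms with a simplified C-value.

    freq maps each candidate term (a tuple of words) to its total
    occurrence frequency, nested occurrences included.
    """
    # For every candidate, collect the longer candidates that contain it.
    containers = defaultdict(set)
    for longer in freq:
        n = len(longer)
        for size in range(1, n):
            for start in range(n - size + 1):
                sub = longer[start:start + size]
                if sub in freq:
                    containers[sub].add(longer)

    scores = {}
    for term, f in freq.items():
        length_weight = math.log2(len(term))  # length of the candidate
        longer_terms = containers.get(term)
        if not longer_terms:
            # Not nested in any longer candidate: weight the raw frequency.
            scores[term] = length_weight * f
        else:
            # Subtract the average frequency of the longer candidates
            # (their summed frequency divided by their number).
            nested_freq = sum(freq[b] for b in longer_terms)
            scores[term] = length_weight * (f - nested_freq / len(longer_terms))
    return scores


# Hypothetical counts, for illustration only.
example = {
    ("soft", "contact", "lens"): 3,
    ("contact", "lens"): 5,
}
scores = c_value(example)
for term, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(" ".join(term), round(score, 3))
```

In this sketch, "contact lens" is penalized because three of its five occurrences are accounted for by the longer candidate "soft contact lens", whereas the longer candidate keeps its full frequency and benefits from its greater length.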
http://www.nactem.ac.uk/software/termine/