Background In the realm of knee pathology, magnetic resonance
imaging (MRI) has the advantage of visualising all structures within the knee
joint, which makes it a valuable tool for increasing diagnostic accuracy and
planning surgical treatments. Therefore, clinical narratives found in MRI
reports convey valuable diagnostic information. A range of studies have proven
the feasibility of natural language processing for information extraction from
clinical narratives. However, no study has focused specifically on MRI reports in
relation to knee pathology, possibly due to the complexity of knee anatomy and
a wide range of conditions that may be associated with different anatomical
entities. In this paper we describe KneeTex, an information extraction system
that operates in this domain.
Methods As an ontology-driven information extraction system,
KneeTex makes active use of an ontology to strongly guide and constrain text
analysis. We used automatic term recognition to facilitate the development of
a domain-specific ontology with sufficient detail and coverage for text mining
applications. In combination with the ontology, high regularity of the
sublanguage used in knee MRI reports allowed us to model its processing by a
set of sophisticated lexico-semantic rules with minimal syntactic analysis.
The main processing steps involve named entity recognition combined with
coordination, enumeration, ambiguity and co-reference resolution, followed by
text segmentation. Ontology-based semantic typing is then used to drive the
template filling process.
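To make this pipeline concrete, the following is a minimal sketch of ontology-driven named entity recognition and template filling, with a toy ontology fragment and a hypothetical negation trigger list (not the actual KneeTex lexicon or rule set):

```python
# Toy illustration of ontology-driven template filling (not the actual
# KneeTex implementation): terms are recognised by dictionary look-up,
# semantically typed via a small ontology fragment and slotted into a
# template; a trigger list handles simple negation.
import re

# Hypothetical ontology fragment: surface form -> (concept, semantic type)
ONTOLOGY = {
    "acl": ("anterior cruciate ligament", "anatomy"),
    "anterior cruciate ligament": ("anterior cruciate ligament", "anatomy"),
    "medial meniscus": ("medial meniscus", "anatomy"),
    "tear": ("tear", "finding"),
    "effusion": ("effusion", "finding"),
}
NEGATION_TRIGGERS = ("no", "without", "no evidence of")

def extract(sentence):
    text = sentence.lower()
    negated = any(text.startswith(t) or f" {t} " in text for t in NEGATION_TRIGGERS)
    entities = []
    # longest-match-first dictionary look-up
    for form in sorted(ONTOLOGY, key=len, reverse=True):
        if re.search(rf"\b{re.escape(form)}\b", text):
            entities.append(ONTOLOGY[form])
            text = text.replace(form, " ")  # avoid re-matching nested terms
    template = {"finding": None, "anatomy": None, "negation": negated}
    for concept, sem_type in entities:
        template[sem_type] = concept
    return template

print(extract("No evidence of tear of the medial meniscus."))
# {'finding': 'tear', 'anatomy': 'medial meniscus', 'negation': True}
```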
Results We adopted an existing ontology, TRAK (Taxonomy for
RehAbilitation of Knee conditions), for use within KneeTex. The original TRAK
ontology was expanded from 1,292 concepts, 1,720 synonyms and 518 relationship
instances to 1,621 concepts, 2,550 synonyms and 560 relationship instances.
This provided KneeTex with a very fine-grained lexico-semantic knowledge base,
which is highly attuned to the given sublanguage. Information extraction
results were evaluated on a test set of 100 MRI reports. The gold standard
consisted of 1,259 filled template records with the following slots: finding,
finding qualifier, negation, certainty, anatomy and anatomy qualifier. KneeTex
extracted information with precision of 98.00%, recall of 97.63% and F-measure
of 97.81%, which is in line with human-like performance.
KneeTex is an open-source, stand-alone application for information extraction
from narrative reports that describe an MRI scan of the knee. Given an MRI
report as input, the system outputs the corresponding clinical findings mapped
onto TRAK, an ontology that formally models knowledge relevant for the
rehabilitation of knee conditions. As a result, formally structured and coded
information allows for complex searches to be conducted efficiently over the
original MRI reports, thereby effectively supporting epidemiologic studies of knee conditions.
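For reference, the F-measure reported above is the harmonic mean of precision and recall, and the three figures are mutually consistent:

```latex
F = \frac{2PR}{P + R} = \frac{2 \times 98.00 \times 97.63}{98.00 + 97.63} \approx 97.81\%
```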
In this paper we investigate the role of idioms in automated approaches to sentiment
analysis. To estimate the degree to which the inclusion of idioms as features may
potentially improve the results of traditional sentiment analysis, we compared our
results to two such methods. First, to support idioms as features we collected a
set of 580 idioms that are relevant to sentiment analysis, i.e. those that can be
mapped to an emotion. These mappings were obtained using a web-based crowdsourcing
approach. The quality of the crowdsourced information is demonstrated with high agreement
among five independent annotators calculated using Krippendorff's alpha coefficient
(α = 0.662). Second, to evaluate the results of sentiment analysis, we assembled
a corpus of sentences in which idioms are used in context. Each sentence was annotated
with an emotion, which formed the basis for the gold standard used for the comparison
against two baseline methods. The performance was evaluated in terms of three measures
- precision, recall and F-measure. Overall, our approach achieved 64% and 61% across
all three measures in the two experiments, improving the baseline results by 20 and
15 percentage points, respectively. F-measure was significantly improved across all
three sentiment polarity classes: Positive, Negative and Other. The most notable
improvement was recorded in the classification of positive sentiments, where recall
improved by 45 percentage points in both experiments without compromising precision.
The statistical significance of
these improvements was confirmed by McNemar's test.
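McNemar's test considers only the discordant pairs, i.e. the items on which exactly one of the two classifiers is correct. A minimal sketch with illustrative counts (not the paper's data):

```python
# Minimal McNemar's test on paired classifier outputs (illustrative counts,
# not the data from the paper). Only discordant pairs matter:
# b = baseline correct / our approach wrong, c = the reverse.
from scipy.stats import chi2

def mcnemar(b, c):
    stat = (abs(b - c) - 1) ** 2 / (b + c)  # with continuity correction
    return stat, chi2.sf(stat, df=1)        # chi-square, 1 degree of freedom

stat, p = mcnemar(b=15, c=60)
print(f"chi2 = {stat:.2f}, p = {p:.4g}")    # small p => significant difference
```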
Irena Spasić, Kate Button, Anna Divoli, Satyam Gupta, Tamas Pataky,
Diego Pizzocaro, Alun Preece, Robert van Deursen and Chris Wilson (2015)
TRAK application suite: A web-based intervention for delivering standard care for the
rehabilitation of knee conditions. JMIR Research Protocols,
Vol. 4, No. 4, e122
Background: Standard care for the rehabilitation of knee conditions
involves an exercise program and information provision. Current methods of rehabilitation
delivery struggle to keep up with large volumes of patients and the length of treatment
required to maximize recovery. Therefore, the development of novel interventions
to support self-management is strongly recommended. Such interventions need to include
information provision, goal setting, monitoring, feedback and support groups, but the
most effective methods of their delivery are poorly understood. The Internet provides a
medium for intervention delivery with considerable potential for meeting these needs.
Objective: The primary aim of this study was to demonstrate feasibility
of a web-based application and to conduct a preliminary review of its practicability as
part of a complex medical intervention in the rehabilitation of knee disorders. This
paper describes the development, implementation and usability of such an application.
Methods: The TRAK application suite was developed by an interdisciplinary
team of healthcare professionals and researchers, computer scientists and application
developers. The key functionality of the application includes information provision,
a three-step exercise program based on standard care for the rehabilitation of knee
conditions, self-monitoring with visual feedback and a virtual support group. Two types
of stakeholders (patients and physiotherapists) were recruited for the usability study.
The usability questionnaire was used to collect both qualitative and quantitative
information on computer and Internet usage, task completion, and subjective user satisfaction.
Results: A total of 16 patients and 15 physiotherapists participated in
the usability study. Based on the System Usability Scale, the TRAK application has higher
perceived usability than 70% of systems. Both patients and physiotherapists agreed that
the given web-based approach would facilitate communication, provide information, help
recall information, improve understanding, enable exercise progression and support
self-management in general. The web application was found to be easy to use and user
satisfaction was very high. The TRAK application suite can be accessed at
Conclusion: The usability study suggests that a web-based intervention
is feasible and acceptable in supporting self-management of knee conditions.
Keywords: Internet; social media; web applications; mobile applications;
usability testing; knee; rehabilitation; exercise; self-management
Kate Button, Paulien Roos, Irena Spasić, Paul Adamson and Robert van Deursen (2015)
The clinical effectiveness of self-care interventions with an exercise component to manage knee conditions:
A systematic review. The Knee, in press
Objective: Treatment for musculoskeletal knee conditions should include
techniques to support self-management and exercise based interventions but the most
beneficial techniques and effective way to combine self-care and exercise are unknown.
Therefore the aim was to evaluate the clinical effectiveness of self-care programs that
include an exercise component for knee musculoskeletal conditions.
Methods: A keyword search of Medline, Cinahl, Amed, Psychinfo, Web
of Science and Cochrane databases was conducted up until July 2014. Two reviewers
independently assessed manuscript eligibility against the inclusion/exclusion
criteria. Study quality was assessed using the Downs and Black quality assessment tool
and the Cochrane Risk of Bias tool. Data were extracted on self-care and exercise
intervention type, control intervention, participants, length of follow-up, outcome
measures and main findings.
Results: From the 7,392 studies identified through the keyword search, the
titles and abstracts of 5,498 studies were screened. The full-text manuscripts of 106
articles were retrieved to evaluate their eligibility. Twenty-one manuscripts met
the inclusion/exclusion criteria.
Conclusion: The treatment potential of combining self-care and
exercise interventions has not been maximised, owing to limitations in study design
and failure to adequately define intervention content. Potentially the most
beneficial self-care treatment components are training self-management skills,
information delivery and goal setting. Exercise treatment components could be
strengthened by better attention to dose and progression. Technology should be
considered to streamline delivery, as high levels of supervision are not required.
More emphasis is required on using self-care and exercise programs for chronic
condition prevention in addition to chronic condition management.
Purpose: This paper reviews the research literature on text mining (TM)
with the aim to find out (1) which cancer domains have been the subject of TM efforts,
(2) which knowledge resources can support TM of cancer-related information and (3) to
what extent systems that rely on knowledge and computational methods can convert text
data into useful clinical information. These questions were used to determine the current
state of the art in this particular strand of TM and suggest future directions in TM
development to support cancer research.
Methods: A review of the research on TM of cancer-related information
was carried out. A literature search was conducted on the Medline database as well as
IEEE Xplore and ACM digital libraries to address the interdisciplinary nature of such
research. The search results were supplemented with the literature identified through
Results: A range of studies have proven the feasibility of TM for
extracting structured information from clinical narratives such as those found in
pathology or radiology reports. In this article, we provide a critical overview of
the current state of the art for TM related to cancer. The review highlighted a
strong bias towards symbolic methods, e.g. named entity recognition (NER) based on
dictionary lookup and information extraction (IE) relying on pattern matching. The
F-measure of NER ranges between 80% and 90%, while that of IE for simple tasks is
in the high 90s. To further improve the performance, TM approaches need to deal
effectively with idiosyncrasies of the clinical sublanguage such as non-standard
abbreviations as well as a high degree of spelling and grammatical errors. This
requires a shift from rule-based methods to machine learning following the success
of similar trends in biological applications of TM. Machine learning approaches
require large training datasets, but clinical narratives are not readily available
for TM research due to privacy and confidentiality concerns. This issue remains
the main bottleneck for progress in this area. In addition, there is a need for a
comprehensive cancer ontology that would enable semantic representation of textual
information found in narrative reports.
Keywords: Cancer, Natural language processing, Data mining, Electronic medical records
Christian Bannister, Craig Currie, Alun Preece and Irena Spasić (2014)
Automatic development of clinical prediction models with genetic programming:
A case study in cardiovascular disease. Value in Health,
Vol. 17, No. 3, pp. A200-A201
Simon Moore, Claire O'Brien, Mohammed Fasihul Alam, David Cohen, Kerenza Hood, Chao Huang, Laurence Moore, Simon Murphy, Rebecca Playle, Vaseekaran Sivarajasingam, Irena Spasić, Anne Williams and Jonathan Shepherd
All-Wales licensed premises intervention (AWLPI): a randomised controlled trial to reduce alcohol-related violence.
BMC Public Health, Vol. 14, 21
Background Alcohol-related violence in and in the vicinity of
licensed premises continues to place a considerable burden on the United Kingdom's
(UK) health services. Robust interventions targeted at licensed premises are
therefore required to reduce the costs of alcohol-related harm. Previous
evaluations of interventions in licensed premises have a number of methodological
limitations and none have been conducted in the UK. The aim of the trial was to
determine the effectiveness of the Safety Management in Licensed Environments
intervention designed to reduce alcohol-related violence in licensed premises,
delivered by Environmental Health Officers, under their statutory authority to
intervene in cases of violence in the workplace.
Methods A national randomised controlled trial, with licensed
premises as the unit of allocation. Premises were identified from all 22 Local
Authorities in Wales. Eligible premises were those with identifiable violent
incidents on premises, using police recorded violence data. Premises were
allocated to intervention or control by optimally balancing by Environmental
Health Officer capacity in each Local Authority, number of violent incidents
in the 12 months leading up to the start of the project and opening hours.
The primary outcome measure is the difference in frequency of violence between
intervention and control premises over a 12 month follow-up period, based on a
recurrent event model. The trial incorporates an embedded process evaluation to
assess intervention implementation, fidelity, reach and reception, and to
interpret outcome effects, as well as investigate its economic impact.
Discussion The results of the trial will be applicable to all
statutory authorities directly involved with managing violence in the night time
economy and will provide the first formal test of Health and Safety policy in
this environment. If successful, opportunities for replication and generalisation
will be considered.
Christian Bannister, Chris Poole, Sara Jenkins-Jones, Christopher Morgan, Glyn Elwyn, Irena Spasić and Craig Currie
External validation of the UKPDS risk engine in incident type 2 diabetes: a need for new type 2 diabetes-specific risk equations.
Diabetes Care, Vol. 37, No. 2, pp. 537-545
Objective To evaluate the performance of the United Kingdom
Prospective Diabetes Study (UKPDS) Risk Engine for predicting the 10-year risk
of cardiovascular disease endpoints in an independent cohort of UK patients
newly diagnosed with type 2 diabetes.
Research Design and Methods This was a retrospective cohort
study using routine healthcare data collected between April 1998 and October
2011 from around 350 UK primary-care practices contributing to the Clinical
Practice Research Datalink (CPRD). Participants comprised 79,966 patients
aged between 35 and 85 years (388,269 person-years) with 4,984 cardiovascular
events. Four outcomes were evaluated: first diagnosis of coronary heart disease
(CHD), stroke, fatal CHD, and fatal stroke.
Results Accounting for censoring, the observed versus predicted
ten-year event rates were as follows: CHD 6.1% vs 16.5%, fatal CHD 1.9% vs 10.1%,
stroke 7.0% vs 10.1%, and fatal stroke 1.7% vs 1.6%. The UKPDS-RE
showed moderate discrimination for all four outcomes, with the concordance-index
values ranging from 0.65 to 0.78.
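The concordance index estimates the probability that, of two comparable patients, the one with the higher predicted risk experiences the event first. A simplified sketch assuming fully observed event times (Harrell's C additionally handles censored pairs):

```python
# Simplified concordance index over fully observed event times
# (ignores censoring, which Harrell's C handles via comparable pairs).
from itertools import combinations

def c_index(predicted_risk, event_time):
    concordant = ties = comparable = 0
    for i, j in combinations(range(len(event_time)), 2):
        if event_time[i] == event_time[j]:
            continue                      # no ordering information
        comparable += 1
        first = i if event_time[i] < event_time[j] else j
        other = j if first == i else i
        if predicted_risk[first] > predicted_risk[other]:
            concordant += 1
        elif predicted_risk[first] == predicted_risk[other]:
            ties += 1
    return (concordant + 0.5 * ties) / comparable

print(c_index([0.9, 0.4, 0.7, 0.2], [2.0, 8.0, 5.0, 9.0]))  # 1.0: perfectly ranked
```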
Conclusions The UKPDS stroke equations showed calibration
ranging from poor to moderate; however, the CHD equations showed poor
calibration and considerably overestimated CHD risk. There is a need for
revised risk equations in type 2 diabetes.
Background The increasing amount of textual information in
biomedicine requires effective term recognition methods to identify textual
representations of domain-specific concepts as the first step toward
automating its semantic interpretation. The dictionary look-up approaches
may not always be suitable for dynamic domains such as biomedicine or the
newly emerging types of media such as patient blogs, the main obstacles being
the use of non-standardised terminology and a high degree of term variation.
Results In this paper, we describe FlexiTerm, a method
for automatic term recognition from a domain-specific corpus, and evaluate
its performance against five manually annotated corpora. FlexiTerm performs
term recognition in two steps: linguistic filtering is used to select term
candidates followed by calculation of termhood, a frequency-based measure
used as evidence to qualify a candidate as a term. In order to improve the
quality of termhood calculation, which may be affected by the term variation
phenomena, FlexiTerm uses a range of methods to neutralise the main sources
of variation in biomedical terms. It manages syntactic variation by processing
candidates using a bag-of-words approach. Orthographic and morphological
variations are dealt with using stemming in combination with lexical and
phonetic similarity measures. The method was evaluated on five biomedical
corpora. The highest values for precision (94.56%), recall (71.31%) and
F-measure (81.31%) were achieved on a corpus of clinical notes.
FlexiTerm is an open-source software tool for automatic term recognition.
It incorporates a simple term variant normalisation method. The method
proved to be more robust than the baseline against less formally structured
texts, such as those found in patient blogs or medical notes. The software
can be downloaded freely at
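A minimal sketch of the two-step idea, using a bag-of-stems normalisation and pooled frequency as a stand-in for termhood (hypothetical candidates; FlexiTerm's actual termhood and similarity measures are more elaborate):

```python
# Toy sketch of term-variant normalisation and frequency-based termhood
# (hypothetical data, not the FlexiTerm implementation). Candidates that
# normalise to the same bag of stems are pooled so that the frequencies
# of their variants reinforce one another.
from collections import Counter
from nltk.stem import PorterStemmer

STOP = {"of", "the", "a", "an"}
stemmer = PorterStemmer()

def normalise(candidate):
    # bag-of-stems: stem each token, drop stop words, ignore word order
    tokens = [t for t in candidate.lower().split() if t not in STOP]
    return frozenset(stemmer.stem(t) for t in tokens)

candidates = ["anterior cruciate ligament", "ligament anterior cruciate",
              "joint effusion", "effusion of the joint", "cruciate ligaments"]
counts = Counter(normalise(c) for c in candidates)
for bag, termhood in counts.most_common():
    print(sorted(bag), "termhood ~", termhood)
```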
In this paper we discuss the design and development of TRAK (Taxonomy for RehAbilitation of Knee conditions), an ontology that formally models information relevant for the rehabilitation of knee conditions. TRAK provides the framework that can be used to collect coded data in sufficient detail to support epidemiologic studies so that the most effective treatment components can be identified, new interventions developed and the quality of future randomized control trials improved to incorporate a control intervention that is well defined and reflects clinical practice. TRAK follows design principles recommended by the Open Biomedical Ontologies (OBO) Foundry. TRAK uses the Basic Formal Ontology (BFO) as the upper-level ontology and refers to other relevant ontologies such as Information Artifact Ontology (IAO), Ontology for General Medical Science (OGMS), Phenotype And Trait Ontology (PATO), etc. TRAK is orthogonal to other bio-ontologies and represents domain-specific knowledge about treatments and modalities used in rehabilitation of knee conditions. Definitions of typical exercises used as treatment modalities are supported with appropriate illustrations, which can be viewed in the OBO-Edit ontology editor. The vast majority of other classes in TRAK are cross-referenced to the Unified Medical Language System (UMLS) to facilitate future integration with other terminological sources. TRAK is implemented in OBO, a format widely used by the OBO community. TRAK is available for download from http://www.cs.cf.ac.uk/trak. In addition, its public release can be accessed through BioPortal, where it can be browsed, searched and visualized.
Kieran Smallbone, Hanan Messiha, Kathleen Carroll, Catherine Winder, Naglis Malys,
Warwick Dunn, Ettore Murabito, Neil Swainston, Joseph Dada, Farid Khan, Pinar Pir,
Evangelos Simeonidis, Irena Spasić, Jill Wishart, Dieter Weichart,
Neil Hayes, Daniel Jameson, David Broomhead, Stephen Oliver, Simon Gaskell,
John McCarthy, Norman Paton, Hans Westerhoff, Douglas Kell and Pedro Mendes
A model of yeast glycolysis based on a consistent kinetic characterisation of all its enzymes.
FEBS Letters, Vol. 587, No. 17, pp. 2832-2841
We present an experimental and computational pipeline for the generation of kinetic models
of metabolism, and demonstrate its application to glycolysis in Saccharomyces cerevisiae.
Starting from an approximate mathematical model, we employ a "cycle of knowledge"
strategy, identifying the steps with most control over flux. Kinetic parameters of the
individual isoenzymes within these steps are measured experimentally under a standardised
set of conditions. Experimental strategies are applied to establish a set of in vivo
concentrations for isoenzymes and metabolites. The data are integrated into a mathematical
model that is used to predict a new set of metabolite concentrations and reevaluate the control
properties of the system. This bottom-up modelling study reveals that control over the
metabolic network most directly involved in yeast glycolysis is more widely distributed than previously thought.
Keywords: glycolysis, systems biology, enzyme kinetics, isoenzymes, modelling
The authors present a system developed for the 2011 i2b2 Challenge on Sentiment Classification, whose aim was to automatically classify sentences in suicide notes using a scheme of 15 topics, mostly emotions. The system combines machine learning with a rule-based methodology. The features used to represent the problem were based on lexico-semantic properties of individual words in addition to regular expressions used to represent patterns of word usage across different topics. A naive Bayes classifier was trained using the features extracted from the training data consisting of 600 manually annotated suicide notes. Classification was then performed using the naive Bayes classifier as well as a set of pattern-matching rules. The classification performance was evaluated against a manually prepared gold standard consisting of 300 suicide notes, in which 1,091 out of a total of 2,037 sentences were associated with a total of 1,272 annotations. The competing systems were ranked using the micro-averaged F-measure as the primary evaluation metric. Our system achieved the F-measure of 53% (with 55% precision and 52% recall), which was significantly better than the average performance of 48.75% achieved by the 26 participating teams.
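A compact sketch of such a hybrid design, combining a naive Bayes classifier with high-precision regular-expression rules that take precedence (toy sentences and patterns, not the actual feature set or annotation scheme):

```python
# Toy hybrid classifier: a naive Bayes model over bag-of-words features,
# with regular-expression rules taking precedence (illustrative sentences
# and patterns, not those used in the challenge).
import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train = ["i love you all so much", "forgive me for what i have done",
         "give my watch to my brother", "i love my family",
         "please forgive me", "my savings go to my sister"]
labels = ["love", "forgiveness", "instructions", "love",
          "forgiveness", "instructions"]

RULES = [(re.compile(r"\bforgive\b"), "forgiveness"),
         (re.compile(r"\bgive\b.*\bto\b"), "instructions")]

vec = CountVectorizer()
model = MultinomialNB().fit(vec.fit_transform(train), labels)

def classify(sentence):
    for pattern, topic in RULES:        # rules take precedence over the model
        if pattern.search(sentence):
            return topic
    return model.predict(vec.transform([sentence]))[0]

print(classify("forgive my weakness"))      # forgiveness (rule)
print(classify("i love every one of you"))  # love (naive Bayes)
```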
Peter Li, Joseph Dada, Daniel Jameson, Irena Spasić,
Neil Swainston, Kathleen Carroll, Warwick Dunn, Farid Khan, Hanan Messiha,
Evangelos Simeonidis, Dieter Weichart, Catherine Winder, David Broomhead,
Carole Goble, Simon Gaskell, Douglas Kell, Hans Westerhoff, Pedro Mendes and Norman Paton
Systematic integration of experimental data and models in systems biology.
BMC Bioinformatics, Vol. 11, 582
Background: The behaviour of biological systems can be deduced from their mathematical models. However, multiple sources of data in diverse forms are required in the construction of a model in order to define its components and their biochemical reactions, and corresponding parameters. Automating the assembly and use of systems biology models is dependent upon data integration processes involving the interoperation of data and analytical resources.
Taverna workflows have been developed for the automated assembly of quantitative parameterised metabolic networks in the Systems Biology Markup Language (SBML). An SBML model is built in a systematic fashion by the workflows, which start with the construction of a qualitative network using data from a MIRIAM-compliant genome-scale model of yeast metabolism. This is followed by parameterisation of the SBML model with experimental data from two repositories, the SABIO-RK enzyme kinetics database and a database of quantitative experimental results. The models are then calibrated and simulated in workflows that call out to COPASIWS, the web service interface to the COPASI software application for analysing biochemical networks. These systems biology workflows were evaluated for their ability to construct a parameterised model of yeast glycolysis.
Distributed information about metabolic reactions that have been described to MIRIAM standards enables the automated assembly of quantitative systems biology models of metabolic networks based on user-defined criteria. Such data integration processes can be implemented as Taverna workflows to provide a rapid overview of the components and their relationships within a biochemical system.
Objective: We present a system developed for the 2009 i2b2 Challenge in Natural Language Processing for Clinical Data, whose aim was to automatically extract certain information about medications used by a patient from his/her medical report. The aim was to extract the following information for each medication: name, dosage, mode/route, frequency, duration, and reason.
Design: The system implements a rule-based methodology, which exploits typical morphological, lexical, syntactic and semantic features of the targeted information. These features were acquired from the training dataset and public resources such as the UMLS and relevant web pages. Information extracted via pattern matching was combined together using context-sensitive heuristic rules.
Measurements: The system was applied to a set of 547 previously unseen discharge summaries, and the extracted information was evaluated against a manually prepared gold standard consisting of 251 documents. The overall ranking of the participating teams was obtained using the micro-averaged F-measure as the primary evaluation metric.
Results: The implemented method achieved the micro-averaged F-measure of 81% (with 86% precision and 77% recall), which ranked our system third in the challenge. The significance tests revealed our system's performance to be not significantly different from that of the second ranked system. Relative to other systems, our system achieved the best F-measure for the extraction of duration (53%) and reason (46%).
Conclusion: Based on the F-measure, the performance achieved (81%) was in line with the initial agreement between human annotators (82%), indicating that such a system may greatly facilitate the process of extracting relevant information from medical records by providing a solid basis for a manual review process.
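An illustrative fragment in the spirit of the pattern-matching approach described above (hypothetical patterns, not the actual rules of the system):

```python
# Illustrative medication-slot extraction by pattern matching
# (hypothetical patterns, not the system's actual rule set).
import re

PATTERNS = {
    "dosage":    re.compile(r"\b\d+(?:\.\d+)?\s*(?:mg|mcg|g|ml|units?)\b", re.I),
    "mode":      re.compile(r"\b(?:p\.?o\.?|i\.?v\.?|orally|intravenously?)(?!\w)", re.I),
    "frequency": re.compile(r"\b(?:b\.?i\.?d\.?|t\.?i\.?d\.?|once daily|twice daily)(?!\w)", re.I),
}

def extract_medication(line):
    slots = {"name": line.split()[0]}   # naive: assume the drug name comes first
    for slot, pattern in PATTERNS.items():
        match = pattern.search(line)
        slots[slot] = match.group(0) if match else None
    return slots

print(extract_medication("Metoprolol 50 mg p.o. twice daily for hypertension"))
# {'name': 'Metoprolol', 'dosage': '50 mg', 'mode': 'p.o.', 'frequency': 'twice daily'}
```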
Motivation: Research in systems biology is carried out
through a combination of experiments and models. Several data standards
have been adopted for representing models (SBML) and various types of
relevant experimental data (such as FuGE and those of the Proteomics
Standards Initiative). However, until now, there has been no standard
way to associate a model and its entities to the corresponding data
sets, or vice versa. Such a standard would provide a means to represent
computational simulation results as well as to frame experimental data
in the context of a particular model. Target applications include
model-driven data analysis, parameter estimation, and sharing and
archiving model simulations.
Results: We propose the Systems Biology Results Markup
Language (SBRML), an XML-based language which associates a model with
several data sets. Each data set is represented as a series of values
associated with model variables, and their corresponding parameter
values. SBRML provides a flexible way of indexing the results to
model parameter values, which supports both spreadsheet-like data
and multidimensional data cubes. We present and discuss several
examples of SBRML usage in applications such as enzyme kinetics,
microarray gene expression, and various types of simulation results.
Norman Paton and
KiPar, a tool for systematic information retrieval regarding parameters for kinetic modelling of yeast metabolic pathways. Bioinformatics,
Vol. 25, No. 11, pp. 1404-1411
Most experimental evidence on kinetic parameters is buried in the literature,
whose manual searching is complex, time-consuming and partial. These shortcomings
become particularly acute in systems biology, where these parameters need to be
integrated into detailed, genome-scale, metabolic models. These problems are
addressed by KiPar, a dedicated information retrieval system designed to
facilitate access to the literature relevant for kinetic modelling of a given
metabolic pathway in yeast. Searching for kinetic data in the context of an
individual pathway offers modularity as a way of tackling the complexity of
developing a full metabolic model. It is also suitable for large-scale mining,
since multiple reactions and their kinetic parameters can be specified in a
single search request, rather than one reaction at a time, which is unsuitable
given the size of genome-scale models.
We developed an integrative approach, combining public data and software resources
for the rapid development of large-scale text mining tools targeting complex
biological information. The user supplies input in the form of identifiers used
in relevant data resources to refer to the concepts of interest, e.g. EC numbers,
GO and SBO identifiers. By doing so, the user is freed from providing any other
knowledge or terminology concerned with these concepts and their relations, since
they are retrieved from these and cross-referenced resources automatically. The
terminology acquired is used to index the literature by mapping concepts to their
synonyms, and then to textual documents mentioning them. The indexing results and
the previously acquired knowledge about relations between concepts are used to
formulate complex search queries aiming at documents relevant to the user's
information needs. The conceptual approach is demonstrated in the implementation
of KiPar. Evaluation reveals that KiPar performs better than a Boolean
search. The precision achieved for abstracts (60%) and full-text articles (48%)
is considerably better than the baseline precision (44% and 24% respectively).
The baseline recall is improved by 36% for abstracts and by 100% for full text.
It appears that full-text articles are a much richer source of information on
kinetic data than are their abstracts. Finally, the combined results for
abstracts and full text compared to curated literature provide high values for
relative recall (88%) and novelty ratio (92%), suggesting that the system is
able to retrieve a high proportion of new documents.
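For convenience, the last two figures are stated here in their usual form, where R_sys is the set of relevant documents retrieved by the system, R_all the pooled set of relevant documents across all compared sources, and K the set of documents already known from the curated literature (standard definitions assumed, not quoted from the paper):

```latex
\text{relative recall} = \frac{|R_{\mathrm{sys}}|}{|R_{\mathrm{all}}|},
\qquad
\text{novelty ratio} = \frac{|R_{\mathrm{sys}} \setminus K|}{|R_{\mathrm{sys}}|}
```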
We present a system developed for the Challenge in Natural Language Processing
for Clinical Data - the i2b2 obesity challenge, whose aim was to automatically
identify the status of obesity and 15 related co-morbidities in patients using
their clinical discharge summaries. The challenge consisted of two tasks,
textual and intuitive. The textual task was to identify explicit references to
the diseases, whereas the intuitive task focused on the prediction of the disease
status when the evidence was not explicitly asserted. We assembled a set of
resources to lexically and semantically profile the diseases and their associated
symptoms, treatments, etc. These features were explored in a hybrid text mining
approach, which combined dictionary look-up, rule-based and machine-learning
methods. The implemented method achieved the macro-averaged F-measure of 81% for
the textual task (which was the highest achieved in the challenge) and 63% for
the intuitive task (ranked 7th out of 28 teams - the highest was 66%). The
micro-averaged F-measure showed an average accuracy of 97% for textual and 96%
for intuitive annotations. The performance achieved was in line with the agreement
between human annotators, indicating the potential of text mining for accurate
and efficient prediction of disease statuses from clinical discharge summaries.
Marie Brown, Warwick Dunn, Paul Dobson, Yogendra Patel, Cate Winder, Sue Francis-McIntyre,
Paul Begley, Kathleen Carroll, David Broadhurst, Andy Tseng, Neil Swainston,
Royston Goodacre and
Mass spectrometry tools and metabolite-specific databases for molecular identification in metabolomics. The Analyst, Vol. 134, No. 7, pp. 1322-1332
The chemical identification of mass spectrometric signals in metabolomic applications
is important to provide conversion of analytical data to biological knowledge about
metabolic pathways. The complexity of electrospray mass spectrometric data acquired
from a range of samples (serum, urine, yeast intracellular extracts, yeast metabolic
footprints, placental tissue metabolic footprints) has been investigated and has
defined the frequency of different ion types routinely detected. Although some ion
types were expected (protonated and deprotonated peaks, isotope peaks, multiply charged
peaks) others were not expected (sodium formate adduct ions). In parallel, the Manchester
Metabolomics Database (MMD) has been constructed with data from genome scale metabolic
reconstructions, HMDB, KEGG, Lipid Maps, BioCyc and DrugBank to provide knowledge on 42,687
endogenous and exogenous metabolite species. The combination of accurate mass data for a
large collection of metabolites, theoretical isotope abundance data and knowledge of the
different ion types detected provided a greater number of electrospray mass spectrometric
signals which were putatively identified and with greater confidence in the samples studied.
To provide definitive identification metabolite-specific mass spectral libraries for UPLC-MS
and GC-MS have been constructed for 1,065 commercially available authentic standards. The
MMD data are available at http://dbkgroup.org/MMD/.
Markus J. Herrgård, Neil Swainston, Paul Dobson, Warwick B. Dunn, K. Yalçin Arga, Mikko Arvas,
Nils Blüthgen, Simon Borger, Roeland Costenoble, Matthias Heinemann, Michael Hucka, Peter Li,
Wolfram Liebermeister, Monica L. Mo, Ana Paula Oliveira, Dina Petranović, Stephen Pettifer,
Evangelos Simeonidis, Kieran Smallbone, Irena Spasić, Dieter Weichart,
Roger Brent, David S. Broomhead, Hans V. Westerhoff, Betül Kirdar, Merja Penttilä, Edda Klipp,
Bernhard Ø. Palsson, Uwe Sauer, Stephen G. Oliver, Pedro Mendes, Jens Nielsen and Douglas B. Kell (2008)
A consensus yeast metabolic network reconstruction obtained from a community approach to systems biology.
Nature Biotechnology, Vol. 26, No. 10, pp. 1155-1160
Genomic data allow the large-scale manual or semi-automated assembly of
metabolic network reconstructions, which provide highly curated organism-specific
knowledge bases. Although several genome-scale network reconstructions describe
Saccharomyces cerevisiae metabolism, they differ in scope and content,
and use different terminologies to describe the same chemical entities. This makes
comparisons between them difficult and underscores the desirability of a consolidated
metabolic network that collects and formalizes the 'community knowledge' of yeast
metabolism. We describe how we have produced a consensus metabolic network
reconstruction for S. cerevisiae. In drafting it, we placed special
emphasis on referencing molecules to persistent databases or using
database-independent forms, such as SMILES or InChI strings, as this
permits their chemical structure to be represented unambiguously and
in a manner that permits automated reasoning. The reconstruction is
readily available via a publicly accessible database and in the
Systems Biology Markup Language.
It can be maintained as a resource that serves as a common denominator for
studying the systems biology of yeast. Similar strategies should benefit
communities studying genome-scale metabolic networks of other organisms.
Douglas Kell and
Facilitating the development of controlled vocabularies for metabolomics technologies with text mining.
BMC Bioinformatics, Vol. 9, Suppl. 5, S5
Many bioinformatics applications rely on controlled vocabularies or ontologies to consistently
interpret and seamlessly integrate information scattered across public resources. Experimental
data sets from metabolomics studies need to be integrated with one another, but also with data
produced by other types of omics studies in the spirit of systems biology, hence the pressing
need for vocabularies and ontologies in metabolomics. However, it is time-consuming and
non-trivial to construct these resources manually.
We describe a methodology for rapid development of controlled vocabularies, a study originally
motivated by the needs for vocabularies describing metabolomics technologies. We present case
studies involving two controlled vocabularies (for nuclear magnetic resonance spectroscopy and
gas chromatography) whose development is currently underway as part of the Metabolomics Standards
Initiative. The initial vocabularies were compiled manually, providing a total of 243 and 152 terms.
A total of 5,699 and 2,612 new terms were acquired automatically from the literature. The analysis
of the results showed that full-text articles (especially the Materials and Methods sections) are
the major source of technology-specific terms as opposed to paper abstracts.
We suggest a text mining method for efficient corpus-based term acquisition as a way of rapidly
expanding a set of controlled vocabularies with the terms used in the scientific literature. We
adopted an integrative approach, combining relatively generic software and data resources for
time- and cost-effective development of a text mining tool for expansion of controlled vocabularies
across various domains, as a practical alternative to both manual term collection and tailor-made
named entity recognition methods.
Andrew Tseng and
A GC-TOF-MS study of the stability of serum and urine metabolomes during the UK Biobank sample collection and preparation protocols.
International Journal of Epidemiology, Vol. 37, pp. i23-i30
Background: The stability of mammalian serum and urine in large metabolomic investigations is essential for accurate, valid and reproducible studies. The stability of mammalian serum and urine, either processed immediately by freezing at -80°C or stored at 4°C for 24 hours before being frozen, was compared in a pilot metabolomic study of samples from 40 separate healthy volunteers.
Methods: Metabolic profiling with GC-TOF-MS was performed for serum and urine samples collected from 40 volunteers and stored at -80°C or 4°C for 24 hours before being frozen. Subsequent Kruskal-Wallis and Principal Components Analysis methods were used to assess whether metabolomic differences were detected between samples stored at 4°C for 0 or 24 hours.
Results: More than 700 unique metabolite peaks were detected, with over 200 metabolite peaks detected in any one sample. Principal Components Analysis (PCA) of serum and urine data showed that the variance associated with the replicate analysis per sample (analytical variance) was of the same magnitude as the variance observed between samples stored at 4°C or -80°C for 24 hours (biological variance). From a functional point of view the metabolomic composition of samples did not change in a statistically significant manner when stored under the two different conditions.
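A sketch of the underlying comparison on synthetic data: project the profiles with PCA, then compare the paired storage-effect variance with the between-sample variance on the component scores (illustrative simulation, not the study's data):

```python
# Sketch of the variance comparison on synthetic data: if the paired
# storage-effect variance on the PCA scores is no larger than the variance
# between samples, the storage condition had little detectable effect.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n_samples, n_peaks = 40, 200
base = rng.normal(size=(n_samples, n_peaks))             # per-volunteer profiles
frozen = base + rng.normal(scale=0.1, size=base.shape)   # frozen at -80 C immediately
chilled = base + rng.normal(scale=0.1, size=base.shape)  # held at 4 C for 24 h first

scores = PCA(n_components=2).fit_transform(np.vstack([frozen, chilled]))
paired_effect = scores[:n_samples] - scores[n_samples:]  # same volunteer, two conditions
print("storage-effect variance:", paired_effect.var(axis=0))
print("between-sample variance:", scores.var(axis=0))
```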
Conclusions: Based on this small pilot study, the UK Biobank sampling, transport and fractionation protocols are considered suitable to provide samples which can produce scientifically robust and valid data in metabolomic studies.
David Broadhurst, Sasalu Deepak, Mamta Buch, Garry McDowell,
Douglas Kell and
Serum metabolomics reveals many novel metabolic markers of heart failure, including pseudouridine and 2-oxoglutarate.
Metabolomics, Vol. 3, No. 4, pp. 413-426
There is intense interest in the identification of novel biomarkers which
improve the diagnosis of heart failure. Serum samples from 52 patients with
systolic heart failure (EF<40% plus signs and symptoms of failure) and 57
controls were analyzed by gas chromatography - time of flight - mass
spectrometry and the raw data reduced to 272 statistically robust
metabolite peaks. 38 peaks showed a significant difference between
case and control (p < 5×10⁻⁵). Two such metabolites
were pseudouridine, a modified nucleotide present in t- and rRNA and a
marker of cell turnover, as well as the tricarboxylic acid cycle intermediate
2-oxoglutarate. Three further compounds were also excellent
discriminators between patients and controls: 2-hydroxy-2-methylpropanoic
acid, erythritol and 2,4,6-trihydroxypyrimidine. These findings demonstrate
the power of data-driven metabolomics approaches to identify such markers.
Mark Viant and
the Ontology Working Group Members
Metabolomics Standards Initiative - Ontology Working Group: Work in progress.
Metabolomics, Vol. 3, No. 3, pp. 249-256
In this article we present the activities of the Ontology Working Group (OWG)
under the Metabolomics Standards Initiative (MSI) umbrella. Our endeavour aims
to synergise the work of several communities, where independent activities are
underway to develop terminologies and databases for metabolomics investigations.
We have joined forces to rise to the challenges associated with interpreting and
integrating experimental process and data across disparate sources (software and
databases, private and public). Our focus is to support the activities of the
other MSI working groups by developing a common semantic framework to enable
metabolomics-user communities to consistently annotate the experimental process
and to enable meaningful exchange of datasets. Our work is accessible via a public
webpage and a draft ontology has been posted under the Open Biological Ontology
umbrella. At the very outset, we have agreed to minimize duplications across
omics domains through extensive liaisons with other communities under the OBO
Foundry. This is work in progress and we welcome new participants willing to
volunteer their time and expertise to this open effort.
Stephen Oliver and
MeMo: a hybrid SQL/XML approach to metabolomic data management for functional genomics.
BMC Bioinformatics, Vol. 7, 281
The genome sequencing projects have shown our limited knowledge regarding gene function, e.g. S. cerevisiae has 5,000-6,000 genes, of which nearly 1,000 have an uncertain function. Their gross influence on the behaviour of the cell can be observed using large-scale metabolomic studies. The metabolomic data produced need to be structured and annotated in a machine-usable form to facilitate the exploration of the hidden links between the genes and their functions.
MeMo is a formal model for representing metabolomic data and the associated metadata. Two predominant platforms (SQL and XML) are used to encode the model. MeMo has been implemented as a relational database using a hybrid approach combining the advantages of the two technologies. It represents a practical solution for handling the sheer volume and complexity of the metabolomic data effectively and efficiently. The MeMo model and the associated software are available at http://dbkgroup.org/memo/.
The maturity of relational database technology is used to support efficient data processing. The scalability and self-descriptiveness of XML are used to simplify the relational schema and facilitate the extensibility of the model necessitated by the creation of new experimental techniques. Special consideration is given to data integration issues as part of the systems biology agenda. MeMo has been physically integrated and cross-linked to related metabolomic and genomic databases. Semantic integration with other relevant databases has been supported through ontological annotation. Compatibility with other data formats is supported by automatic conversion.
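A minimal sketch of the hybrid idea (not the actual MeMo schema): stable, frequently queried fields sit in relational columns, while variable technique-specific metadata travels as a self-describing XML fragment:

```python
# Minimal sketch of a hybrid SQL/XML design (not the actual MeMo schema):
# typed relational columns support efficient querying, while an XML column
# keeps extensible metadata without requiring schema changes.
import sqlite3
import xml.etree.ElementTree as ET

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE experiment (id INTEGER PRIMARY KEY, strain TEXT, metadata TEXT)")
meta = ("<metadata><instrument>GC-TOF-MS</instrument>"
        "<column_temp unit='C'>250</column_temp></metadata>")
con.execute("INSERT INTO experiment (strain, metadata) VALUES (?, ?)", ("BY4741", meta))

strain, xml_blob = con.execute("SELECT strain, metadata FROM experiment").fetchone()
instrument = ET.fromstring(xml_blob).findtext("instrument")  # parse the XML part
print(strain, instrument)   # BY4741 GC-TOF-MS
```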
A report on the third Genomes to Systems consortium conference, which portrayed
the breadth of the post-genome sciences including Genomics, Transcriptomics,
Proteomics, Metabolomics, Informatics, and integrative Systems Biology.
The volume of biomedical literature is increasing at such a rate that it is becoming difficult to locate, retrieve and manage the reported information without text mining, which aims to automatically distill information, extract facts, discover implicit links and generate hypotheses relevant to user needs. Ontologies, as conceptual models, provide the necessary framework for semantic representation of textual information. The principal link between text and an ontology is terminology, which maps terms to domain-specific concepts. In this article, we summarize different approaches in which ontologies have been used for text mining applications in biomedicine.
One element of classical systems analysis treats a system as a black or grey box, the inner structure and behaviour of which can be analysed and modelled by varying an internal or external condition, probing it from outside and studying the effect of the variation on the external observables. The result is an understanding of the inner make-up and workings of the system. The equivalent of this in biology is to observe what a cell or system excretes under controlled conditions - the 'metabolic footprint' or exometabolome - as this is readily and accurately measurable. Here, we review the principles, experimental approaches and scientific outcomes that have been obtained with this useful and convenient strategy.
Motivation: The sheer volume of textually described biomedical knowledge creates the need for natural language processing (NLP) applications in order to allow flexible and efficient access to relevant information. Specialised semantic networks (such as biomedical ontologies, terminologies or semantic lexicons) can significantly enhance these applications by supplying the necessary terminological information in a machine-readable form. Due to the explosive growth of bio-literature, new terms (representing newly identified concepts or variations of the existing terms) may not be explicitly described within the network and hence cannot be fully exploited by NLP applications. Linguistic and statistical clues can be used to extract many new terms from free text. The extracted terms still need to be correctly positioned relative to other terms in the network. Classification as a means of semantic typing represents the first step in updating a semantic network with new terms.
Results: The MaSTerClass system implements the case-based reasoning methodology for the classification of biomedical terms.
Availability: MaSTerClass is available at http://www.cbr-masterclass.org. It is distributed under an open source licence for educational and research purposes. The software requires Java, JWSDP, Ant, MySQL and X-hive to be installed and licences obtained separately where needed.
Irena Spasić and
A Metabolome pipeline: from concept to data to knowledge.
Metabolomics, Vol. 1, No. 1, pp. 39-51
Metabolomics, like other omics methods, produces huge datasets of biological variables, along with the necessary metadata. However, regardless of the form in which these are produced, they are merely the ground substance for assisting us in answering biological questions. In this short tutorial review and position paper we seek to set out some of the elements of 'best practice' in the optimal acquisition of such data, and in the means by which they may be turned into reliable knowledge. Many of these steps involve the solution of what amount to combinatorial optimization problems, and methods developed for these, especially those based on evolutionary computing, are proving valuable. This is done in terms of a 'pipeline' that goes from the design of good experiments, through instrumental optimization, data storage and manipulation, the chemometric data processing methods in common use, and the necessary means of validation and cross-validation for giving conclusions that are credible and likely to be robust when applied in comparable circumstances and to samples not used in their generation.
In this paper, we present an approach to term classification based on
verb selectional patterns (VSPs), where such a pattern is defined as a
set of semantic classes that could be used in combination with a given
domain-specific verb. VSPs have been automatically learnt based on the
information found in a corpus and an ontology in the biomedical domain.
Prior to the learning phase, the corpus is terminologically processed:
term recognition is performed by both looking up the dictionary of terms
listed in the ontology and applying the C/NC-value method for on-the-fly
term extraction. Subsequently, domain-specific verbs are automatically
identified in the corpus based on the frequency of occurrence and the
frequency of their co-occurrence with terms. VSPs are then learnt
automatically for these verbs. Two machine learning approaches are
presented. The first approach has been implemented as an iterative
generalisation procedure based on a partial order relation induced by
the domain-specific ontology. The second approach exploits the idea of
genetic algorithms. Once the VSPs are acquired, they can be used to
classify newly recognised terms co-occurring with domain-specific verbs.
Given a term, the most frequently co-occurring domain-specific verb is
selected. Its VSP is used to constrain the search space by focusing on
potential classes of the given term. A nearest-neighbour approach is then
applied to select a class from the constrained space of candidate classes.
The most similar candidate class is predicted for the given term. The
similarity measure used for this purpose combines contextual, lexical,
and syntactic properties of terms.
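A schematic rendering of this classification step, with toy VSPs, prototype terms and a token-overlap stand-in for the hybrid similarity measure (all names hypothetical):

```python
# Toy rendering of VSP-constrained term classification: the most frequent
# co-occurring domain verb selects a verb selectional pattern (VSP), which
# narrows the candidate classes; a nearest-neighbour step picks the winner.
from collections import Counter

# Hypothetical verb selectional patterns: verb -> admissible semantic classes
VSP = {"activate": {"protein", "gene"},
       "phosphorylate": {"protein"},
       "transcribe": {"gene"}}

def similarity(term, cls, prototypes):
    # stand-in for the hybrid contextual/lexical/syntactic measure:
    # here, simple token overlap with the class's prototype terms
    tokens = set(term.split())
    return max(len(tokens & set(p.split())) for p in prototypes[cls])

def classify(term, cooccurring_verbs, prototypes):
    verb = Counter(cooccurring_verbs).most_common(1)[0][0]
    candidates = VSP.get(verb, set(prototypes))  # VSP constrains the search space
    return max(candidates, key=lambda c: similarity(term, c, prototypes))

prototypes = {"protein": ["map kinase", "binding protein"],
              "gene": ["gene promoter", "coding gene"]}
print(classify("kinase inhibitor",
               ["phosphorylate", "activate", "phosphorylate"],
               prototypes))  # protein
```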
In this article we present an approach to the automatic discovery of term similarities, which may serve as a basis for a number of term-oriented knowledge mining tasks. The method for term comparison combines internal (lexical similarity) and two types of external criteria (syntactic and contextual similarities). Lexical similarity is based on sharing lexical constituents (i.e. term heads and modifiers). Syntactic similarity relies on a set of specific lexico-syntactic co-occurrence patterns indicating the parallel usage of terms (e.g. within an enumeration or within a term coordination/conjunction structure), while contextual similarity is based on the usage of terms in similar contexts. Such contexts are automatically identified by a pattern mining approach, and a procedure is proposed to assess their domain-specific and terminological relevance. Although automatically collected, these patterns are domain dependent and identify contexts in which terms are used. Different types of similarities are combined into a hybrid similarity measure, which can be tuned for a specific domain by learning optimal weights for individual similarities. The suggested similarity measure has been tested in the domain of biomedicine, and some experiments are presented.
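The combination step amounts to a tunable weighted sum of the component similarities; a sketch with hypothetical weights and toy component scorers:

```python
# Sketch of a tunable hybrid term similarity (hypothetical weights and toy
# component scorers): lexical, syntactic and contextual evidence combined
# linearly, with the weights learnable for a specific domain.
def hybrid_similarity(t1, t2, lexical, syntactic, contextual, w=(0.4, 0.3, 0.3)):
    scores = (lexical(t1, t2), syntactic(t1, t2), contextual(t1, t2))
    return sum(wi * si for wi, si in zip(w, scores))

def lexical(a, b):
    # fraction of shared lexical constituents (heads and modifiers)
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)

syntactic = lambda a, b: 1.0   # e.g. terms observed coordinated: "X and Y"
contextual = lambda a, b: 0.5  # e.g. overlap of domain-relevant context patterns

print(hybrid_similarity("nuclear receptor", "orphan nuclear receptor",
                        lexical, syntactic, contextual))  # ~0.72
```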
In this paper we present an overview of an integrated framework for terminology-driven mining from biomedical literature. The framework integrates the following components: automatic term recognition, term variation handling, acronym acquisition, automatic discovery of term similarities and term clustering. The term variant recognition is incorporated into the terminology recognition process by taking into account orthographical, morphological, syntactic, lexico-semantic and pragmatic term variations. In particular, we address acronyms as a common way of introducing term variants in biomedical papers. Term clustering is based on the automatic discovery of term similarities. We use a hybrid similarity measure, where terms are compared by using both internal and external evidence. The measure combines lexical, syntactic and contextual similarity. Experiments on terminology recognition and structuring performed on a corpus of biomedical abstracts are presented.
In this paper we describe TIMS, an integrated knowledge management system for the domain of molecular biology and biomedicine, in which terminology-driven literature mining, knowledge acquisition, knowledge integration, and XML-based knowledge retrieval are combined using tag information management and ontology inference. The system integrates automatic terminology acquisition, term variation management, hierarchical term clustering, tag-based information extraction, and ontology-based query expansion. TIMS supports introducing and combining different types of tags (linguistic and domain-specific, manual and automatic). Tag-based interval operations and a query language are introduced in order to facilitate knowledge acquisition and retrieval from XML documents. Through knowledge acquisition examples, we illustrate the way in which literature mining techniques can be utilised for knowledge discovery from documents.
Refereed book chapters:
Irena Spasić and Douglas Kell (2013)
Searching for kinetic parameters with KiPar.
In W. Dubitzky, O. Wolkenhauer, H. Yokota and K.-H. Cho (Eds.):
Encyclopedia of Systems Biology,
Definition KiPar is an Information Retrieval system
designed to facilitate access to the literature relevant for kinetic
modeling of metabolic pathways in yeast. Information supplied as user
input includes the enzymes catalyzing the reactions of interest and
the parameters whose values are required for kinetic modeling. The
output is produced as a list of documents (either abstracts from
PubMed or full-text articles from PubMed Central) that should contain
the required values of kinetic parameters. There are two groups of
users of this specific application: (1) experimentalists who wish to
compare experimentally estimated values of kinetic parameters to those
reported in the literature, and (2) mathematical modelers who wish to
incorporate known values of kinetic parameters into metabolic models.
Neil Swainston, Daniel Jameson, Peter Li, Irena Spasić, Pedro Mendes and Norman Paton (2010)
Integrative information management for systems biology. In P. Lambrix and G. Kemp (Eds.):
Data Integration in the Life Sciences, LNCS 6254,
Springer, pp. 164-178
Systems biology develops mathematical models of biological systems
that seek to explain, or better still predict, how the system behaves.
In bottom-up systems biology, systematic quantitative experimentation
is carried out to obtain the data required to parameterize models,
which can then be analyzed and simulated. This paper describes an
approach to integrated information management that supports bottom-up
systems biology, with a view to automating, or at least minimizing the
manual effort required during, creation of quantitative models from
qualitative models and experimental data. Automating the process makes
model construction more systematic, supports good practice at all stages
in the pipeline, and allows timely integration of high throughput
experimental results into models.
Keywords: computational systems biology, workflow
Marco Masseroli, Norman Paton and Irena Spasić (2010)
Search computing and the life sciences.
In S. Ceri and M. Brambilla (Eds.): Search Computing Challenges and Directions,
Springer, pp. 291-306
Search Computing has been proposed to support the integration of the results
of search engines with other data and computational resources. A key feature
of the resulting integration platform is direct support for multi-domain
ordered data, reflecting the fact that search engines produce ranked outputs,
which should be taken into account when the results of several requests are
combined. In the life sciences, there are many different types of ranked data.
For example, ranked data may represent many different phenomena, including
physical ordering within a genome, algorithmically assigned scores that
represent levels of sequence similarity, and experimentally measured values
such as expression levels. This chapter explores the extent to which the
search computing functionalities designed for use with search engine results
may be applicable for different forms of ranked data that are encountered
when carrying out data integration in the life sciences. This is done by
classifying different types of ranked data in the life sciences, providing
examples of different types of ranking and ranking integration needs in the
life sciences, identifying issues in the integration of such ranked data,
and discussing techniques for drawing conclusions from diverse rankings.
Keywords: search computing, bioinformatics, data integration, ranked data
Goran Nenadić and Irena Spasić
(2008) Towards automatic terminology extraction in Serbian.
In G. Zybatow et al. (Eds.): Formal Description of Slavic Languages, Peter Lang, Frankfurt/Main
In this article we discuss an automatic approach used to facilitate terminological
processing of texts in Serbian. The approach combines linguistic knowledge (term formation
patterns) with corpus-based statistical measures. Generic morpho-syntactic filters are used to
extract individual term candidates. The filters encode information on grammatical agreements
used as clues to discover boundaries of term candidates in a free text and to determine their
inner structure (nested terms). The extracted candidates are further grouped into sets that
unify their inflectional and lexical variants, and are subsequently assigned termhoods (i.e.
likelihood to represent actual terms). We present preliminary results of the terminological
processing of a textbook corpus in the domains of mathematics and computer science.
We present a measure of contextual similarity for biomedical terms.
Contextual features need to be explored because newly coined terms are
not yet explicitly described or efficiently stored in biomedical
ontologies, and their internal features (e.g. morphological or
orthographic) do not always provide sufficient information
about the properties of the underlying concepts. The context of
each term can be represented as a sequence of syntactic elements
annotated with biomedical information retrieved from an ontology.
The sequences of contextual elements may be matched approximately
by edit distance defined as the minimal cost incurred by the
changes (including insertion, deletion and replacement) needed to
transform one sequence into the other. Our approach augments the
traditional concept of edit distance by elements of linguistic and
biomedical knowledge, which together provide flexible selection of
contextual features and their comparison.
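A minimal sketch of the knowledge-augmented edit distance described above: standard dynamic programming over two contextual sequences, except that replacing one element with another is cheaper when the two share an ontology class. The element representation and cost values are illustrative assumptions, not the published parameters.

```python
def replace_cost(a, b):
    if a == b:
        return 0.0
    _, class_a = a
    _, class_b = b
    return 0.5 if class_a == class_b else 1.0   # semantic discount

def edit_distance(s, t, ins=1.0, delete=1.0):
    d = [[0.0] * (len(t) + 1) for _ in range(len(s) + 1)]
    for i in range(1, len(s) + 1):
        d[i][0] = i * delete
    for j in range(1, len(t) + 1):
        d[0][j] = j * ins
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            d[i][j] = min(d[i - 1][j] + delete,
                          d[i][j - 1] + ins,
                          d[i - 1][j - 1] + replace_cost(s[i - 1], t[j - 1]))
    return d[-1][-1]

# contexts as (token, ontology class) pairs
c1 = [("activates", "regulation"), ("p53", "protein")]
c2 = [("inhibits", "regulation"), ("mdm2", "protein")]
print(edit_distance(c1, c2))   # 1.0: two same-class replacements
```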
Irena Spasić and
Mining biomedical abstracts: what is in a term?
In K.Y. Su et al. (Eds.): Natural Language Processing - IJCNLP 2004.
LNAI 3248, Springer, pp. 797-806
In this paper we present a study of the usage of terminology in biomedical literature, with the main aim of indicating phenomena that can be helpful for automatic term recognition in the domain. Our comparative analysis is based on the terminology used in the Genia corpus. We analyse the usage of ordinary biomedical terms as well as their variants (namely inflectional and orthographic alternatives, terms with prepositions, coordinated terms, etc.), showing the variability and dynamic nature of terms used in biomedical abstracts. Term coordination and terms containing prepositions are analysed in detail. We show that there is a discrepancy between terms used in the literature and terms listed in controlled dictionaries. We also evaluate the effectiveness of incorporating different types of term variation into an automatic term recognition system.
Goran Nenadić and
Learning to classify biomedical terms through literature mining and genetic algorithms.
In Z.R. Yang et al. (Eds.): Intelligent Data Engineering and Automated Learning - IDEAL 2004.
LNCS 3177, Springer, pp. 345-351
We present an approach to the classification of biomedical terms based on information acquired automatically from a corpus of relevant literature. The learning phase consists of two stages: acquisition of terminologically relevant contextual patterns (CPs) and selection of classes that apply to terms used with these patterns. CPs represent a generalisation of similar term contexts in the form of regular expressions containing lexical, syntactic and terminological information. The most probable classes for the training terms co-occurring with a statistically relevant CP are learned by a genetic algorithm. Term classification is then based on the learnt results. First, each term is associated with its most frequently co-occurring CP. The classes attached to that CP are initially suggested as the term's potential classes. Then, the term is finally mapped to the most similar suggested class.
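The classification step described above can be sketched as follows, with hypothetical data structures standing in for the learnt patterns and classes; the CP string, counts and similarity function are all invented for illustration.

```python
from collections import Counter

cp_classes = {                       # classes learnt for each CP
    "<TERM> is expressed in": {"protein", "gene"},
}
term_cp_counts = {                   # CP co-occurrence counts per term
    "p53": Counter({"<TERM> is expressed in": 17}),
}

def classify(term, similarity):
    cp = term_cp_counts[term].most_common(1)[0][0]  # dominant CP
    candidates = cp_classes[cp]                     # suggested classes
    return max(candidates, key=lambda c: similarity(term, c))

# a toy similarity that simply prefers the 'protein' class
print(classify("p53", lambda t, c: 1.0 if c == "protein" else 0.0))
```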
Irena Spasić and
Reducing lexical ambiguity in Serbo-Croatian by using genetic algorithms.
In P. Kosta et al. (Eds.): Investigations into Formal Slavic Linguistics.
Linguistik International, Peter Lang, Frankfurt, pp. 287-298
This paper presents an approach to the acquisition of certain lexical and grammatical constraints from large corpora using genetic algorithms. The main aim is to use these constraints to automatically define local grammars that can reduce the lexical ambiguity usually found in an initially tagged text. A genetic algorithm for computing a minimal representation of the grammatical features of textual constituents is suggested. The algorithm incorporates two types of genes, dominant and recessive, which are specific to the features being analysed. The resulting genetic structure describes the constraints that have to be fulfilled in order to form a correct utterance. As a case study, the suggested algorithm is applied to the contexts of prepositional phrases, and the features of the corresponding noun phrases are obtained. The results coincide with (theoretical) grammars that define the constraints for such noun phrases.
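A minimal sketch, under assumptions, of the dominant/recessive crossover idea: each gene holds a set of admissible feature values, a dominant gene always propagates into the offspring, and two genes of equal standing combine by intersection, shrinking towards a minimal representation. The feature values are invented; the paper does not specify how two dominant genes interact, so that case is a guess.

```python
def crossover(gene_a, gene_b):
    (values_a, dominant_a), (values_b, dominant_b) = gene_a, gene_b
    if dominant_a and not dominant_b:
        return gene_a                  # dominant gene propagates
    if dominant_b and not dominant_a:
        return gene_b
    # equal standing: keep only the shared feature values (assumption)
    return (values_a & values_b, dominant_a and dominant_b)

# grammatical case values observed for noun phrases after a preposition
g1 = ({"genitive", "accusative"}, False)
g2 = ({"genitive"}, False)
print(crossover(g1, g2))               # ({'genitive'}, False)
```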
Kostas Manios and
Supervised learning of term similarities.
In Hujun Yin et al. (Eds.): Intelligent Data Engineering and Automated Learning - IDEAL 2002.
LNCS 2412, Springer, pp. 429-434
In this paper we present a method for the automatic discovery
and tuning of term similarities. The method is based on the
automatic extraction of significant patterns in which terms
tend to appear. In addition, we use lexical and functional
similarities between terms to define a hybrid similarity
measure as a linear combination of the three similarities.
We then present a genetic algorithm approach to supervised
learning of parameters that are used in this linear combination.
We used a domain-specific ontology to evaluate the generated
similarity measures and set the direction of their convergence.
The approach has been tested and evaluated in the domain of molecular biology.
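The linear combination described above can be sketched as below, with the weights playing the role of the parameters learnt by the genetic algorithm; the component functions and weight values are crude stand-ins, not the published definitions.

```python
def hybrid_similarity(t1, t2, weights, contextual, lexical, functional):
    w_c, w_l, w_f = weights            # learnt; assumed to sum to 1
    return (w_c * contextual(t1, t2)
            + w_l * lexical(t1, t2)
            + w_f * functional(t1, t2))

def lexical_overlap(t1, t2):           # toy lexical similarity
    a, b = set(t1.split()), set(t2.split())
    return len(a & b) / len(a | b)

score = hybrid_similarity("protein kinase", "tyrosine kinase",
                          weights=(0.5, 0.3, 0.2),
                          contextual=lambda a, b: 0.8,   # placeholders
                          lexical=lexical_overlap,
                          functional=lambda a, b: 0.6)
print(round(score, 2))                 # 0.62
```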
Irena Spasić and
Term clustering using a corpus-based similarity measure.
In P. Sojka et al. (Eds.): Text, Speech and Dialogue - TSD 2002.
LNAI 2448, Springer, pp. 151-154
In this paper we present a method for automatic term clustering.
The method uses a hybrid similarity measure to cluster terms
automatically extracted from a corpus by applying the C/NC-value
method. The measure comprises contextual, functional and lexical
similarity, and it is used to instantiate the cell values in a
similarity matrix. The clustering algorithm uses either the
nearest-neighbour or Ward's method to calculate the distance between
clusters. The approach has been tested and evaluated in the domain
of molecular biology and the results are presented.
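A sketch, under assumptions, of the clustering step using SciPy's hierarchical clustering. The terms and similarity values are invented, and 'single' linkage stands in for the nearest-neighbour option named above.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

terms = ["protein kinase", "tyrosine kinase", "cell membrane"]
similarity = np.array([[1.0, 0.9, 0.1],
                       [0.9, 1.0, 0.2],
                       [0.1, 0.2, 1.0]])
distance = 1.0 - similarity            # turn similarities into distances
condensed = squareform(distance, checks=False)
# 'single' is the nearest-neighbour linkage; method='ward' would mirror
# the paper's other option, though SciPy expects Euclidean input for it
labels = fcluster(linkage(condensed, method="single"), t=0.5,
                  criterion="distance")
print(dict(zip(terms, labels)))        # the two kinase terms cluster
```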
Irena Spasić and
Syntactic structures in a sublanguage of Serbian for querying relational databases.
In G. Zybatow et al. (Eds.): Current Issues in Formal Slavic Linguistics.
Peter Lang, Frankfurt/Main, pp. 478-488
This paper deals with syntactic structures identified in a sublanguage of Serbian for querying relational databases. Three levels of syntactic description of the sublanguage are defined: the word, syntagmatic and sentence levels. An algorithm for the complete syntactic analysis of a Serbian-language query over a relational database and its translation into a formal SQL query is presented. An example of partial parsing and translation is discussed.
Goran Nenadić and
The recognition and acquisition of compound names from corpora.
In D. Christodoulakis (Ed.): Natural Language Processing - NLP 2000.
LNAI 1835, Springer, pp. 38-48
In this paper we present an approach to the acquisition of certain classes
of compound words from large corpora, as well as a method for the
semi-automatic generation of appropriate linguistic models, which can then
be used for compound word recognition and for the completion of compound
word dictionaries. The approach is intended for a highly inflected language
such as Serbo-Croatian. The generated linguistic models are represented by
local grammars.
Goran Nenadić and
The acquisition of some lexical constraints from corpora.
In V. Matousek et al. (Eds.): Text, Speech and Dialogue - TSD 1999.
LNAI 1692, Springer, pp. 115-120
This paper presents an approach to the acquisition of certain lexical and
grammatical constraints from large corpora. The constraints
discussed are related to grammatical features of a preposition and
the corresponding noun phrase that constitute a prepositional phrase.
The approach is based on the extraction of a textual environment of
a preposition from a corpus, which is then tagged using the system
of electronic dictionaries. An algorithm for computing a minimal
representation of the grammatical features associated with the
corresponding noun phrases is suggested. The resulting set of
features describes the constraints that a noun phrase has to fulfil in
order to form a correct prepositional phrase with a given preposition.
This set can be checked against other corpora.
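One natural reading of the "minimal representation" described above is the set of features shared by every noun phrase observed after a given preposition; the sketch below computes it as an intersection. The feature names and Serbian example are invented for clarity.

```python
# noun-phrase features observed after the preposition 'u' ('in')
observed = [
    {"case": "locative", "number": "singular"},
    {"case": "locative", "number": "plural"},
]

def minimal_representation(feature_sets):
    """Keep only the feature-value pairs common to all observations."""
    items = set(feature_sets[0].items())
    for fs in feature_sets[1:]:
        items &= set(fs.items())
    return dict(items)

print(minimal_representation(observed))   # {'case': 'locative'}
```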
Refereed conference papers:
Kieran Evans, Andrew Jones, Alun Preece, Francisco Quevedo, David Rogers, Irena Spasić, Ian Taylor,
Vlado Stankovski, Salman Taherizadeh, Jernej Trnkoczy, George Suciu, Victor Suciu, Paul Martin, Junchao Wang and Zhiming Zhao
Dynamically reconfigurable workflows for time-critical applications,
in Proceedings of the 10th Workshop on Workflows in Support of Large-Scale Science,
Austin, Texas, USA
Lakhveer Bhachu, Larisa Soldatova, Irena Spasić and Kate Button
Mobile application KneeCare to support knee rehabilitation,
Science and Information Conference,
Access control and privacy policies change during the course of collaboration.
Information is often shared with collaborators outside of the traditional "perimeterized"
organizational computer network. At this point the information owner (in the legal data
protection sense) loses persistent control over their information. They cannot modify
the policy that controls who accesses it, and have that enforced on the information
wherever it resides. However, if patient consent is withdrawn or if the collaboration
comes to an end naturally, or prematurely, the owner may be required to withdraw further
access to their information. This paper presents a system that enhances the way access
control technology is currently deployed so that information owners retain control of
their access control and privacy policies, even after information has been shared.
Bioinformatics applications heavily rely on controlled vocabularies and ontologies to consistently interpret and seamlessly integrate information scattered across disparate public resources. Experimental data from metabolomics studies need to be integrated with one another, but also with data produced by other types of omics studies in the spirit of systems biology, hence the pressing need for vocabularies and ontologies in metabolomics. Here we describe the development of controlled vocabularies for metabolomics investigations. Manual term acquisition approaches are time-consuming, labour-intensive and error-prone, especially in a rapidly developing domain such as metabolomics, where new analytical techniques emerge regularly so that the domain experts are often compelled to use non-standardised terms. We suggest a text mining method for efficient corpus-based term acquisition as a way of rapidly expanding a set of controlled vocabularies with the terms used in the scientific literature.
In this paper we discuss the performance of a text-based classification approach by comparing different types of features. We consider the automatic classification of gene names from the molecular biology literature, by using a support-vector machine method. Classification features range from words, lemmas and stems, to automatically extracted terms. Also, simple co-occurrences of genes within documents are considered. The preliminary experiments performed on a set of 3,000 S. cerevisiae gene names and 53,000 Medline abstracts have shown that using domain-specific terms can improve the performance compared to the standard bag-of-words approach, in particular for genes classified with higher confidence, and for under-represented classes.
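The comparison described above can be sketched with scikit-learn: a linear SVM over plain bag-of-words features versus one whose features are restricted to domain-specific terms. The corpus, labels and term vocabulary below are toy placeholders, not the study's data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

abstracts = ["CDC28 kinase regulates the cell cycle",
             "HSP104 mediates the heat shock response"]
labels = ["cell cycle", "stress response"]

bag_of_words = make_pipeline(CountVectorizer(), LinearSVC())
term_features = make_pipeline(
    CountVectorizer(vocabulary=["kinase", "cell cycle", "heat shock"],
                    ngram_range=(1, 2)),   # extracted terms only
    LinearSVC())

for model in (bag_of_words, term_features):
    model.fit(abstracts, labels)
    print(model.predict(["a kinase involved in the cell cycle"]))
```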
In this paper we present an approach to term classification based on verb complementation patterns. The complementation patterns have been automatically learnt by combining information found in a corpus and an ontology, both belonging to the biomedical domain. The learning process is unsupervised and has been implemented as an iterative reasoning procedure based on a partial order relation induced by the domain-specific ontology. First, term recognition was performed by both looking up the dictionary of terms listed in the ontology and applying the C/NC-value method. Subsequently, domain-specific verbs were automatically identified in the corpus. Finally, the classes of terms typically selected as arguments for the considered verbs were induced from the corpus and the ontology. This information was used to classify newly recognised terms. The precision of the classification method reached 64%.
Goran Nenadić, Irena Spasić and Sophia Ananiadou (2003) Morpho-syntactic clues for terminological processing in Serbian, in Proceedings of EACL Workshop on Morphological Processing of Slavic Languages, Budapest, Hungary, pp. 79-86
In this paper we discuss morpho-syntactic clues that can be used to facilitate terminological processing in Serbian. A method (called srCe) for automatic extraction of multiword terms is presented. The approach incorporates a set of generic morpho-syntactic filters for recognition of term candidates, a method for conflation of morphological variants and a module for foreign word recognition. Morpho-syntactic filters describe general term formation patterns, and are implemented as generic regular expressions. The inner structure together with the agreements within term candidates are used as clues to discover the boundaries of nested terms. The results of the terminological processing of a textbook corpus in the domains of mathematics and computer science are presented.
In this paper we describe the X-TRACT workbench, which enables efficient term-based querying against a domain-specific literature corpus. Its main aim is to aid domain specialists in locating and extracting new knowledge from scientific literature corpora. Before querying, a corpus is automatically terminologically analysed by the ATRACT system, which performs terminology recognition based on the C/NC-value method enhanced by incorporation of term variation handling. The results of terminology processing are annotated in XML, and the produced XML documents are stored in an XML-native database. All corpus retrieval operations are performed against this database using an XML query language. We illustrate the way in which the X-TRACT workbench can be utilised for knowledge discovery, literature mining and conceptual information extraction.
Goran Nenadić, Irena Spasić and Sophia Ananiadou (2003) Terminology-driven mining of biomedical literature, in Proceedings of 18th Annual ACM Symposium on Applied Computing, Melbourne, Florida, USA
Motivation: With an overwhelming amount of textual information in molecular biology and biomedicine, there is a need for effective literature mining techniques that can help biologists to gather and make use of the knowledge encoded in text documents. Although the knowledge is organised around sets of domain-specific terms, few literature mining systems incorporate deep and dynamic terminology processing.
Results: In this paper, we present an overview of an integrated framework for terminology-driven mining from biomedical literature. The framework integrates the following components: automatic term recognition, term variation handling, acronym acquisition, automatic discovery of term similarities and term clustering. Term variant recognition is incorporated into the terminology recognition process by taking into account orthographical, morphological, syntactic, lexico-semantic and pragmatic term variations. In particular, we address acronyms as a common way of introducing term variants in biomedical papers. Term clustering is based on the automatic discovery of term similarities. We use a hybrid similarity measure, where terms are compared by using both internal and external evidence. The measure combines lexical, syntactical and contextual similarity. Experiments on terminology recognition and structuring performed on a corpus of biomedical abstracts achieved a precision of 98% and 71%, respectively.
Term recognition and clustering are key topics in automatic knowledge acquisition and text mining. In this paper we present a novel approach to the automatic discovery of term similarities, which serves as a basis for both classification and clustering of domain-specific concepts represented by terms. The method is based on automatic extraction of significant patterns in which terms tend to appear. The approach is domain independent: it needs no manual description of domain-specific features and it is based on knowledge-poor processing of specific term features. However, automatically collected patterns are domain specific and identify significant contexts in which terms are used. Beside features that represent contextual patterns, we use lexical and functional similarities between terms to define a combined similarity measure. The approach has been tested and evaluated in the domain of molecular biology, and preliminary results are presented.
Sophia Ananiadou, Goran Nenadić, Dietrich Schuhmann and Irena Spasić (2002) Term-based literature mining from biomedical texts, ISMB Text Data Mining SIG, Edmonton, Canada
Goran Nenadić and Sophia Ananiadou
Tuning context features with genetic algorithms, in Proceedings of 3rd International Conference on Language, Resources and Evaluation, Las Palmas, Spain, pp. 2048-2054
In this paper we present an approach to tuning of context features acquired from corpora. The approach is based on the idea of a genetic algorithm (GA). We analyse a whole population of contexts surrounding related linguistic entities in order to find a generic property characteristic of such contexts. Our goal is to tune the context properties so as not to lose any correct feature values, but also to minimise the presence of ambiguous values. The GA implements a crossover operator based on dominant and recessive genes, where a gene corresponds to a context feature. A dominant gene is the one that, when combined with another gene of the same type, is inevitably reflected in the offspring. Dominant genes denote the more suitable context features. In each iteration of the GA, the number of individuals in the population is halved, finally resulting in a single individual that contains context features tuned with respect to the information contained in the training corpus. We illustrate the general method by using a case study concerned with the identification of relationships between verbs and terms complementing them. More precisely, we tune the classes of terms that are typically selected as arguments for the considered verbs in order to acquire their semantic features.
Goran Nenadić, Irena Spasić and Sophia Ananiadou (2002) Automatic acronym acquisition and management within domain-specific texts, in Proceedings of 3rd International Conference on Language, Resources and Evaluation, Las Palmas, Spain, pp. 2155-2162
In this paper we present a framework for the effective management of terms and their variants that are automatically acquired from domain-specific texts. In our approach, the term variant recognition is incorporated in the automatic term retrieval process by taking into account orthographical, morphological, syntactic, lexico-semantic and pragmatic term variations. In particular, we address acronyms as a common way of introducing term variants in scientific papers. We describe a method for the automatic acquisition of newly introduced acronyms and the mapping to their 'meanings', i.e. the corresponding terms. The proposed three-step procedure is based on morpho-syntactic constraints that are commonly used in acronym definitions. First, acronym definitions containing an acronym and the corresponding term are retrieved. These two elements are matched in the second step by performing morphological analysis of words and combining forms constituting the term. The problems of acronym variation and acronym ambiguity are addressed in the third step by establishing classes of term variants that correspond to specific concepts. We present the results of the acronym acquisition in the domain of molecular biology: the precision of the method ranged from 94% to 99% depending on the size of the corpus used for evaluation, whilst the recall was 73%.
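A minimal sketch of the core matching step described above: checking whether an acronym's letters can be aligned, in order, with the initial letters of the words in the candidate term. The published method performs full morphological analysis of words and combining forms; this is only the simplest skeleton of the idea, with invented examples.

```python
def matches(acronym, term):
    """Greedy left-to-right alignment of acronym letters to word initials."""
    words = term.lower().split()
    i = 0
    for letter in acronym.lower():
        # advance through the words looking for the next matching initial
        while i < len(words) and not words[i].startswith(letter):
            i += 1
        if i == len(words):
            return False
        i += 1
    return True

print(matches("TNF", "tumour necrosis factor"))   # True
print(matches("TNF", "transcription factor"))     # False
```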
Goran Nenadić and Sophia Ananiadou
A genetic algorithm approach to unsupervised learning of context features, in Proceedings of 5th National Colloquium for Computational Linguistics in the UK, Leeds, UK, pp. 12-19
We present an approach to the unsupervised learning of certain context features from corpora. The approach uses the idea of genetic algorithms. The algorithm operates on a collection of related linguistic entities, as opposed to an isolated linguistic entity. Each of the entities encodes the values for a predefined set of context features obtained by automatic tagging. Our goal is to refine these features in order to find an interpretation that is optimal in the sense that it does not lose any correct feature values but, on the other hand, minimises the presence of feature values that are not applicable in a specific context. Our genetic algorithm implements a novel crossover operator based on two types of genes, dominant and recessive, where a gene corresponds to a context feature.
Dubravka Pavličić and Irena Spasić (2001) The effects of irrelevant alternatives on the results of the TOPSIS method, in Proceedings of XXVIII Yugoslav Symposium
on Operational Research SYM-OP-IS 2001, Belgrade, Serbia
Irena Spasić and Gordana Pavlović-Lažetić
Object-oriented modelling in natural language communication with a relational database, in Selected Papers from 10th Congress of Yugoslav Mathematicians, Belgrade, Serbia, pp. 343-347
This paper describes the problems of developing a natural language
interface to a relational database (RDB). These problems depend on
a particular database, or, more precisely, on a specific semantic domain
that is modeled by the RDB. The most obvious dependency is the one reflected
in the structure of the RDB, that is - the actual tables, attributes and
their relationships. This information is recorded in the RDB catalogue,
which can be used for the automatic generation of an OO model of the RDB.
The classes of that model may serve the purpose of supporting the information
extracted from a natural language query (NLQ). Possible ambiguities are
gradually reduced by using the IsA relationships between the classes. If this
still leaves the ambiguity unresolved, then it is possible to automatically
generate a menu corresponding to the class that is the source of the ambiguity.
The structure of the menu is in accordance with the OO model of the RDB.
Olgica Bošković and Irena Spasić (1999) Graph theory and log-linear models, in Proceedings of XXVI Yugoslav Symposium on Operational Research SYM-OP-IS '99, Belgrade, Serbia
Irena Spasić (1996) Automatic foreign words recognition in a Serbian scientific or technical text, in Proceedings of Conference on Standardization of Terminology, Serbian Academy of Arts and Sciences, Belgrade, Serbia
Irena Spasić and Predrag Janičić (2000) Theory of Algorithms, Languages and Automata. Faculty of Mathematics, Belgrade, Serbia
Miodrag Ivović, Branislav Boričić, Dragan Azdejković and Irena Spasić (1998) Practice Book in Mathematics. Faculty of Economics, Belgrade, Serbia
Miodrag Ivović, Branislav Boričić, Velimir Pavlović, Dragan Azdejković and Irena Spasić (1996) Mathematics through Examples and Exercises with Elements of Theory. Faculty of Economics, Belgrade, Serbia