Prof. Irena Spasić



Journal articles:

Irena Spasić, Bo Zhao, Christopher Jones, Kate Button (2015) KneeTex: An ontology-driven system for information extraction from MRI reports. Journal of Biomedical Semantics, Vol. 6, 34 [PMID: 26347806] [DOI: 10.1186/s13326-015-0033-1] [Web: KneeTex]
Background In the realm of knee pathology, magnetic resonance imaging (MRI) has the advantage of visualising all structures within the knee joint, which makes it a valuable tool for increasing diagnostic accuracy and planning surgical treatments. Therefore, clinical narratives found in MRI reports convey valuable diagnostic information. A range of studies have proven the feasibility of natural language processing for information extraction from clinical narratives. However, no study has focused specifically on MRI reports in relation to knee pathology, possibly owing to the complexity of knee anatomy and the wide range of conditions that may be associated with different anatomical entities. In this paper we describe KneeTex, an information extraction system that operates in this domain.

Methods As an ontology-driven information extraction system, KneeTex makes active use of an ontology to strongly guide and constrain text analysis. We used automatic term recognition to facilitate the development of a domain-specific ontology with sufficient detail and coverage for text mining applications. In combination with the ontology, high regularity of the sublanguage used in knee MRI reports allowed us to model its processing by a set of sophisticated lexico-semantic rules with minimal syntactic analysis. The main processing steps involve named entity recognition combined with coordination, enumeration, ambiguity and co-reference resolution, followed by text segmentation. Ontology-based semantic typing is then used to drive the template filling process.

Results We adopted an existing ontology, TRAK (Taxonomy for RehAbilitation of Knee conditions), for use within KneeTex. The original TRAK ontology was expanded from 1,292 concepts, 1,720 synonyms and 518 relationship instances to 1,621 concepts, 2,550 synonyms and 560 relationship instances. This provided KneeTex with a very fine-grained lexico-semantic knowledge base, highly attuned to the given sublanguage. Information extraction results were evaluated on a test set of 100 MRI reports. The gold standard consisted of 1,259 filled template records with the following slots: finding, finding qualifier, negation, certainty, anatomy and anatomy qualifier. KneeTex extracted information with a precision of 98.00%, recall of 97.63% and F-measure of 97.81%, values that are in line with human performance.
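The reported F-measure is simply the harmonic mean of the reported precision and recall; as a quick arithmetic check (a generic helper for illustration, not part of KneeTex):

```python
# Harmonic mean of precision and recall (the standard balanced F-measure).
def f_measure(precision: float, recall: float) -> float:
    return 2 * precision * recall / (precision + recall)

p, r = 0.9800, 0.9763
f = f_measure(p, r)  # ~0.9781, matching the reported 97.81%
```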

Conclusions KneeTex is an open-source, stand-alone application for information extraction from narrative reports that describe an MRI scan of the knee. Given an MRI report as input, the system outputs the corresponding clinical findings in the form of JavaScript Object Notation objects. The extracted information is mapped onto TRAK, an ontology that formally models knowledge relevant for the rehabilitation of knee conditions. As a result, formally structured and coded information allows for complex searches to be conducted efficiently over the original MRI reports, thereby effectively supporting epidemiologic studies of knee conditions.
Lowri Williams, Christian Bannister, Michael Arribas-Ayllon, Alun Preece and Irena Spasić (2015) The role of idioms in sentiment analysis. Expert Systems with Applications, Vol. 42, No. 21, pp. 7375-7385 [DOI: 10.1016/j.eswa.2015.05.039]
In this paper we investigate the role of idioms in automated approaches to sentiment analysis. To estimate the degree to which the inclusion of idioms as features may improve the results of traditional sentiment analysis, we compared our results against two baseline methods. First, to support idioms as features we collected a set of 580 idioms that are relevant to sentiment analysis, i.e. those that can be mapped to an emotion. These mappings were then obtained using a web-based crowdsourcing approach. The quality of the crowdsourced information is demonstrated by high agreement among five independent annotators, calculated using Krippendorff's alpha coefficient (α = 0.662). Second, to evaluate the results of sentiment analysis, we assembled a corpus of sentences in which idioms are used in context. Each sentence was annotated with an emotion, which formed the basis of the gold standard used for the comparison against the two baseline methods. The performance was evaluated in terms of three measures: precision, recall and F-measure. Overall, our approach achieved 64% and 61% across these three measures in two experiments, improving the baseline results by 20 and 15 percentage points, respectively. The F-measure was significantly improved over all three sentiment polarity classes: Positive, Negative and Other. The most notable improvement was recorded in the classification of positive sentiments, where recall was improved by 45 percentage points in both experiments without compromising precision. The statistical significance of these improvements was confirmed by McNemar's test.

Keywords: emotion recognition, sentiment analysis, natural language processing, user-generated content, tagging
Irena Spasić, Kate Button, Anna Divoli, Satyam Gupta, Tamas Pataky, Diego Pizzocaro, Alun Preece, Robert van Deursen and Chris Wilson (2015) TRAK application suite: A web-based intervention for delivering standard care for the rehabilitation of knee conditions. JMIR Research Protocols, Vol. 4, No. 4, e122 [PMID: 26474643] [DOI: 10.2196/resprot.4091]
Background: Standard care for the rehabilitation of knee conditions involves an exercise program and information provision. Current methods of rehabilitation delivery struggle to keep up with large volumes of patients and the length of treatment required to maximize recovery. Therefore, the development of novel interventions to support self-management is strongly recommended. Such interventions need to include information provision, goal setting, monitoring, feedback and support groups, but the most effective methods of their delivery are poorly understood. The Internet provides a medium for intervention delivery with considerable potential for meeting these needs.

Objective: The primary aim of this study was to demonstrate feasibility of a web-based application and to conduct a preliminary review of its practicability as part of a complex medical intervention in the rehabilitation of knee disorders. This paper describes the development, implementation and usability of such an application.

Methods: The TRAK application suite was developed by an interdisciplinary team of healthcare professionals and researchers, computer scientists and application developers. The key functionality of the application includes information provision, a three-step exercise program based on standard care for the rehabilitation of knee conditions, self-monitoring with visual feedback and a virtual support group. Two types of stakeholders (patients and physiotherapists) were recruited for the usability study. A usability questionnaire was used to collect both qualitative and quantitative information on computer and Internet usage, task completion, and subjective user preferences.

Results: A total of 16 patients and 15 physiotherapists participated in the usability study. Based on the System Usability Scale, the TRAK application has higher perceived usability than 70% of systems. Both patients and physiotherapists agreed that the given web-based approach would facilitate communication, provide information, help recall information, improve understanding, enable exercise progression and support self-management in general. The web application was found to be easy to use and user satisfaction was very high. The TRAK application suite can be accessed at

Conclusion: The usability study suggests that a web-based intervention is feasible and acceptable in supporting self-management of knee conditions.

Keywords: Internet; social media; web applications; mobile applications; usability testing; knee; rehabilitation; exercise; self-management
Kate Button, Paulien Roos, Irena Spasić, Paul Adamson and Robert van Deursen (2015) The clinical effectiveness of self-care interventions with an exercise component to manage knee conditions: A systematic review. The Knee, in press [PMID: 26056046] [DOI: 10.1016/j.knee.2015.05.003]
Objective: Treatment for musculoskeletal knee conditions should include techniques to support self-management and exercise-based interventions, but the most beneficial techniques and the most effective way to combine self-care and exercise are unknown. Therefore, the aim was to evaluate the clinical effectiveness of self-care programs that include an exercise component for musculoskeletal knee conditions.

Methods: A keyword search of the MEDLINE, CINAHL, AMED, PsycINFO, Web of Science and Cochrane databases was conducted up until July 2014. Two reviewers independently assessed manuscript eligibility against the inclusion/exclusion criteria. Study quality was assessed using the Downs and Black quality assessment tool and the Cochrane Risk of Bias tool. Data were extracted on self-care and exercise intervention type, control intervention, participants, length of follow-up, outcome measures and main findings.

Results: Of the 7,392 studies identified through the keyword search, titles and abstracts were screened for 5,498 studies. The full-text manuscripts of 106 articles were retrieved to evaluate their eligibility. Twenty-one manuscripts met the inclusion/exclusion criteria.

Conclusion: The treatment potential of combining self-care and exercise interventions has not been maximised, owing to limitations in study design and a failure to adequately define intervention content. Potentially the most beneficial self-care treatment components are training self-management skills, information delivery and goal setting. Exercise treatment components could be strengthened by better attention to dose and progression. Technology should be considered to streamline delivery, as high levels of supervision are not required. More emphasis is required on using self-care and exercise programs for chronic condition prevention in addition to chronic condition management.

Keywords: self-care, exercise, knee, rehabilitation, patient education
Irena Spasić, Jacqueline Livsey, John Keane and Goran Nenadić (2014) Text mining of cancer-related information: review of current status and future directions. International Journal of Medical Informatics, Vol. 83, No. 9, pp. 605-623 [PMID: 25008281] [DOI: 10.1016/j.ijmedinf.2014.06.009]
Purpose: This paper reviews the research literature on text mining (TM) with the aim to find out (1) which cancer domains have been the subject of TM efforts, (2) which knowledge resources can support TM of cancer-related information and (3) to what extent systems that rely on knowledge and computational methods can convert text data into useful clinical information. These questions were used to determine the current state of the art in this particular strand of TM and suggest future directions in TM development to support cancer research.

Methods: A review of the research on TM of cancer-related information was carried out. A literature search was conducted on the Medline database as well as IEEE Xplore and ACM digital libraries to address the interdisciplinary nature of such research. The search results were supplemented with the literature identified through Google Scholar.

Results: A range of studies have proven the feasibility of TM for extracting structured information from clinical narratives such as those found in pathology or radiology reports. In this article, we provide a critical overview of the current state of the art for TM related to cancer. The review highlighted a strong bias towards symbolic methods, e.g. named entity recognition (NER) based on dictionary lookup and information extraction (IE) relying on pattern matching. The F-measure of NER ranges between 80% and 90%, while that of IE for simple tasks is in the high 90s. To further improve the performance, TM approaches need to deal effectively with idiosyncrasies of the clinical sublanguage such as non-standard abbreviations as well as a high degree of spelling and grammatical errors. This requires a shift from rule-based methods to machine learning following the success of similar trends in biological applications of TM. Machine learning approaches require large training datasets, but clinical narratives are not readily available for TM research due to privacy and confidentiality concerns. This issue remains the main bottleneck for progress in this area. In addition, there is a need for a comprehensive cancer ontology that would enable semantic representation of textual information found in narrative reports.
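As a toy illustration of the dictionary look-up NER the review describes: the lexicon, semantic types and example text below are invented for this sketch and are not code from any of the surveyed systems.

```python
import re

# A minimal dictionary look-up named entity recogniser: match lexicon
# entries in the text, preferring longer entries over shorter ones.
LEXICON = {
    "carcinoma": "Finding",
    "adenocarcinoma": "Finding",
    "left lung": "Anatomy",
}

def dictionary_ner(text: str) -> list:
    """Return (entity, semantic type, span) for longest lexicon matches."""
    hits = []
    lowered = text.lower()
    # Try longer entries first so "adenocarcinoma" beats "carcinoma".
    for term in sorted(LEXICON, key=len, reverse=True):
        for m in re.finditer(r"\b" + re.escape(term) + r"\b", lowered):
            overlaps = any(s <= m.start() < e or s < m.end() <= e
                           for _, _, (s, e) in hits)
            if not overlaps:
                hits.append((term, LEXICON[term], (m.start(), m.end())))
    return sorted(hits, key=lambda h: h[2])

print(dictionary_ner("Adenocarcinoma in the left lung."))
```

In practice such systems map matches to a terminology such as the UMLS rather than a hand-made dictionary, which is precisely why the review calls for richer resources such as a comprehensive cancer ontology.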

Keywords: Cancer, Natural language processing, Data mining, Electronic medical records
Christian Bannister, Craig Currie, Alun Preece and Irena Spasić (2014) Automatic development of clinical prediction models with genetic programming: A case study in cardiovascular disease. Value in Health, Vol. 17, No. 3, pp. A200-A201 [DOI: 10.1016/j.jval.2014.03.1171]

Simon Moore, Claire O'Brien, Mohammed Fasihul Alam, David Cohen, Kerenza Hood, Chao Huang, Laurence Moore, Simon Murphy, Rebecca Playle, Vaseekaran Sivarajasingam, Irena Spasić, Anne Williams and Jonathan Shepherd (2014) All-Wales licensed premises intervention (AWLPI): a randomised controlled trial to reduce alcohol-related violence. BMC Public Health, Vol. 14, 21 [PMID: 24405575] [DOI: 10.1186/1471-2458-14-21]
Background Alcohol-related violence in and in the vicinity of licensed premises continues to place a considerable burden on the United Kingdom's (UK) health services. Robust interventions targeted at licensed premises are therefore required to reduce the costs of alcohol-related harm. Previous evaluations of interventions in licensed premises have a number of methodological limitations and none have been conducted in the UK. The aim of the trial was to determine the effectiveness of the Safety Management in Licensed Environments intervention designed to reduce alcohol-related violence in licensed premises, delivered by Environmental Health Officers, under their statutory authority to intervene in cases of violence in the workplace.

Methods A national randomised controlled trial, with licensed premises as the unit of allocation. Premises were identified from all 22 Local Authorities in Wales. Eligible premises were those with identifiable violent incidents on the premises, using police recorded violence data. Premises were allocated to intervention or control by optimally balancing on Environmental Health Officer capacity in each Local Authority, the number of violent incidents in the 12 months leading up to the start of the project, and opening hours. The primary outcome measure is the difference in frequency of violence between intervention and control premises over a 12-month follow-up period, based on a recurrent event model. The trial incorporates an embedded process evaluation to assess intervention implementation, fidelity, reach and reception, and to interpret outcome effects, as well as investigate its economic impact.

Discussion The results of the trial will be applicable to all statutory authorities directly involved with managing violence in the night time economy and will provide the first formal test of Health and Safety policy in this environment. If successful, opportunities for replication and generalisation will be considered.
Christian Bannister, Chris Poole, Sara Jenkins-Jones, Christopher Morgan, Glyn Elwyn, Irena Spasić and Craig Currie (2014) External validation of the UKPDS risk engine in incident type 2 diabetes: a need for new type 2 diabetes-specific risk equations. Diabetes Care, Vol. 37, No. 2, pp. 537-545 [PMID: 24089541] [DOI: 10.2337/dc13-1159]
Objective To evaluate the performance of the United Kingdom Prospective Diabetes Study (UKPDS) Risk Engine for predicting the 10-year risk of cardiovascular disease endpoints in an independent cohort of UK patients newly diagnosed with type 2 diabetes.

Research Design and Methods This was a retrospective cohort study using routine healthcare data collected between April 1998 and October 2011 from around 350 UK primary-care practices contributing to the Clinical Practice Research Datalink (CPRD). Participants comprised 79,966 patients aged between 35 and 85 years (388,269 person-years) with 4,984 cardiovascular events. Four outcomes were evaluated: first diagnosis of coronary heart disease (CHD), stroke, fatal CHD, and fatal stroke.

Results Accounting for censoring, the observed versus predicted 10-year event rates were as follows: CHD 6.1% vs. 16.5%, fatal CHD 1.9% vs. 10.1%, stroke 7.0% vs. 10.1%, and fatal stroke 1.7% vs. 1.6%. The UKPDS Risk Engine showed moderate discrimination for all four outcomes, with concordance-index values ranging from 0.65 to 0.78.

Conclusions The UKPDS stroke equations showed calibration ranging from poor to moderate; however, the CHD equations showed poor calibration and considerably overestimated CHD risk. There is a need for revised risk equations in type 2 diabetes.
Irena Spasić, Mark Greenwood, Alun Preece, Nick Francis and Glyn Elwyn (2013) FlexiTerm: A flexible term recognition method. Journal of Biomedical Semantics, Vol. 4, 27 [PMID: 24112363] [DOI: 10.1186/2041-1480-4-27] [Web: FlexiTerm]
Background The increasing amount of textual information in biomedicine requires effective term recognition methods to identify textual representations of domain-specific concepts as the first step toward automating their semantic interpretation. Dictionary look-up approaches may not always be suitable for dynamic domains such as biomedicine or newly emerging types of media such as patient blogs, the main obstacles being the use of non-standardised terminology and a high degree of term variation.

Results In this paper, we describe FlexiTerm, a method for automatic term recognition from a domain-specific corpus, and evaluate its performance against five manually annotated corpora. FlexiTerm performs term recognition in two steps: linguistic filtering is used to select term candidates followed by calculation of termhood, a frequency-based measure used as evidence to qualify a candidate as a term. In order to improve the quality of termhood calculation, which may be affected by the term variation phenomena, FlexiTerm uses a range of methods to neutralise the main sources of variation in biomedical terms. It manages syntactic variation by processing candidates using a bag-of-words approach. Orthographic and morphological variations are dealt with using stemming in combination with lexical and phonetic similarity measures. The method was evaluated on five biomedical corpora. The highest values for precision (94.56%), recall (71.31%) and F-measure (81.31%) were achieved on a corpus of clinical notes.
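The variant-normalisation step can be sketched as follows. This is a toy illustration under simplifying assumptions (a crude hand-rolled suffix stripper and inline removal of a single preposition), not FlexiTerm's actual implementation, which additionally uses lexical and phonetic similarity measures.

```python
from collections import Counter

# Crude suffix stripper standing in for a proper stemmer; illustrative only.
def crude_stem(word: str) -> str:
    for suffix in ("ation", "ing", "es", "s", "e"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def normalise(term: str) -> frozenset:
    """Map a term candidate to a bag of stems, neutralising word order
    (syntactic variation) and inflection (morphological variation)."""
    return frozenset(crude_stem(w) for w in term.lower().split())

candidates = ["anterior cruciate ligament tear",
              "tear of anterior cruciate ligament",
              "anterior cruciate ligament tears"]
# Drop the preposition "of" as a stand-in for stop-word filtering.
groups = Counter(normalise(t.replace(" of ", " ")) for t in candidates)
# All three surface variants collapse to one normalised form, whose pooled
# frequency (3) then feeds the termhood calculation.
```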

Conclusions FlexiTerm is an open-source software tool for automatic term recognition. It incorporates a simple term variant normalisation method. The method proved to be more robust than the baseline against less formally structured texts, such as those found in patient blogs or medical notes. The software can be downloaded freely at
Kate Button, Robert W. van Deursen, Larisa Soldatova and Irena Spasić (2013) TRAK ontology: Defining standard care for the rehabilitation of knee conditions. Journal of Biomedical Informatics, Vol. 46, No. 4, pp. 615-625 [PMID: 23665300] [DOI: 10.1016/j.jbi.2013.04.009] [BioPortal: 3210] [Web: TRAK]
In this paper we discuss the design and development of TRAK (Taxonomy for RehAbilitation of Knee conditions), an ontology that formally models information relevant for the rehabilitation of knee conditions. TRAK provides the framework that can be used to collect coded data in sufficient detail to support epidemiologic studies so that the most effective treatment components can be identified, new interventions developed and the quality of future randomized control trials improved to incorporate a control intervention that is well defined and reflects clinical practice. TRAK follows design principles recommended by the Open Biomedical Ontologies (OBO) Foundry. TRAK uses the Basic Formal Ontology (BFO) as the upper-level ontology and refers to other relevant ontologies such as Information Artifact Ontology (IAO), Ontology for General Medical Science (OGMS), Phenotype And Trait Ontology (PATO), etc. TRAK is orthogonal to other bio-ontologies and represents domain-specific knowledge about treatments and modalities used in rehabilitation of knee conditions. Definitions of typical exercises used as treatment modalities are supported with appropriate illustrations, which can be viewed in the OBO-Edit ontology editor. The vast majority of other classes in TRAK are cross-referenced to the Unified Medical Language System (UMLS) to facilitate future integration with other terminological sources. TRAK is implemented in OBO, a format widely used by the OBO community. TRAK is available for download from In addition, its public release can be accessed through BioPortal, where it can be browsed, searched and visualized.

Keywords: ontology; taxonomy; knee; rehabilitation; physiotherapy
Kieran Smallbone, Hanan Messiha, Kathleen Carroll, Catherine Winder, Naglis Malys, Warwick Dunn, Ettore Murabito, Neil Swainston, Joseph Dada, Farid Khan, Pinar Pir, Evangelos Simeonidis, Irena Spasić, Jill Wishart, Dieter Weichart, Neil Hayes, Daniel Jameson, David Broomhead, Stephen Oliver, Simon Gaskell, John McCarthy, Norman Paton, Hans Westerhoff, Douglas Kell and Pedro Mendes (2013) A model of yeast glycolysis based on a consistent kinetic characterisation of all its enzymes. FEBS Letters, Vol. 587, No. 17, pp. 2832-2841 [PMID: 23831062] [DOI: 10.1016/j.febslet.2013.06.043]
We present an experimental and computational pipeline for the generation of kinetic models of metabolism, and demonstrate its application to glycolysis in Saccharomyces cerevisiae. Starting from an approximate mathematical model, we employ a "cycle of knowledge" strategy, identifying the steps with most control over flux. Kinetic parameters of the individual isoenzymes within these steps are measured experimentally under a standardised set of conditions. Experimental strategies are applied to establish a set of in vivo concentrations for isoenzymes and metabolites. The data are integrated into a mathematical model that is used to predict a new set of metabolite concentrations and reevaluate the control properties of the system. This bottom-up modelling study reveals that control over the metabolic network most directly involved in yeast glycolysis is more widely distributed than previously thought.

Keywords: glycolysis, systems biology, enzyme kinetics, isoenzymes, modelling
Irena Spasić, Peter Burnap, Mark Greenwood and Michael Arribas-Ayllon (2012) A naive Bayes approach to topic classification in suicide notes. Biomedical Informatics Insights, Vol. 5, Suppl. 1, pp. 87-97 [PMID: 22879764] [DOI: 10.4137/BII.S8945]
The authors present a system developed for the 2011 i2b2 Challenge on Sentiment Classification, whose aim was to automatically classify sentences in suicide notes using a scheme of 15 topics, mostly emotions. The system combines machine learning with a rule-based methodology. The features used to represent the problem were based on lexico-semantic properties of individual words, in addition to regular expressions used to represent patterns of word usage across different topics. A naive Bayes classifier was trained using the features extracted from the training data, consisting of 600 manually annotated suicide notes. Classification was then performed using the naive Bayes classifier as well as a set of pattern-matching rules. The classification performance was evaluated against a manually prepared gold standard consisting of 300 suicide notes, in which 1,091 out of a total of 2,037 sentences were associated with a total of 1,272 annotations. The competing systems were ranked using the micro-averaged F-measure as the primary evaluation metric. Our system achieved an F-measure of 53% (with 55% precision and 52% recall), which was significantly better than the average performance of 48.75% achieved by the 26 participating teams.
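To make the classification approach concrete, here is a minimal multinomial naive Bayes sketch. The topics, training sentences and word features below are invented for illustration; they do not reflect the i2b2 annotation scheme, the data, or the actual system.

```python
import math
from collections import Counter, defaultdict

# Minimal multinomial naive Bayes with Laplace smoothing, for
# sentence-level topic classification.
class NaiveBayes:
    def __init__(self):
        self.word_counts = defaultdict(Counter)  # topic -> word frequencies
        self.topic_counts = Counter()            # topic -> sentence count
        self.vocab = set()

    def train(self, sentences):
        for text, topic in sentences:
            words = text.lower().split()
            self.topic_counts[topic] += 1
            self.word_counts[topic].update(words)
            self.vocab.update(words)

    def classify(self, text):
        words = text.lower().split()
        n = sum(self.topic_counts.values())
        def log_prob(topic):
            prior = math.log(self.topic_counts[topic] / n)
            total = sum(self.word_counts[topic].values()) + len(self.vocab)
            # Add-one (Laplace) smoothing avoids zero probabilities
            # for words unseen in a given topic.
            return prior + sum(
                math.log((self.word_counts[topic][w] + 1) / total)
                for w in words)
        return max(self.topic_counts, key=log_prob)

nb = NaiveBayes()
nb.train([("i love you all", "love"),
          ("i am so sorry for everything", "guilt"),
          ("forgive me please", "guilt"),
          ("you were always loved", "love")])
print(nb.classify("i am sorry"))
```

In the actual system, such a classifier operated alongside pattern-matching rules; a rule can short-circuit the statistical decision when a reliable lexical pattern fires.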

Keywords: natural language processing, sentiment analysis, topic classification, naive Bayes classifier
Peter Li, Joseph Dada, Daniel Jameson, Irena Spasić, Neil Swainston, Kathleen Carroll, Warwick Dunn, Farid Khan, Hanan Messiha, Evangelos Simeonidis, Dieter Weichart, Catherine Winder, David Broomhead, Carole Goble, Simon Gaskell, Douglas Kell, Hans Westerhoff, Pedro Mendes and Norman Paton (2010) Systematic integration of experimental data and models in systems biology. BMC Bioinformatics, Vol. 11, 582 [PMID: 21114840] [DOI: 10.1186/1471-2105-11-582] highly accessed
Background: The behaviour of biological systems can be deduced from their mathematical models. However, multiple sources of data in diverse forms are required in the construction of a model in order to define its components and their biochemical reactions, and corresponding parameters. Automating the assembly and use of systems biology models is dependent upon data integration processes involving the interoperation of data and analytical resources.

Results: Taverna workflows have been developed for the automated assembly of quantitative parameterised metabolic networks in the Systems Biology Markup Language (SBML). An SBML model is built in a systematic fashion by the workflows, which start with the construction of a qualitative network using data from a MIRIAM-compliant genome-scale model of yeast metabolism. This is followed by parameterisation of the SBML model with experimental data from two repositories, the SABIO-RK enzyme kinetics database and a database of quantitative experimental results. The models are then calibrated and simulated in workflows that call out to COPASIWS, the web service interface to the COPASI software application for analysing biochemical networks. These systems biology workflows were evaluated for their ability to construct a parameterised model of yeast glycolysis.

Conclusions: Distributed information about metabolic reactions that have been described to MIRIAM standards enables the automated assembly of quantitative systems biology models of metabolic networks based on user-defined criteria. Such data integration processes can be implemented as Taverna workflows to provide a rapid overview of the components and their relationships within a biochemical system.
Irena Spasić, Farzaneh Sarafraz, John Keane and Goran Nenadić (2010) Medication information extraction with linguistic pattern matching and semantic rules. Journal of the American Medical Informatics Association, Vol. 17, No. 5, pp. 532-535 [PMID: 20819858] [DOI: 10.1136/jamia.2010.003657]
Objective: We present a system developed for the 2009 i2b2 Challenge in Natural Language Processing for Clinical Data, whose aim was to automatically extract certain information about medications used by a patient from his/her medical report. The aim was to extract the following information for each medication: name, dosage, mode/route, frequency, duration, and reason.

Design: The system implements a rule-based methodology, which exploits typical morphological, lexical, syntactic and semantic features of the targeted information. These features were acquired from the training dataset and public resources such as the UMLS and relevant web pages. Information extracted via pattern matching was combined together using context-sensitive heuristic rules.

Measurements: The system was applied to a set of 547 previously unseen discharge summaries, and the extracted information was evaluated against a manually prepared gold standard consisting of 251 documents. The overall ranking of the participating teams was obtained using the micro-averaged F-measure as the primary evaluation metric.

Results: The implemented method achieved the micro-averaged F-measure of 81% (with 86% precision and 77% recall), which ranked our system third in the challenge. The significance tests revealed our system's performance to be not significantly different from that of the second ranked system. Relative to other systems, our system achieved the best F-measure for the extraction of duration (53%) and reason (46%).

Conclusion: Based on the F-measure, the performance achieved (81%) was in line with the initial agreement between human annotators (82%), indicating that such a system may greatly facilitate the process of extracting relevant information from medical records by providing a solid basis for a manual review process.
Joseph Dada, Irena Spasić, Norman Paton and Pedro Mendes (2010) SBRML: a markup language for associating systems biology data with models. Bioinformatics, Vol. 26, No. 7, pp. 932-938 [PMID: 20176582] [DOI: 10.1093/bioinformatics/btq069]
Motivation: Research in systems biology is carried out through a combination of experiments and models. Several data standards have been adopted for representing models (SBML) and various types of relevant experimental data (such as FuGE and those of the Proteomics Standards Initiative). However, until now, there has been no standard way to associate a model and its entities to the corresponding data sets, or vice versa. Such a standard would provide a means to represent computational simulation results as well as to frame experimental data in the context of a particular model. Target applications include model-driven data analysis, parameter estimation, and sharing and archiving model simulations.

Results: We propose the Systems Biology Results Markup Language (SBRML), an XML-based language which associates a model with several data sets. Each data set is represented as a series of values associated with model variables, and their corresponding parameter values. SBRML provides a flexible way of indexing the results to model parameter values, which supports both spreadsheet-like data and multidimensional data cubes. We present and discuss several examples of SBRML usage in applications such as enzyme kinetics, microarray gene expression, and various types of simulation results.

Availability and Implementation: The XML Schema file for SBRML is available at under the Academic Free License (AFL) v3.0.
Irena Spasić, Evangelos Simeonidis, Hanan Messiha, Norman Paton and Douglas Kell (2009) KiPar, a tool for systematic information retrieval regarding parameters for kinetic modelling of yeast metabolic pathways. Bioinformatics, Vol. 25, No. 11, pp. 1404-1411 [PMID: 19336445] [DOI: 10.1093/bioinformatics/btp175]
Motivation: Most experimental evidence on kinetic parameters is buried in the literature, whose manual searching is complex, time-consuming and partial. These shortcomings become particularly acute in systems biology, where these parameters need to be integrated into detailed, genome-scale, metabolic models. These problems are addressed by KiPar, a dedicated information retrieval system designed to facilitate access to the literature relevant for kinetic modelling of a given metabolic pathway in yeast. Searching for kinetic data in the context of an individual pathway offers modularity as a way of tackling the complexity of developing a full metabolic model. It is also suitable for large-scale mining, since multiple reactions and their kinetic parameters can be specified in a single search request, rather than one reaction at a time, which is unsuitable given the size of genome-scale models.

Results: We developed an integrative approach, combining public data and software resources for the rapid development of large-scale text mining tools targeting complex biological information. The user supplies input in the form of identifiers used in relevant data resources to refer to the concepts of interest, e.g. EC numbers, GO and SBO identifiers. By doing so, the user is freed from providing any other knowledge or terminology concerned with these concepts and their relations, since they are retrieved from these and cross-referenced resources automatically. The terminology acquired is used to index the literature by mapping concepts to their synonyms, and then to textual documents mentioning them. The indexing results and the previously acquired knowledge about relations between concepts are used to formulate complex search queries aiming at documents relevant to the user's information needs. The conceptual approach is demonstrated in the implementation of KiPar. Evaluation reveals that KiPar performs better than a Boolean search. The precision achieved for abstracts (60%) and full-text articles (48%) is considerably better than the baseline precision (44% and 24% respectively). The baseline recall is improved by 36% for abstracts and by 100% for full text. It appears that full-text articles are a much richer source of information on kinetic data than are their abstracts. Finally, the combined results for abstracts and full text compared to curated literature provide high values for relative recall (88%) and novelty ratio (92%), suggesting that the system is able to retrieve a high proportion of new documents.

Availability: Source code and documentation are available at:
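The query-formulation step described in the Results above can be illustrated with a minimal sketch: concept identifiers supplied by the user are expanded into synonym lists, which are then combined into a single Boolean query covering a reaction and its kinetic parameters. The synonym table, identifiers and query grammar below are purely illustrative, not KiPar's actual data or implementation.

```python
# Minimal sketch of ontology-driven query formulation: concept identifiers
# are expanded into synonym lists, which are then combined into one Boolean
# query. All synonym data below are invented for illustration.

def expand(concept_id, synonyms):
    """Look up the synonyms recorded for a concept identifier."""
    return synonyms.get(concept_id, [])

def or_group(terms):
    """Join alternative surface forms of a concept with OR."""
    return "(" + " OR ".join(f'"{t}"' for t in terms) + ")"

def build_query(enzyme_id, parameter_ids, synonyms):
    """AND together the enzyme group and the kinetic-parameter group."""
    enzyme = or_group(expand(enzyme_id, synonyms))
    params = or_group(
        [s for pid in parameter_ids for s in expand(pid, synonyms)])
    return f"{enzyme} AND {params}"

# Illustrative synonym table keyed by EC and SBO identifiers.
SYNONYMS = {
    "EC 2.7.1.1": ["hexokinase", "glucokinase"],
    "SBO:0000027": ["Michaelis constant", "Km"],
    "SBO:0000025": ["catalytic rate constant", "kcat"],
}

query = build_query("EC 2.7.1.1", ["SBO:0000027", "SBO:0000025"], SYNONYMS)
print(query)
```

The point of the sketch is that the user supplies only identifiers; terminology and cross-references are pulled in automatically from the resources behind them.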
Hui Yang, Irena Spasić, John Keane and Goran Nenadić (2009) A text mining approach to the prediction of a disease status from clinical discharge summaries. Journal of the American Medical Informatics Association, Vol. 16, No. 4, pp. 596-600 [PMID: 19390098] [DOI: 10.1197/jamia.M3096]
We present a system developed for the Challenge in Natural Language Processing for Clinical Data - the i2b2 obesity challenge, whose aim was to automatically identify the status of obesity and 15 related co-morbidities in patients using their clinical discharge summaries. The challenge consisted of two tasks, textual and intuitive. The textual task was to identify explicit references to the diseases, whereas the intuitive task focused on the prediction of the disease status when the evidence was not explicitly asserted. We assembled a set of resources to lexically and semantically profile the diseases and their associated symptoms, treatments, etc. These features were explored in a hybrid text mining approach, which combined dictionary look-up, rule-based and machine-learning methods. The implemented method achieved the macro-averaged F-measure of 81% for the textual task (which was the highest achieved in the challenge) and 63% for the intuitive task (ranked 7th out of 28 teams - the highest was 66%). The micro-averaged F-measure showed an average accuracy of 97% for textual and 96% for intuitive annotations. The performance achieved was in line with the agreement between human annotators, indicating the potential of text mining for accurate and efficient prediction of disease statuses from clinical discharge summaries.
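The abstract reports both macro- and micro-averaged F-measures, and the distinction matters for interpreting the scores. A small sketch with made-up counts shows the difference: macro averages the per-class F1 scores (each disease weighted equally), while micro pools true/false positive counts over all classes, so frequent classes dominate.

```python
# Sketch of macro- vs micro-averaged F-measure. The (tp, fp, fn) counts per
# class are invented for illustration, not the i2b2 challenge results.

def f1(tp, fp, fn):
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

# (tp, fp, fn) per class -- a frequent class and a rare, harder class.
counts = {"obesity": (90, 5, 5), "asthma": (8, 4, 6)}

# Macro: average the per-class F1 scores.
macro = sum(f1(*c) for c in counts.values()) / len(counts)
# Micro: pool the counts across classes, then compute one F1.
micro = f1(*(sum(col) for col in zip(*counts.values())))
print(round(macro, 3), round(micro, 3))
```

Here micro exceeds macro because the frequent class is classified well, which mirrors why the micro-averaged figures in the abstract (around 96-97%) are so much higher than the macro-averaged ones.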
Marie Brown, Warwick Dunn, Paul Dobson, Yogendra Patel, Cate Winder, Sue Francis-McIntyre, Paul Begley, Kathleen Carroll, David Broadhurst, Andy Tseng, Neil Swainston, Irena Spasić, Royston Goodacre and Douglas Kell (2009) Mass spectrometry tools and metabolite-specific databases for molecular identification in metabolomics. The Analyst, Vol. 134, No. 7, pp. 1322-1332 [PMID: 19562197] [DOI: 10.1039/b901179j]
The chemical identification of mass spectrometric signals in metabolomic applications is important to provide conversion of analytical data to biological knowledge about metabolic pathways. The complexity of electrospray mass spectrometric data acquired from a range of samples (serum, urine, yeast intracellular extracts, yeast metabolic footprints, placental tissue metabolic footprints) has been investigated and has defined the frequency of different ion types routinely detected. Although some ion types were expected (protonated and deprotonated peaks, isotope peaks, multiply charged peaks), others were not (sodium formate adduct ions). In parallel, the Manchester Metabolomics Database (MMD) has been constructed with data from genome-scale metabolic reconstructions, HMDB, KEGG, Lipid Maps, BioCyc and DrugBank to provide knowledge on 42,687 endogenous and exogenous metabolite species. The combination of accurate mass data for a large collection of metabolites, theoretical isotope abundance data and knowledge of the different ion types detected allowed a greater number of electrospray mass spectrometric signals to be putatively identified, and with greater confidence, in the samples studied. To provide definitive identification, metabolite-specific mass spectral libraries for UPLC-MS and GC-MS have been constructed for 1,065 commercially available authentic standards. The MMD data are available at
Markus J. Herrgård, Neil Swainston, Paul Dobson, Warwick B. Dunn, K. Yalçin Arga, Mikko Arvas, Nils Blüthgen, Simon Borger, Roeland Costenoble, Matthias Heinemann, Michael Hucka, Peter Li, Wolfram Liebermeister, Monica L. Mo, Ana Paula Oliveira, Dina Petranović, Stephen Pettifer, Evangelos Simeonidis, Kieran Smallbone, Irena Spasić, Dieter Weichart, Roger Brent, David S. Broomhead, Hans V. Westerhoff, Betül Kirdar, Merja Penttilä, Edda Klipp, Bernhard Ø. Palsson, Uwe Sauer, Stephen G. Oliver, Pedro Mendes, Jens Nielsen and Douglas B. Kell (2008) A consensus yeast metabolic network reconstruction obtained from a community approach to systems biology. Nature Biotechnology, Vol. 26, No. 10, pp. 1155-1160 [PMID: 18846089] [DOI: 10.1038/nbt1492]
Genomic data allow the large-scale manual or semi-automated assembly of metabolic network reconstructions, which provide highly curated organism-specific knowledge bases. Although several genome-scale network reconstructions describe Saccharomyces cerevisiae metabolism, they differ in scope and content, and use different terminologies to describe the same chemical entities. This makes comparisons between them difficult and underscores the desirability of a consolidated metabolic network that collects and formalizes the 'community knowledge' of yeast metabolism. We describe how we have produced a consensus metabolic network reconstruction for S. cerevisiae. In drafting it, we placed special emphasis on referencing molecules to persistent databases or using database-independent forms, such as SMILES or InChI strings, as this permits their chemical structure to be represented unambiguously and in a manner that permits automated reasoning. The reconstruction is readily available via a publicly accessible database and in the Systems Biology Markup Language (SBML). It can be maintained as a resource that serves as a common denominator for studying the systems biology of yeast. Similar strategies should benefit communities studying genome-scale metabolic networks of other organisms.
Irena Spasić, Daniel Schober, Susanna-Assunta Sansone, Dietrich Rebholz-Schuhmann, Douglas Kell and Norman Paton (2008) Facilitating the development of controlled vocabularies for metabolomics technologies with text mining. BMC Bioinformatics, Vol. 9, Suppl. 5, S5 [PMID: 18460187] [DOI: 10.1186/1471-2105-9-S5-S5]
Background: Many bioinformatics applications rely on controlled vocabularies or ontologies to consistently interpret and seamlessly integrate information scattered across public resources. Experimental data sets from metabolomics studies need to be integrated with one another, but also with data produced by other types of omics studies in the spirit of systems biology, hence the pressing need for vocabularies and ontologies in metabolomics. However, it is time-consuming and non-trivial to construct these resources manually.

Results: We describe a methodology for rapid development of controlled vocabularies, a study originally motivated by the needs for vocabularies describing metabolomics technologies. We present case studies involving two controlled vocabularies (for nuclear magnetic resonance spectroscopy and gas chromatography) whose development is currently underway as part of the Metabolomics Standards Initiative. The initial vocabularies were compiled manually, providing a total of 243 and 152 terms. A total of 5,699 and 2,612 new terms were acquired automatically from the literature. The analysis of the results showed that full-text articles (especially the Materials and Methods sections) are the major source of technology-specific terms as opposed to paper abstracts.

Conclusions: We suggest a text mining method for efficient corpus-based term acquisition as a way of rapidly expanding a set of controlled vocabularies with the terms used in the scientific literature. We adopted an integrative approach, combining relatively generic software and data resources for time- and cost-effective development of a text mining tool for expansion of controlled vocabularies across various domains, as a practical alternative to both manual term collection and tailor-made named entity recognition methods.
Warwick Dunn, David Broadhurst, David Ellis, Marie Brown, Anthony Halsall, Steven O'Hagan, Irena Spasić, Andrew Tseng and Douglas Kell (2008) A GC-TOF-MS study of the stability of serum and urine metabolomes during the UK Biobank sample collection and preparation protocols. International Journal of Epidemiology, Vol. 37, pp. i23-i30 [PMID: 18381390] [DOI: 10.1093/ije/dym281]
Background: The stability of mammalian serum and urine in large metabolomic investigations is essential for accurate, valid and reproducible studies. The stability of mammalian serum and urine, either processed immediately by freezing at -80°C or stored at 4°C for 24 hours before being frozen, was compared in a pilot metabolomic study of samples from 40 separate healthy volunteers.

Methods: Metabolic profiling with GC-TOF-MS was performed for serum and urine samples collected from 40 volunteers and stored at -80°C, or at 4°C for 24 hours before being frozen. Subsequent Kruskal-Wallis and Principal Components Analysis methods were used to assess whether metabolomic differences were detected between samples stored at 4°C for 0 or 24 hours.

Results: More than 700 unique metabolite peaks were detected, with over 200 metabolite peaks detected in any one sample. Principal Components Analysis (PCA) of serum and urine data showed that the variance associated with the replicate analysis per sample (analytical variance) was of the same magnitude as the variance observed between samples stored at 4°C or -80°C for 24 hours (biological variance). From a functional point of view, the metabolomic composition of samples did not change in a statistically significant manner when stored under the two different conditions.

Conclusions: Based on this small pilot study, the UK Biobank sampling, transport and fractionation protocols are considered suitable to provide samples which can produce scientifically robust and valid data in metabolomic studies.

Keywords: metabolomics, metabolic profiling, GC-MS, univariate analysis, multivariate analysis, biofluid, serum, urine
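The variance comparison at the heart of the Results above can be sketched in a few lines: if the spread between replicate measurements of the same sample (analytical variance) is of the same magnitude as the difference between the two storage conditions, storage had no detectable metabolomic effect. All intensities below are simulated; this is not UK Biobank data.

```python
# Sketch of the analytical- vs between-condition variance comparison.
# One metabolite peak is simulated per volunteer under two storage
# conditions that (deliberately) differ only by analytical noise.
import random
import statistics

random.seed(0)
n = 40  # volunteers

base = [random.gauss(100, 10) for _ in range(n)]            # true intensity
frozen = [b + random.gauss(0, 1) for b in base]             # -80C immediately
chilled = [b + random.gauss(0, 1) for b in base]            # 24 h at 4C first

# Analytical variance: spread of the paired differences per volunteer.
analytical = statistics.variance(f - c for f, c in zip(frozen, chilled))
# Between-condition effect: squared difference of the condition means.
between = (statistics.mean(frozen) - statistics.mean(chilled)) ** 2
print(round(analytical, 2), round(between, 4))
```

With no real storage effect the between-condition term stays within the analytical noise, which is the pattern the pilot study observed on the principal components.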
Warwick Dunn, David Broadhurst, Sasalu Deepak, Mamta Buch, Garry McDowell, Irena Spasić, David Ellis, Nicholas Brooks, Douglas Kell and Ludwig Neyses (2007) Serum metabolomics reveals many novel metabolic markers of heart failure, including pseudouridine and 2-oxoglutarate. Metabolomics, Vol. 3, No. 4, pp. 413-426 [DOI: 10.1007/s11306-007-0063-5]
There is intense interest in the identification of novel biomarkers which improve the diagnosis of heart failure. Serum samples from 52 patients with systolic heart failure (EF<40% plus signs and symptoms of failure) and 57 controls were analyzed by gas chromatography - time of flight - mass spectrometry and the raw data reduced to 272 statistically robust metabolite peaks. 38 peaks showed a significant difference between case and control (p < 5×10⁻⁵). Two such metabolites were pseudouridine, a modified nucleotide present in tRNA and rRNA and a marker of cell turnover, and the tricarboxylic acid cycle intermediate 2-oxoglutarate. Three further compounds were also excellent discriminators between patients and controls: 2-hydroxy-2-methylpropanoic acid, erythritol and 2,4,6-trihydroxypyrimidine. These findings demonstrate the power of data-driven metabolomics approaches to identify such markers of disease.

Keywords: heart failure, metabolomics, biomarkers, pseudouridine, 2-oxoglutarate.
Susanna-Assunta Sansone, Daniel Schober, Helen Atherton, Oliver Fiehn, Helen Jenkins, Philippe Rocca-Serra, Denis Rubtsov, Irena Spasić, Larisa Soldatova, Chris Taylor, Andy Tseng, Mark Viant and the Ontology Working Group Members (2007) Metabolomics Standards Initiative - Ontology Working Group: Work in progress. Metabolomics, Vol. 3, No. 3, pp. 249-256 [DOI: 10.1007/s11306-007-0069-z]
In this article we present the activities of the Ontology Working Group (OWG) under the Metabolomics Standards Initiative (MSI) umbrella. Our endeavour aims to synergise the work of several communities, where independent activities are underway to develop terminologies and databases for metabolomics investigations. We have joined forces to rise to the challenges associated with interpreting and integrating experimental process and data across disparate sources (software and databases, private and public). Our focus is to support the activities of the other MSI working groups by developing a common semantic framework to enable metabolomics-user communities to consistently annotate the experimental process and to enable meaningful exchange of datasets. Our work is accessible via a public webpage and a draft ontology has been posted under the Open Biological Ontology umbrella. At the very outset, we have agreed to minimize duplications across omics domains through extensive liaisons with other communities under the OBO Foundry. This is work in progress and we welcome new participants willing to volunteer their time and expertise to this open effort.

Keywords: controlled vocabulary, annotation, terminology, semantic, metadata, ontology, functional genomics, metabolomics, metabonomics, standard, Metabolomics Society, Metabolomics Standards Initiative, OBO.
Irena Spasić, Warwick Dunn, Giles Velarde, Andy Tseng, Helen Jenkins, Nigel Hardy, Stephen Oliver and Douglas Kell (2006) MeMo: a hybrid SQL/XML approach to metabolomic data management for functional genomics. BMC Bioinformatics, Vol. 7, 281 [PMID: 16753052] [DOI: 10.1186/1471-2105-7-281] highly accessed
Background: The genome sequencing projects have shown our limited knowledge regarding gene function, e.g. S. cerevisiae has 5,000-6,000 genes, of which nearly 1,000 have an uncertain function. Their gross influence on the behaviour of the cell can be observed using large-scale metabolomic studies. The metabolomic data produced need to be structured and annotated in a machine-usable form to facilitate the exploration of the hidden links between the genes and their functions.

Description: MeMo is a formal model for representing metabolomic data and the associated metadata. Two predominant platforms (SQL and XML) are used to encode the model. MeMo has been implemented as a relational database using a hybrid approach combining the advantages of the two technologies. It represents a practical solution for handling the sheer volume and complexity of the metabolomic data effectively and efficiently. The MeMo model and the associated software are available at

Conclusions: The maturity of relational database technology is used to support efficient data processing. The scalability and self-descriptiveness of XML are used to simplify the relational schema and facilitate the extensibility of the model necessitated by the creation of new experimental techniques. Special consideration is given to data integration issues as part of the systems biology agenda. MeMo has been physically integrated and cross-linked to related metabolomic and genomic databases. Semantic integration with other relevant databases has been supported through ontological annotation. Compatibility with other data formats is supported by automatic conversion.
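The hybrid SQL/XML idea described above can be sketched with a toy schema: stable, frequently queried fields live in relational columns, while variable experiment metadata is stored as an XML document in a text column and parsed only on demand. The schema and metadata below are hypothetical, not MeMo's actual model.

```python
# Sketch of a hybrid relational/XML store: SQL handles the structured,
# indexable fields; XML keeps the extensible metadata self-describing.
import sqlite3
import xml.etree.ElementTree as ET

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE experiment (
    id INTEGER PRIMARY KEY,
    organism TEXT NOT NULL,  -- relational: filtered and joined efficiently
    metadata TEXT            -- XML document: extensible without schema change
)""")

meta = ("<protocol><instrument>GC-TOF-MS</instrument>"
        "<column_temp unit='C'>250</column_temp></protocol>")
con.execute("INSERT INTO experiment VALUES (1, 'S. cerevisiae', ?)", (meta,))

# SQL side: efficient access via the structured columns.
organism, xml_blob = con.execute(
    "SELECT organism, metadata FROM experiment WHERE id = 1").fetchone()

# XML side: drill into the flexible metadata only when needed.
instrument = ET.fromstring(xml_blob).findtext("instrument")
print(organism, instrument)
```

The design choice this illustrates is the one the conclusions describe: new experimental techniques add elements inside the XML document rather than forcing changes to the relational schema.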
Stephen Wilkinson, Irena Spasić and David Ellis (2006) Genomes to Systems 3. Metabolomics, Vol. 2, No. 3, pp. 165-170 [DOI: 10.1007/s11306-006-0030-6]
A report on the third Genomes to Systems consortium conference, which portrayed the breadth of the post-genome sciences including Genomics, Transcriptomics, Proteomics, Metabolomics, Informatics, and integrative Systems Biology.
Irena Spasić, Sophia Ananiadou, John McNaught and Anand Kumar (2005) Text mining and ontologies in biomedicine: making sense of raw text. Briefings in Bioinformatics, Vol. 6, No. 3, pp. 239-251 [PMID: 16212772] [DOI: 10.1093/bib/6.3.239]
The volume of biomedical literature is increasing at such a rate that it is becoming difficult to locate, retrieve and manage the reported information without text mining, which aims to automatically distill information, extract facts, discover implicit links and generate hypotheses relevant to user needs. Ontologies, as conceptual models, provide the necessary framework for semantic representation of textual information. The principal link between text and an ontology is terminology, which maps terms to domain-specific concepts. In this article, we summarize different approaches in which ontologies have been used for text mining applications in biomedicine.
Douglas Kell, Marie Brown, Hazel Davey, Warwick Dunn, Irena Spasić and Stephen Oliver (2005) Metabolic footprinting and systems biology: the medium is the message. Nature Reviews Microbiology, Vol. 3, No. 7, pp. 557-565 [PMID: 15953932] [DOI: 10.1038/nrmicro1177]
One element of classical systems analysis treats a system as a black or grey box, the inner structure and behaviour of which can be analysed and modelled by varying an internal or external condition, probing it from outside and studying the effect of the variation on the external observables. The result is an understanding of the inner make-up and workings of the system. The equivalent of this in biology is to observe what a cell or system excretes under controlled conditions - the 'metabolic footprint' or exometabolome - as this is readily and accurately measurable. Here, we review the principles, experimental approaches and scientific outcomes that have been obtained with this useful and convenient strategy.
Irena Spasić, Sophia Ananiadou and Junichi Tsujii (2005) MaSTerClass: a case-based reasoning system for the classification of biomedical terms. Bioinformatics, Vol. 21, No. 11, pp. 2748-2758 [PMID: 15728115] [DOI: 10.1093/bioinformatics/bti338]
Motivation: The sheer volume of textually described biomedical knowledge creates the need for natural language processing (NLP) applications in order to allow flexible and efficient access to relevant information. Specialised semantic networks (such as biomedical ontologies, terminologies or semantic lexicons) can significantly enhance these applications by supplying the necessary terminological information in a machine-readable form. Due to the explosive growth of bio-literature, new terms (representing newly identified concepts or variations of existing terms) may not be explicitly described within the network and hence cannot be fully exploited by NLP applications. Linguistic and statistical clues can be used to extract many new terms from free text. The extracted terms still need to be correctly positioned relative to other terms in the network. Classification as a means of semantic typing represents the first step in updating a semantic network with new terms.

Results: The MaSTerClass system implements the case-based reasoning methodology for the classification of biomedical terms.

Availability: MaSTerClass is available at It is distributed under an open source licence for educational and research purposes. The software requires Java, JWSDP, Ant, MySQL and X-hive to be installed and licences obtained separately where needed.
Marie Brown, Warwick Dunn, David Ellis, Royston Goodacre, Julia Handl, Joshua Knowles, Steve O'Hagan, Irena Spasić and Douglas Kell (2005) A Metabolome pipeline: from concept to data to knowledge. Metabolomics, Vol. 1, No. 1, pp. 39-51 [DOI: 10.1007/s11306-005-1106-4]
Metabolomics, like other omics methods, produces huge datasets of biological variables, along with the necessary metadata. However, regardless of the form in which these are produced, they are merely the ground substance for assisting us in answering biological questions. In this short tutorial review and position paper we seek to set out some of the elements of 'best practice' in the optimal acquisition of such data, and in the means by which they may be turned into reliable knowledge. Many of these steps involve the solution of what amount to combinatorial optimization problems, and methods developed for these, especially those based on evolutionary computing, are proving valuable. This is done in terms of a 'pipeline' that goes from the design of good experiments, through instrumental optimization, data storage and manipulation, and the chemometric data processing methods in common use, to the necessary means of validation and cross-validation for giving conclusions that are credible and likely to be robust when applied in comparable circumstances and to samples not used in their generation.
Irena Spasić and Sophia Ananiadou (2004) Using automatically learnt verb selectional preferences for classification of biomedical terms. Journal of Biomedical Informatics, Vol. 37, No. 6, pp. 483-497 [PMID: 15542021] [DOI: 10.1016/j.jbi.2004.08.002]
In this paper, we present an approach to term classification based on verb selectional patterns (VSPs), where such a pattern is defined as a set of semantic classes that could be used in combination with a given domain-specific verb. VSPs have been automatically learnt based on the information found in a corpus and an ontology in the biomedical domain. Prior to the learning phase, the corpus is terminologically processed: term recognition is performed by both looking up the dictionary of terms listed in the ontology and applying the C/NC-value method for on-the-fly term extraction. Subsequently, domain-specific verbs are automatically identified in the corpus based on the frequency of occurrence and the frequency of their co-occurrence with terms. VSPs are then learnt automatically for these verbs. Two machine learning approaches are presented. The first approach has been implemented as an iterative generalisation procedure based on a partial order relation induced by the domain-specific ontology. The second approach exploits the idea of genetic algorithms. Once the VSPs are acquired, they can be used to classify newly recognised terms co-occurring with domain-specific verbs. Given a term, the most frequently co-occurring domain-specific verb is selected. Its VSP is used to constrain the search space by focusing on potential classes of the given term. A nearest-neighbour approach is then applied to select a class from the constrained space of candidate classes. The most similar candidate class is predicted for the given term. The similarity measure used for this purpose combines contextual, lexical, and syntactic properties of terms.
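The final classification step described above can be sketched compactly: a term's most frequently co-occurring domain verb supplies a verb selectional pattern (VSP) that constrains the candidate classes, and a nearest-neighbour step then picks the most similar class. The verbs, classes and similarity scores below are invented for illustration, not learnt from a corpus.

```python
# Sketch of VSP-constrained term classification. Everything here is a toy:
# real VSPs are learnt from a corpus and ontology, and real similarity
# combines contextual, lexical and syntactic properties of terms.

# Learnt VSPs: each domain-specific verb maps to the classes it admits.
VSP = {
    "activate": {"protein", "gene"},
    "inhibit":  {"protein", "enzyme"},
}

def class_similarity(term, cls):
    """Hypothetical similarity of a new term to a prototype of each class."""
    scores = {("Raf-1", "protein"): 0.8, ("Raf-1", "gene"): 0.6}
    return scores.get((term, cls), 0.0)

def classify(term, verb_counts):
    # 1. select the most frequently co-occurring domain-specific verb
    verb = max(verb_counts, key=verb_counts.get)
    # 2. the verb's VSP constrains the search space of candidate classes
    candidates = VSP[verb]
    # 3. nearest neighbour over the constrained candidates
    return max(candidates, key=lambda c: class_similarity(term, c))

print(classify("Raf-1", {"activate": 7, "inhibit": 2}))
```

The VSP acts purely as a filter: the nearest-neighbour comparison never has to consider classes the verb does not select for, which is what makes the search space tractable.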
Goran Nenadić, Irena Spasić and Sophia Ananiadou (2004) Mining term similarities from corpora. Terminology, Vol. 10, No. 1, pp. 55-80 [DOI: 10.1075/term.10.1.04nen]
In this article we present an approach to the automatic discovery of term similarities, which may serve as a basis for a number of term-oriented knowledge mining tasks. The method for term comparison combines internal (lexical similarity) and two types of external criteria (syntactic and contextual similarities). Lexical similarity is based on sharing lexical constituents (i.e. term heads and modifiers). Syntactic similarity relies on a set of specific lexico-syntactic co-occurrence patterns indicating the parallel usage of terms (e.g. within an enumeration or within a term coordination/conjunction structure), while contextual similarity is based on the usage of terms in similar contexts. Such contexts are automatically identified by a pattern mining approach, and a procedure is proposed to assess their domain-specific and terminological relevance. Although automatically collected, these patterns are domain dependent and identify contexts in which terms are used. Different types of similarities are combined into a hybrid similarity measure, which can be tuned for a specific domain by learning optimal weights for individual similarities. The suggested similarity measure has been tested in the domain of biomedicine, and some experiments are presented.
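The hybrid similarity measure described above can be sketched as a weighted sum of its three components. The lexical component below follows the idea of shared lexical constituents; the syntactic and contextual scores, and the weights, are illustrative stand-ins for values that would come from corpus patterns and domain tuning.

```python
# Sketch of a hybrid term similarity combining lexical, syntactic and
# contextual evidence with tunable weights. Component scores other than the
# lexical one are invented for illustration.

def lexical_sim(t1, t2):
    """Share of lexical constituents (head/modifier words) in common."""
    w1, w2 = set(t1.split()), set(t2.split())
    return len(w1 & w2) / len(w1 | w2)

def hybrid_sim(t1, t2, syntactic, contextual, weights=(0.4, 0.3, 0.3)):
    wl, ws, wc = weights
    return wl * lexical_sim(t1, t2) + ws * syntactic + wc * contextual

# The two terms share the head "receptor"; the syntactic score would reflect
# e.g. their parallel use in coordinations, the contextual score their use in
# similar contexts (both values made up here).
s = hybrid_sim("progesterone receptor", "oestrogen receptor",
               syntactic=0.7, contextual=0.5)
print(round(s, 3))
```

Learning the weight vector per domain, as the article proposes, amounts to optimising `weights` against a set of term pairs with known similarity judgements.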
Goran Nenadić, Irena Spasić and Sophia Ananiadou (2003) Terminology-driven mining of biomedical literature. Bioinformatics, Vol. 19, No. 8, pp. 938-943 [PMID: 12761055] [DOI: 10.1093/bioinformatics/btg105]
In this paper we present an overview of an integrated framework for terminology-driven mining from biomedical literature. The framework integrates the following components: automatic term recognition, term variation handling, acronym acquisition, automatic discovery of term similarities and term clustering. The term variant recognition is incorporated into terminology recognition process by taking into account orthographical, morphological, syntactic, lexico-semantic and pragmatic term variations. In particular, we address acronyms as a common way of introducing term variants in biomedical papers. Term clustering is based on the automatic discovery of term similarities. We use a hybrid similarity measure, where terms are compared by using both internal and external evidence. The measure combines lexical, syntactical and contextual similarity. Experiments on terminology recognition and structuring performed on a corpus of biomedical abstracts are presented.
Goran Nenadić, Hideki Mima, Irena Spasić, Sophia Ananiadou and Junichi Tsujii (2002) Terminology-based literature mining and knowledge acquisition in biomedicine. International Journal of Medical Informatics, Vol. 67, No. 1-3, pp. 33-48 [PMID: 12460630] [DOI: 10.1016/S1386-5056(02)00055-2]
In this paper we describe TIMS, an integrated knowledge management system for the domain of molecular biology and biomedicine, in which terminology-driven literature mining, knowledge acquisition, knowledge integration, and XML-based knowledge retrieval are combined using tag information management and ontology inference. The system integrates automatic terminology acquisition, term variation management, hierarchical term clustering, tag-based information extraction, and ontology-based query expansion. TIMS supports introducing and combining different types of tags (linguistic and domain-specific, manual and automatic). Tag-based interval operations and a query language are introduced in order to facilitate knowledge acquisition and retrieval from XML documents. Through knowledge acquisition examples, we illustrate the way in which literature mining techniques can be utilised for knowledge discovery from documents.

Refereed book chapters:

Irena Spasić and Douglas Kell (2013) Searching for kinetic parameters with KiPar. In W. Dubitzky, O. Wolkenhauer, H. Yokota and K.-H. Cho (Eds.): Encyclopedia of Systems Biology, Springer
Definition KiPar is an Information Retrieval system designed to facilitate access to the literature relevant for kinetic modeling of metabolic pathways in yeast. Information supplied as user input includes the enzymes catalyzing the reactions of interest and the parameters whose values are required for kinetic modeling. The output is produced as a list of documents (either abstracts from PubMed or full-text articles from PubMed Central) that should contain the required values of kinetic parameters. There are two groups of users of this specific application: (1) experimentalists who wish to compare experimentally estimated values of kinetic parameters to those reported in the literature, and (2) mathematical modelers who wish to incorporate known values of kinetic parameters into metabolic models.
Neil Swainston, Daniel Jameson, Peter Li, Irena Spasić, Pedro Mendes and Norman Paton (2010) Integrative information management for systems biology. In P. Lambrix and G. Kemp (Eds.): Data Integration in the Life Sciences, LNCS 6254, Springer, pp. 164-178 [DOI: 10.1007/978-3-642-15120-0_13]
Systems biology develops mathematical models of biological systems that seek to explain, or better still predict, how the system behaves. In bottom-up systems biology, systematic quantitative experimentation is carried out to obtain the data required to parameterize models, which can then be analyzed and simulated. This paper describes an approach to integrated information management that supports bottom-up systems biology, with a view to automating, or at least minimizing the manual effort required during, creation of quantitative models from qualitative models and experimental data. Automating the process makes model construction more systematic, supports good practice at all stages in the pipeline, and allows timely integration of high throughput experimental results into models.

Keywords: computational systems biology, workflow
Marco Masseroli, Norman Paton and Irena Spasić (2010) Search computing and the life sciences. In S. Ceri and M. Brambilla (Eds.): Search Computing Challenges and Directions, LNCS 5950, Springer, pp. 291-306 [DOI: 10.1007/978-3-642-12310-8_15]
Search Computing has been proposed to support the integration of the results of search engines with other data and computational resources. A key feature of the resulting integration platform is direct support for multi-domain ordered data, reflecting the fact that search engines produce ranked outputs, which should be taken into account when the results of several requests are combined. In the life sciences, there are many different types of ranked data. For example, ranked data may represent many different phenomena, including physical ordering within a genome, algorithmically assigned scores that represent levels of sequence similarity, and experimentally measured values such as expression levels. This chapter explores the extent to which the search computing functionalities designed for use with search engine results may be applicable for different forms of ranked data that are encountered when carrying out data integration in the life sciences. This is done by classifying different types of ranked data in the life sciences, providing examples of different types of ranking and ranking integration needs in the life sciences, identifying issues in the integration of such ranked data, and discussing techniques for drawing conclusions from diverse rankings.

Keywords: search computing, bioinformatics, data integration, ranked data
Goran Nenadić and Irena Spasić (2008) Towards automatic terminology extraction in Serbian. In G. Zybatow et al. (Eds.): Formal Description of Slavic Languages, Peter Lang, Frankfurt/Main
In this article we discuss an automatic approach used to facilitate terminological processing of texts in Serbian. The approach combines linguistic knowledge (term formation patterns) with corpus-based statistical measures. Generic morpho-syntactic filters are used to extract individual term candidates. The filters encode information on grammatical agreements used as clues to discover boundaries of term candidates in a free text and to determine their inner structure (nested terms). The extracted candidates are further grouped into sets that unify their inflectional and lexical variants, and are subsequently assigned termhoods (i.e. likelihood to represent actual terms). We present preliminary results of the terminological processing of a textbook corpus in the domains of mathematics and computer science.
Irena Spasić and Sophia Ananiadou (2005) A flexible measure of contextual similarity for biomedical terms. In R. Altman et al. (Eds.): Pacific Symposium on Biocomputing - PSB 2005. World Scientific Publishing Company, Singapore, pp. 197-208 [PMID: 15759626]
We present a measure of contextual similarity for biomedical terms. The contextual features need to be explored, because newly coined terms are not explicitly described and efficiently stored in biomedical ontologies and their inner features (e.g. morphologic or orthographic) do not always provide sufficient information about the properties of the underlying concepts. The context of each term can be represented as a sequence of syntactic elements annotated with biomedical information retrieved from an ontology. The sequences of contextual elements may be matched approximately by edit distance defined as the minimal cost incurred by the changes (including insertion, deletion and replacement) needed to transform one sequence into the other. Our approach augments the traditional concept of edit distance by elements of linguistic and biomedical knowledge, which together provide flexible selection of contextual features and their comparison.
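The augmented edit distance described above can be sketched with the standard dynamic-programming recurrence, modified so that the replacement cost is reduced when two contextual elements share a semantic type. The types and cost values below are illustrative, not the paper's actual cost model.

```python
# Sketch of edit distance over annotated context sequences: each context is
# a sequence of (token, semantic-type) pairs, and replacing elements of the
# same semantic type is cheaper than replacing unrelated ones.

def repl_cost(a, b):
    if a == b:
        return 0.0
    # cheaper to replace elements sharing a semantic type (assumed cost 0.5)
    return 0.5 if a[1] == b[1] else 1.0

def edit_distance(s, t, ins=1.0, dele=1.0):
    m, n = len(s), len(t)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i * dele
    for j in range(1, n + 1):
        d[0][j] = j * ins
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + dele,           # delete from s
                          d[i][j - 1] + ins,            # insert from t
                          d[i - 1][j - 1] + repl_cost(s[i - 1], t[j - 1]))
    return d[m][n]

ctx1 = [("inhibits", "VERB"), ("kinase", "ENZYME")]
ctx2 = [("inhibits", "VERB"), ("phosphatase", "ENZYME")]
print(edit_distance(ctx1, ctx2))
```

Because "kinase" and "phosphatase" share the ENZYME type, the two contexts end up closer than a purely lexical edit distance would make them, which is exactly the flexibility the measure is after.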
Goran Nenadić, Irena Spasić and Sophia Ananiadou (2005) Mining biomedical abstracts: what is in a term? In K.Y. Su et al. (Eds.): Natural Language Processing - IJCNLP 2004. LNAI 3248, Springer, pp. 797-806
In this paper we present a study of the usage of terminology in biomedical literature, with the main aim of indicating phenomena that can be helpful for automatic term recognition in the domain. Our comparative analysis is based on the terminology used in the Genia corpus. We analyse the usage of ordinary biomedical terms as well as their variants (namely inflectional and orthographic alternatives, terms with prepositions, coordinated terms, etc.), showing the variability and dynamic nature of terms used in biomedical abstracts. Term coordination and terms containing prepositions are analysed in detail. We show that there is a discrepancy between terms used in the literature and terms listed in controlled dictionaries. We also evaluate the effectiveness of incorporating different types of term variation into an automatic term recognition system.
Irena Spasić, Goran Nenadić and Sophia Ananiadou (2004) Learning to classify biomedical terms through literature mining and genetic algorithms. In Z.R. Yang et al. (Eds.): Intelligent Data Engineering and Automated Learning - IDEAL 2004. LNCS 3177, Springer, pp. 345-351
We present an approach to the classification of biomedical terms based on information acquired automatically from a corpus of relevant literature. The learning phase consists of two stages: acquisition of terminologically relevant contextual patterns (CPs) and selection of classes that apply to terms used with these patterns. CPs represent a generalisation of similar term contexts in the form of regular expressions containing lexical, syntactic and terminological information. The most probable classes for the training terms co-occurring with a statistically relevant CP are learnt by a genetic algorithm. Term classification is based on the learnt results. First, each term is associated with its most frequently co-occurring CP. The classes attached to this CP are initially suggested as the term's potential classes. Finally, the term is mapped to the most similar suggested class.
Goran Nenadić, Irena Spasić and Sophia Ananiadou (2003) Reducing lexical ambiguity in Serbo-Croatian by using genetic algorithms. In P. Kosta et al. (Eds.): Investigations into Formal Slavic Linguistics. Linguistik International, Peter Lang, Frankfurt, pp. 287-298
This paper presents an approach to the acquisition of some lexical and grammatical constraints from large corpora using genetic algorithms. The main aim is to use these constraints to automatically define local grammars that can be used to reduce the lexical ambiguity usually found in an initially tagged text. A genetic algorithm for computing the minimal representation of grammatical features of textual constituents is suggested. The algorithm incorporates two types of genes, dominant and recessive, which are specific to the features that are analysed. The resulting genetic structure describes the constraints that have to be fulfilled in order to form a correct utterance. As a case study, the suggested algorithm is applied to contexts of prepositional phrases, and the features of the corresponding noun phrases are obtained. The results coincide with (theoretical) grammars that define the constraints for such noun phrases.
Irena Spasić, Goran Nenadić, Kostas Manios and Sophia Ananiadou (2002) Supervised learning of term similarities. In Hujun Yin et al. (Eds.): Intelligent Data Engineering and Automated Learning - IDEAL 2002. LNCS 2412, Springer, pp. 429-434
In this paper we present a method for the automatic discovery and tuning of term similarities. The method is based on the automatic extraction of significant patterns in which terms tend to appear. In addition, we use lexical and functional similarities between terms to define a hybrid similarity measure as a linear combination of the three similarities. We then present a genetic algorithm approach to supervised learning of the parameters used in this linear combination. We used a domain-specific ontology to evaluate the generated similarity measures and set the direction of their convergence. The approach has been tested and evaluated in the domain of molecular biology.
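The linear combination at the heart of the hybrid measure can be sketched as follows. This is a minimal illustration, not the paper's implementation: the weights are fixed here (the paper learns them with a genetic algorithm), the lexical component is approximated by simple word overlap, and the contextual and functional components are left as caller-supplied functions since the paper derives them from corpus patterns.

```python
def lexical_sim(t1, t2):
    # Jaccard overlap of constituent words -- a simple stand-in for the
    # lexical similarity component.
    w1, w2 = set(t1.split()), set(t2.split())
    return len(w1 & w2) / len(w1 | w2)

def hybrid_sim(t1, t2, contextual, functional, weights=(0.5, 0.3, 0.2)):
    """Linear combination of contextual, lexical and functional
    similarity; the weights are illustrative placeholders."""
    a, b, c = weights
    return (a * contextual(t1, t2)
            + b * lexical_sim(t1, t2)
            + c * functional(t1, t2))
```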
Goran Nenadić, Irena Spasić and Sophia Ananiadou (2002) Term clustering using a corpus-based similarity measure. In P. Sojka et al. (Eds.): Text, Speech and Dialogue - TSD 2002. LNAI 2448, Springer, pp. 151-154
In this paper we present a method for automatic term clustering. The method uses a hybrid similarity measure to cluster terms automatically extracted from a corpus by applying the C/NC-value method. The measure comprises contextual, functional and lexical similarity, and it is used to instantiate the cell values in a similarity matrix. The clustering algorithm uses either the nearest-neighbour or Ward's method to calculate the distance between clusters. The approach has been tested and evaluated in the domain of molecular biology and the results are presented.
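The nearest-neighbour option corresponds to single-linkage agglomerative clustering, which can be sketched as follows. The similarity function here is a toy word-overlap stand-in for the hybrid measure, and the function names and threshold are illustrative.

```python
def single_linkage_clusters(terms, sim, threshold):
    """Greedy agglomerative clustering with nearest-neighbour (single)
    linkage: repeatedly merge the pair of clusters whose closest members
    are most similar, stopping once no pair reaches the threshold."""
    clusters = [{t} for t in terms]
    while len(clusters) > 1:
        best, pair = 0.0, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = max(sim(a, b) for a in clusters[i] for b in clusters[j])
                if s > best:
                    best, pair = s, (i, j)
        if pair is None or best < threshold:
            break
        i, j = pair
        clusters[i] |= clusters.pop(j)
    return clusters

def word_overlap(t1, t2):
    # Toy similarity: Jaccard overlap of constituent words.
    w1, w2 = set(t1.split()), set(t2.split())
    return len(w1 & w2) / len(w1 | w2)
```

Ward's method would instead merge the pair minimising the increase in within-cluster variance, which requires a vector-space rather than a pairwise-similarity representation.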
Irena Spasić and Gordana Pavlović-Lažetić (2001) Syntactic structures in a sublanguage of Serbian for querying relational databases. In G. Zybatow et al. (Eds.): Current Issues in Formal Slavic Linguistics. Peter Lang, Frankfurt/Main, pp. 478-488
This paper deals with syntactic structures identified in a sublanguage of Serbian for querying relational databases. Three levels of syntactic description of the sublanguage are defined: the word, syntagmatic and sentence levels. An algorithm for the complete syntactic analysis of a Serbian-language query over a relational database and its translation into a formal SQL query is presented. An example of partial parsing and translation is discussed.
Goran Nenadić and Irena Spasić (2000) The recognition and acquisition of compound names from corpora. In D. Christodoulakis (Ed.): Natural Language Processing - NLP 2000. LNAI 1835, Springer, pp.38-48
In this paper we present an approach to the acquisition of some classes of compound words from large corpora, as well as a method for the semi-automatic generation of appropriate linguistic models that can be further used for compound word recognition and for the completion of compound word dictionaries. The approach is intended for highly inflective languages such as Serbo-Croatian. The generated linguistic models are represented by local grammars.
Goran Nenadić and Irena Spasić (1999) The acquisition of some lexical constraints from corpora. In V. Matousek et al. (Eds.): Text, Speech and Dialogue - TSD 1999. LNAI 1692, Springer, pp. 115-120
This paper presents an approach to the acquisition of some lexical and grammatical constraints from large corpora. The constraints discussed relate to the grammatical features of a preposition and the corresponding noun phrase that together constitute a prepositional phrase. The approach is based on the extraction of the textual environment of a preposition from a corpus, which is then tagged using the system of electronic dictionaries. An algorithm for computing a minimal representation of the grammatical features associated with the corresponding noun phrases is suggested. The resulting set of features describes the constraints that a noun phrase has to fulfil in order to form a correct prepositional phrase with a given preposition. This set can be checked against other corpora.

Refereed conference papers:

Kieran Evans, Andrew Jones, Alun Preece, Francisco Quevedo, David Rogers, Irena Spasić, Ian Taylor, Vlado Stankovski, Salman Taherizadeh, Jernej Trnkoczy, George Suciu, Victor Suciu, Paul Martin, Junchao Wang and Zhiming Zhao (2015) Dynamically reconfigurable workflows for time-critical applications, in Proceedings of the 10th Workshop on Workflows in Support of Large-Scale Science, Austin, Texas, USA

Lakhveer Bhachu, Larisa Soldatova, Irena Spasić and Kate Button (2014) Mobile application KneeCare to support knee rehabilitation, Science and Information Conference, London, UK

Mark Greenwood, Glyn Elwyn, Nick Francis, Alun Preece and Irena Spasić (2013) Automatic extraction of personal experiences from patients' blogs: A case study in chronic obstructive pulmonary disease, Third International Conference on Social Computing and its Applications, Karlsruhe, Germany, pp. 377-382

Peter Burnap, Irena Spasić, W. Alex Gray, Jeremy Hilton, Omer Rana and Glyn Elwyn (2012) Protecting patient privacy in distributed collaborative healthcare environments by retaining access control of shared information, in Proceedings of 14th International Conference on Collaboration Technologies and Systems, Denver, Colorado, USA, pp. 490-497 [Springer best paper runner up]
Access control and privacy policies change during the course of collaboration. Information is often shared with collaborators outside of the traditional "perimeterized" organizational computer network. At this point the information owner (in the legal data protection sense) loses persistent control over their information. They cannot modify the policy that controls who accesses it, and have that enforced on the information wherever it resides. However, if patient consent is withdrawn or if the collaboration comes to an end naturally, or prematurely, the owner may be required to withdraw further access to their information. This paper presents a system that enhances the way access control technology is currently deployed so that information owners retain control of their access control and privacy policies, even after information has been shared.
Hui Yang, Irena Spasić, John Keane and Goran Nenadić (2008) Combining lexical profiling, rules and machine learning for disease prediction from hospital discharge summaries, in Proceedings of 2nd i2b2 Shared-Task and Workshop Challenges in Natural Language Processing for Clinical Data, Washington DC, USA

Irena Spasić, Daniel Schober, Susanna-Assunta Sansone, Dietrich Rebholz-Schuhmann, Douglas Kell, Norman Paton and the Ontology Working Group Members (2007) Facilitating the development of controlled vocabularies for metabolomics with text mining, in ISMB/ECCB Special Interest Group (SIG) Meeting Program Materials, Bio-Ontologies SIG Workshop, Vienna, Austria, pp. 103-106
Bioinformatics applications heavily rely on controlled vocabularies and ontologies to consistently interpret and seamlessly integrate information scattered across disparate public resources. Experimental data from metabolomics studies need to be integrated with one another, but also with data produced by other types of omics studies in the spirit of systems biology, hence the pressing need for vocabularies and ontologies in metabolomics. Here we describe the development of controlled vocabularies for metabolomics investigations. Manual term acquisition approaches are time-consuming, labour-intensive and error-prone, especially in a rapidly developing domain such as metabolomics, where new analytical techniques emerge regularly so that the domain experts are often compelled to use non-standardised terms. We suggest a text mining method for efficient corpus-based term acquisition as a way of rapidly expanding a set of controlled vocabularies with the terms used in the scientific literature.
Goran Nenadić, Simon Rice, Irena Spasić, Sophia Ananiadou and Benjamin Stapley (2003) Selecting text features for gene name classification: from documents to terms, in Proceedings of ACL Workshop on Natural Language Processing in Biomedicine, Sapporo, Japan, pp. 121-128
In this paper we discuss the performance of a text-based classification approach by comparing different types of features. We consider the automatic classification of gene names from the molecular biology literature, by using a support-vector machine method. Classification features range from words, lemmas and stems, to automatically extracted terms. Also, simple co-occurrences of genes within documents are considered. The preliminary experiments performed on a set of 3,000 S. cerevisiae gene names and 53,000 Medline abstracts have shown that using domain-specific terms can improve the performance compared to the standard bag-of-words approach, in particular for genes classified with higher confidence, and for under-represented classes.
Irena Spasić, Goran Nenadić and Sophia Ananiadou (2003) Using domain-specific verbs for term classification, in Proceedings of ACL Workshop on Natural Language Processing in Biomedicine, Sapporo, Japan, pp. 17-24
In this paper we present an approach to term classification based on verb complementation patterns. The complementation patterns have been automatically learnt by combining information found in a corpus and an ontology, both belonging to the biomedical domain. The learning process is unsupervised and has been implemented as an iterative reasoning procedure based on a partial order relation induced by the domain-specific ontology. First, term recognition was performed by both looking up the dictionary of terms listed in the ontology and applying the C/NC-value method. Subsequently, domain-specific verbs were automatically identified in the corpus. Finally, the classes of terms typically selected as arguments for the considered verbs were induced from the corpus and the ontology. This information was used to classify newly recognised terms. The precision of the classification method reached 64%.
Goran Nenadić, Irena Spasić and Sophia Ananiadou (2003) Morpho-syntactic clues for terminological processing in Serbian, in Proceedings of EACL Workshop on Morphological Processing of Slavic Languages, Budapest, Hungary, pp. 79-86
In this paper we discuss morpho-syntactic clues that can be used to facilitate terminological processing in Serbian. A method (called srCe) for automatic extraction of multiword terms is presented. The approach incorporates a set of generic morpho-syntactic filters for recognition of term candidates, a method for conflation of morphological variants and a module for foreign word recognition. Morpho-syntactic filters describe general term formation patterns, and are implemented as generic regular expressions. The inner structure together with the agreements within term candidates are used as clues to discover the boundaries of nested terms. The results of the terminological processing of a textbook corpus in the domains of mathematics and computer science are presented.
Irena Spasić, Goran Nenadić, Kostas Manios and Sophia Ananiadou (2003) An integrated term-based corpus query system, in Proceedings of 10th Conference of the European Chapter of the Association for Computational Linguistics, Budapest, Hungary, pp. 243-250
In this paper we describe the X-TRACT workbench, which enables efficient term-based querying against a domain-specific literature corpus. Its main aim is to aid domain specialists in locating and extracting new knowledge from scientific literature corpora. Before querying, a corpus is automatically terminologically analysed by the ATRACT system, which performs terminology recognition based on the C/NC-value method enhanced by incorporation of term variation handling. The results of terminology processing are annotated in XML, and the produced XML documents are stored in an XML-native database. All corpus retrieval operations are performed against this database using an XML query language. We illustrate the way in which the X-TRACT workbench can be utilised for knowledge discovery, literature mining and conceptual information extraction.
Goran Nenadić, Irena Spasić and Sophia Ananiadou (2003) Terminology-driven mining of biomedical literature, in Proceedings of 18th Annual ACM Symposium on Applied Computing, Melbourne, Florida, USA
Motivation: With an overwhelming amount of textual information in molecular biology and biomedicine, there is a need for effective literature mining techniques that can help biologists to gather and make use of the knowledge encoded in text documents. Although the knowledge is organised around sets of domain-specific terms, few literature mining systems incorporate deep and dynamic terminology processing.

Results: In this paper, we present an overview of an integrated framework for terminology-driven mining from biomedical literature. The framework integrates the following components: automatic term recognition, term variation handling, acronym acquisition, automatic discovery of term similarities and term clustering. Term variant recognition is incorporated into the terminology recognition process by taking into account orthographical, morphological, syntactic, lexico-semantic and pragmatic term variations. In particular, we address acronyms as a common way of introducing term variants in biomedical papers. Term clustering is based on the automatic discovery of term similarities. We use a hybrid similarity measure, in which terms are compared using both internal and external evidence. The measure combines lexical, syntactic and contextual similarity. Experiments on terminology recognition and structuring performed on a corpus of biomedical abstracts achieved precision of 98% and 71%, respectively.
Goran Nenadić, Irena Spasić and Sophia Ananiadou (2002) Automatic discovery of term similarities using pattern mining, in Proceedings of Second International Workshop on Computational Terminology - CompuTerm 2002, Taipei, Taiwan, pp. 43-49
Term recognition and clustering are key topics in automatic knowledge acquisition and text mining. In this paper we present a novel approach to the automatic discovery of term similarities, which serves as a basis for both classification and clustering of domain-specific concepts represented by terms. The method is based on the automatic extraction of significant patterns in which terms tend to appear. The approach is domain independent: it needs no manual description of domain-specific features and it is based on knowledge-poor processing of specific term features. However, the automatically collected patterns are domain specific and identify significant contexts in which terms are used. Besides features that represent contextual patterns, we use lexical and functional similarities between terms to define a combined similarity measure. The approach has been tested and evaluated in the domain of molecular biology, and preliminary results are presented.
Sophia Ananiadou, Goran Nenadić, Dietrich Schuhmann and Irena Spasić (2002) Term-based literature mining from biomedical texts, ISMB Text Data Mining SIG, Edmonton, Canada

Irena Spasić, Goran Nenadić and Sophia Ananiadou (2002) Tuning context features with genetic algorithms, in Proceedings of 3rd International Conference on Language, Resources and Evaluation, Las Palmas, Spain, pp. 2048-2054
In this paper we present an approach to tuning of context features acquired from corpora. The approach is based on the idea of a genetic algorithm (GA). We analyse a whole population of contexts surrounding related linguistic entities in order to find a generic property characteristic of such contexts. Our goal is to tune the context properties so as not to lose any correct feature values, but also to minimise the presence of ambiguous values. The GA implements a crossover operator based on dominant and recessive genes, where a gene corresponds to a context feature. A dominant gene is the one that, when combined with another gene of the same type, is inevitably reflected in the offspring. Dominant genes denote the more suitable context features. In each iteration of the GA, the number of individuals in the population is halved, finally resulting in a single individual that contains context features tuned with respect to the information contained in the training corpus. We illustrate the general method by using a case study concerned with the identification of relationships between verbs and terms complementing them. More precisely, we tune the classes of terms that are typically selected as arguments for the considered verbs in order to acquire their semantic features.
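The population-halving procedure described above can be sketched as follows, under one plausible reading of the dominant/recessive crossover: feature values shared by both parents behave as dominant genes and always survive, while unshared values are dropped unless the parents share nothing (so that no correct value is lost). All names and the encoding are illustrative assumptions, not the paper's implementation.

```python
def combine(gene1, gene2):
    """Cross two genes, each a set of candidate feature values.
    Shared values dominate; if nothing is shared, keep the union
    so that no potentially correct value is lost."""
    shared = gene1 & gene2
    return shared if shared else gene1 | gene2

def tune(population):
    """Halve the population each generation by pairing individuals
    (tuples of genes) and producing one offspring per pair, until a
    single tuned individual remains."""
    while len(population) > 1:
        nxt = [tuple(combine(g1, g2) for g1, g2 in zip(p1, p2))
               for p1, p2 in zip(population[::2], population[1::2])]
        if len(population) % 2:          # carry an unpaired individual forward
            nxt.append(population[-1])
        population = nxt
    return population[0]
```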
Goran Nenadić, Irena Spasić and Sophia Ananiadou (2002) Automatic acronym acquisition and management within domain-specific texts, in Proceedings of 3rd International Conference on Language, Resources and Evaluation, Las Palmas, Spain, pp. 2155-2162
In this paper we present a framework for the effective management of terms and their variants that are automatically acquired from domain-specific texts. In our approach, the term variant recognition is incorporated in the automatic term retrieval process by taking into account orthographical, morphological, syntactic, lexico-semantic and pragmatic term variations. In particular, we address acronyms as a common way of introducing term variants in scientific papers. We describe a method for the automatic acquisition of newly introduced acronyms and the mapping to their 'meanings', i.e. the corresponding terms. The proposed three-step procedure is based on morpho-syntactic constraints that are commonly used in acronym definitions. First, acronym definitions containing an acronym and the corresponding term are retrieved. These two elements are matched in the second step by performing morphological analysis of words and combining forms constituting the term. The problems of acronym variation and acronym ambiguity are addressed in the third step by establishing classes of term variants that correspond to specific concepts. We present the results of the acronym acquisition in the domain of molecular biology: the precision of the method ranged from 94% to 99% depending on the size of the corpus used for evaluation, whilst the recall was 73%.
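A first approximation of the retrieval-and-matching steps can be sketched as follows. This sketch only checks that each acronym letter matches the initial letter of a word in the preceding text, whereas the paper's method performs morphological analysis of the words and combining forms constituting the term; the function name and regular expression are illustrative.

```python
import re

def find_acronym_definitions(text):
    """Scan for the pattern 'expanded term (ACRONYM)' and keep pairs
    where the acronym letters match the initials of the preceding words."""
    pairs = []
    for m in re.finditer(r'\(([A-Z][A-Za-z]{1,9})\)', text):
        acronym = m.group(1)
        words = text[:m.start()].split()
        k = len(acronym)
        if len(words) < k:
            continue
        candidate = words[-k:]          # naively assume one word per letter
        if all(w[0].lower() == a.lower() for w, a in zip(candidate, acronym)):
            pairs.append((' '.join(candidate), acronym))
    return pairs
```

A full solution would also handle acronym letters drawn from word-internal positions, combining forms, and the acronym variation and ambiguity addressed in the paper's third step.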
Irena Spasić, Goran Nenadić and Sophia Ananiadou (2002) A genetic algorithm approach to unsupervised learning of context features, in Proceedings of 5th National Colloquium for Computational Linguistics in the UK, Leeds, UK, pp. 12-19
We present an approach to the unsupervised learning of some context features from corpora. The approach uses the idea of genetic algorithms. The algorithm operates on a collection of related linguistic entities, as opposed to an isolated linguistic entity. Each of the entities encodes the values for a predefined set of context features obtained by automatic tagging. Our goal is to refine these features in order to find an interpretation that is optimal in the sense that it does not lose any correct feature values, but which, on the other hand, minimises the presence of feature values that are not applicable in a specific context. Our genetic algorithm implements a novel crossover operator based on two types of genes, dominant and recessive, where a gene corresponds to a context feature.
Dubravka Pavličić and Irena Spasić (2001) The effects of irrelevant alternatives on the results of the TOPSIS method, in Proceedings of XXVIII Yugoslav Symposium on Operational Research SYM-OP-IS 2001, Belgrade, Serbia

Irena Spasić and Gordana Pavlović-Lažetić (2001) Object-oriented modelling in natural language communication with a relational database, in Selected Papers from 10th Congress of Yugoslav Mathematicians, Belgrade, Serbia, pp. 343-347
This paper describes the problems of developing a natural language interface to a relational database (RDB). These problems depend on the particular database or, more precisely, on the specific semantic domain that is modelled by the RDB. The most obvious dependency is the one reflected in the structure of the RDB, that is, the actual tables, attributes and their relationships. This information is recorded in the RDB catalogue, which can be used for the automatic generation of an OO model of the RDB. The classes of that model may serve the purpose of supporting the information extracted from a natural language query (NLQ). Possible ambiguities are gradually reduced by using the IsA relationships between the classes. If this still leaves the ambiguity unresolved, it is possible to automatically generate a menu corresponding to the class that is the source of the ambiguity. The structure of the menu is in accordance with the OO model of the RDB.
Olgica Bošković and Irena Spasić (1999) Graph theory and log-linear models, in Proceedings of XXVI Yugoslav Symposium on Operational Research SYM-OP-IS '99, Belgrade, Serbia

Irena Spasić (1996) Automatic foreign words recognition in a Serbian scientific or technical text, in Proceedings of Conference on Standardization of Terminology, Serbian Academy of Arts and Sciences, Belgrade, Serbia

Books:
Irena Spasić and Predrag Janičić (2000) Theory of Algorithms, Languages and Automata. Faculty of Mathematics, Belgrade, Serbia

Miodrag Ivović, Branislav Boričić, Dragan Azdejković and Irena Spasić (1998) Practice Book in Mathematics. Faculty of Economics, Belgrade, Serbia

Miodrag Ivović, Branislav Boričić, Velimir Pavlović, Dragan Azdejković and Irena Spasić (1996) Mathematics through Examples and Exercises with Elements of Theory. Faculty of Economics, Belgrade, Serbia