Plenary Lecture, ARTIFICIAL INTELLIGENCE, KNOWLEDGE ENGINEERING and DATA BASES (AIKED '09), Cambridge, UK, February 21-23, 2009


Advanced Methods for Text Retrieval

Professor Ioana Moisil
Co-author: Lucian Blaga
Department of Computer Science and Automatic Control
Hermann Oberth Faculty of Engineering
Lucian Blaga University of Sibiu
Blvd. Victoriei 10, 550024 Sibiu

Abstract: Information retrieval (IR) is one of the most challenging fields of study. Today we define information retrieval as the interdisciplinary science of searching for documents, for information within documents and for documents' metadata, in databases and on the World Wide Web. For many decades IR was concerned with finding the needed information in large collections of text documents. The explosive growth in the use of digital multimedia, from images and graphics to audio and video files, over the Internet and wireless communications or stored on DVDs and CD-ROMs, has driven the development of a specific sub-field of IR, namely multimedia information retrieval. In this lecture I will focus only on text retrieval methods. The retrieval unit is the document, and documents from a collection are extracted to form the text database. When we are searching the Web, the documents are Web pages. The retrieval process is simple to state but far from trivial: a user issues a query; the retrieval system finds a set of documents relevant to the user's query; the selected documents are ranked by relevance score. The process can be tuned manually or automatically. The most widely used query format is a list of keywords, but other formats can be used: Boolean queries, phrase queries, proximity queries, full documents, and natural language questions. Several aspects turn the retrieval process from an apparently simple one into a very complex and challenging one. First, the tremendous success of the Web has transformed it into the most important information source, overshadowing traditional and digital libraries. That means we have to retrieve information from a countable but practically unbounded collection of documents. For example, the number of pages in Google's index is growing at an amazing rate: it started with 26 million pages in 1998 and had reached a trillion pages last year.
So the first challenge is to use retrieval methods that give a rapid response. The second challenge is linked to relevance and to the concern of reducing information overload. We will refer to these aspects throughout the presentation.
In the first part of this lecture I will critically discuss the most widely used information retrieval models, from those based on set theory (the standard and extended Boolean models, the fuzzy retrieval model) to more recent ones: algebraic models (the vector space model – VSM and its extensions: TVSM, latent semantic analysis, term discrimination, the DSIR model). Some probabilistic models will also be presented.
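As an illustration of the query, retrieve and rank process in the vector space model, the following is a minimal sketch rather than the method presented in the lecture. It assumes whitespace tokenization, a TF-IDF weighting with a smoothed IDF, and cosine similarity as the relevance score; the function names and the three toy documents are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase + whitespace split stands in for real pre-processing.
    return text.lower().split()


def tfidf_vectors(docs):
    # Build one sparse TF-IDF vector (a dict term -> weight) per document.
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    # Assumption: IDF smoothed by +1 so terms found in every document keep weight.
    idf = {t: math.log(n_docs / df) + 1.0 for t, df in doc_freq.items()}
    vectors = [
        {t: count * idf[t] for t, count in Counter(toks).items()}
        for toks in tokenized
    ]
    return vectors, idf


def cosine(u, v):
    # Cosine similarity between two sparse vectors; 0.0 if either is empty.
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank(query, docs):
    # Return (score, document index) pairs, best score first.
    vectors, idf = tfidf_vectors(docs)
    query_counts = Counter(tokenize(query))
    # Query terms that never occur in the collection are ignored.
    q = {t: c * idf[t] for t, c in query_counts.items() if t in idf}
    scores = [(cosine(q, v), i) for i, v in enumerate(vectors)]
    return sorted(scores, key=lambda s: (-s[0], s[1]))


DOCS = [
    "boolean retrieval model with set operations",
    "vector space model for text retrieval",
    "probabilistic ranking of web pages",
]
ranking = rank("vector space retrieval", DOCS)
for score, index in ranking:
    print(f"{score:.3f}  {DOCS[index]}")
```

With this toy collection the second document, which contains all three query terms, is ranked first, and the third document, which shares no term with the query, receives a score of zero.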
The second part of the lecture will discuss relevance feedback and performance measures. In the third part, the need for pre-processing of text and Web pages will be emphasised.
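The abstract does not name the performance measures it covers. As an assumption, the sketch below uses precision, recall and the F1 score, which are the standard set-based measures in information retrieval; the function names and document identifiers are hypothetical.

```python
def precision_recall(retrieved, relevant):
    # precision = relevant retrieved / all retrieved
    # recall    = relevant retrieved / all relevant
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    hits = len(retrieved_set & relevant_set)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall


def f1_score(precision, recall):
    # Harmonic mean of precision and recall; 0.0 when both are zero.
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Hypothetical run: the system returned four documents, three are truly relevant.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d2", "d4", "d7"])
f1 = f1_score(p, r)
print(p, r, f1)
```

Here two of the four retrieved documents are relevant (precision 2/4) and two of the three relevant documents were retrieved (recall 2/3).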
In place of conclusions, we will discuss the impact on text retrieval of two innovative technologies, the semantic Web and Web services, and of the Web 2.0 paradigm.

Brief Biography of the Speaker:
Ioana Moisil received the M.Sc. in Mathematics from the University of Bucharest in 1971, the scientific grade in Statistical, Epidemiological and Operation Research Methods Applied in Public Health and Medicine from the Université Libre de Bruxelles, Belgium, in 1991, and the Ph.D. in Mathematics from the Romanian Academy in 1997. She has worked at the National Institute for Research & Development in Informatics - I.C.I (1971-1986), at the Carol Davila Faculty of Medicine Bucharest, Department of Biophysics, and at the CCSSDM Center of the Ministry of Health. At present she is a full-time Professor and a Senior Researcher at the Department of Computer Science and Automatic Control, Faculty of Engineering, "Lucian Blaga" University of Sibiu. She is the author or co-author of fourteen books and over 150 scientific papers. Her scientific interests include intelligent systems, healthcare telematics, web technologies, data mining, e-learning, modelling and simulation, uncertainty management, and human-computer interaction. Professor Moisil has participated in several EU-funded projects as project manager for the national partner (Telenurse ID ENTITY, MGT, PROPRACTITION, PRO-ACCESS), in Tempus projects, and in nationally funded projects as research manager and software development coordinator (INFOSOC – eUNIV, AMTRANS – eCASTOR, INFOSOC - e-Scribe, INFOSOC – DANTE, e-EDU-Quality, eTransMobility, CNCSIS 2007 – code 33, Studies on multivariate interpolation, polynomial classifiers and applications, CNCSIS 2007 – code 1502, Aspects concerning the psycho-cognitive abilities of artificial intelligent agents and applications in ITC based education). Her current research focuses on information retrieval, meta-heuristics, and advanced classification methods.
Ioana Moisil is a member of EARLI (European Association for Research in Learning and Instruction), the Romanian representative in the IMIA SIG and EFMI WG5 Nursing Informatics, and an honorary member of the Bohemian Medical Association J.E.Purkyne of Bio-engineering and Medical Informatics. She is also a member of the ISCB – International Society for Clinical Biostatistics – Romanian National Group, of the Romanian Association of Engineers, of the IITM - International Institute of Tele-Medicine, and of the Romanian Society of Mathematics Sciences. She is vice-president of the Romanian Medical Informatics Society, vice-president of the HIT Foundation for Health Informatics and Telematics, and a member of RoCHI-ACM. Professor Moisil serves on several international peer-review committees and conference scientific boards.


