RAPPORT: A Fact-Based Question Answering System for Portuguese

Rodrigues, Ricardo Manuel da Conceição

Please use this identifier to cite or link to this item: https://hdl.handle.net/10316/41880

Title:	RAPPORT: A Fact-Based Question Answering System for Portuguese
Authors:	Rodrigues, Ricardo Manuel da Conceição
Advisors:	Machado, Fernando; Gomes, Paulo
Keywords:	Natural Language Processing; Question Answering
Issue Date:	15-Nov-2017
Citation:	RODRIGUES, Ricardo Manuel da Conceição - RAPPORT: a fact-based question answering system for Portuguese. Coimbra : [s.n.], 2017. Doctoral thesis. Available at: http://hdl.handle.net/10316/41880
Abstract:	Question answering is one of the longest-standing problems in natural language processing. Although natural language interfaces for computer systems have become more common, the same cannot yet be said of access to specific textual information. Any full-text search engine can easily retrieve documents containing user-specified or closely related terms, but it is typically unable to answer user questions with small passages or short answers. The difficulty with question answering is that text is hard to process, because of its syntactic structure and, to a greater degree, its semantic content. At the sentence level, although the syntactic aspects of natural language follow well-known rules, the size and complexity of a sentence may make its structure difficult to analyze. Semantic aspects remain arduous to address, with text ambiguity being one of the hardest problems to handle. The question must also be processed correctly in order to define its target, and the answers found in a text must be selected and processed. Additionally, the selected text that may yield the answer to a given question must be further processed in order to present just a passage instead of the full text. These issues also take longer to address in languages other than English, such as Portuguese, which have far fewer people working on them.

This work focuses on question answering for Portuguese. In other words, our interest lies in presenting short answers, passages, and possibly full sentences, but not whole documents, in response to questions formulated in natural language. For that purpose, we have developed a system, RAPPort, built on open information extraction techniques that extract triples, so-called facts, characterizing the information in text files; the system then stores these facts and uses them to answer user queries posed in natural language.

These facts, in the form of subject, predicate and object, alongside other metadata, constitute the basis of the answers presented by the system. Facts work both by storing short and direct information found in a text, typically entity-related information, and by containing in themselves the answers to the questions, already in the form of small passages. As for the results, although there is room for improvement, they are a tangible proof of the adequacy of our approach and its different modules for storing information and retrieving answers in question answering systems.

In the process, in addition to contributing a new approach to question answering for Portuguese and validating the application of open information extraction to question answering, we have developed a set of tools that has been used in other natural language processing work, such as a lemmatizer, LemPORT, which was built from scratch and has high accuracy. Most of these tools result from improving those found in the Apache OpenNLP toolkit, by pre-processing their input, post-processing their output, or both, and by training models for use in those tools or in others, such as MaltParser. Other tools include interfaces to resources containing, for example, synonyms, hypernyms, and hyponyms, and rule-based lists of, for instance, relations between verbs and agents.
Description:	Doctoral thesis in Information Science and Technology, presented to the Department of Informatics Engineering of the Faculty of Sciences and Technology of the University of Coimbra
URI:	https://hdl.handle.net/10316/41880
Rights:	openAccess
Appears in Collections:	FCTUC Eng.Informática - Teses de Doutoramento

Files in This Item:

File	Description	Size	Format
RAPPORT.pdf		2.15 MB	Adobe PDF

This item is licensed under a Creative Commons License
