An investigation of linguistic problems in automatic multi-document summaries / Uma investigação de problemas linguísticos em sumários automáticos multidocumento

Márcio de Souza Dias; Ariani Di Felippo; Amanda Pontes Rassi; Paula Christina Figueira Cardoso; Fernando Antônio Asevedo Nóbrega; Thiago Alexandre Salgueiro Pardo

doi:10.17851/2237-2083.29.2.859-907

Abstract: Automatic summaries commonly present diverse linguistic problems that affect textual quality and thus their understanding by users. Few studies have tried to characterize such problems and their relation to the performance of summarization systems. In this paper, we investigated the problems in multi-document extracts (i.e., summaries produced by concatenating several sentences taken exactly as they appear in the source texts) generated by systems for Brazilian Portuguese that have different approaches (i.e., superficial and deep) and performance levels (i.e., baseline and state-of-the-art methods). To that end, we first reviewed the main characterization studies, which resulted in a typology of linguistic problems more suitable for multi-document summarization. Then, we manually annotated a corpus of automatic multi-document extracts in Portuguese based on the typology, which showed that some linguistic problems are significantly more recurrent than others. This corpus annotation may thus support research on the detection and correction of linguistic problems for summary improvement, allowing the production of automatic summaries that are not only informative (i.e., they convey the content of the source material) but also linguistically well structured.

Keywords: automatic summarization; multi-document summary; linguistic problem; corpus annotation.
