Sujeito oculto às claras: uma abordagem descritivo-computacional / Omitted subjects revealed: a quantitative-descriptive approach

Cláudia Freitas; Elvis de Souza

doi:10.17851/2237-2083.29.2.1033-1058

Abstract

Abstract: This paper presents descriptive and computational studies of omitted subjects. First, we carry out a quantitative description based on three corpora of journalistic, literary and encyclopedic texts. Specifically, we quantify omitted subjects in each corpus and find them in 24%, 41% and 46% of clauses, respectively. Second, using a rule-based strategy, we reconstitute these subjects and reinsert them into the corpora, in order to assess how much subject omission affects the automatic learning of syntactic dependencies. The results indicate that formally reconstituting the subject can improve the learning of syntactic dependencies by up to 2% on the CLAS metric, which highlights the relevance of linguistic modeling for machine learning.

Keywords: linguistic description; omitted subject; subject omission; syntactic dependencies; computational linguistics; machine learning; corpus linguistics.
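
As a rough illustration of the kind of count the abstract describes, the sketch below estimates how many verbs in a CoNLL-U treebank have no subject dependent. It is not the authors' procedure: the function names, the toy sample and the choice to count `VERB` tokens (rather than clauses) without an `nsubj` or `csubj` dependent are all assumptions made for this example, and such a proxy would also count verb forms that never take an overt subject.

```python
from typing import Iterator, List, Tuple

# Assumption: a verb "has a subject" if some token attaches to it with one of
# these Universal Dependencies relations (subtypes such as nsubj:pass included).
SUBJECT_RELATIONS = ("nsubj", "csubj")


def read_sentences(conllu_text: str) -> Iterator[List[List[str]]]:
    """Yield each sentence as a list of 10-column CoNLL-U token rows."""
    sentence: List[List[str]] = []
    for line in conllu_text.splitlines():
        line = line.strip()
        if not line:  # a blank line closes the current sentence
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):  # sentence-level comment
            continue
        cols = line.split("\t")
        # Skip multiword-token ranges (e.g. "1-2") and empty nodes (e.g. "1.1").
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        sentence.append(cols)
    if sentence:
        yield sentence


def count_subjectless_verbs(conllu_text: str) -> Tuple[int, int]:
    """Return (verbs without a subject dependent, total verbs)."""
    without = total = 0
    for sent in read_sentences(conllu_text):
        # IDs of tokens that govern at least one subject relation.
        heads_with_subject = {
            cols[6] for cols in sent
            if cols[7].split(":")[0] in SUBJECT_RELATIONS
        }
        for cols in sent:
            if cols[3] == "VERB":
                total += 1
                if cols[0] not in heads_with_subject:
                    without += 1
    return without, total


def row(idx: int, form: str, upos: str, head: int, deprel: str) -> str:
    """Build one CoNLL-U row, leaving unused columns as '_' (hypothetical helper)."""
    return "\t".join([str(idx), form, "_", upos, "_", "_", str(head), deprel, "_", "_"])


# Toy sample: "Ela chegou." (overt subject) and "Cheguei cedo." (omitted subject).
SAMPLE = "\n".join([
    row(1, "Ela", "PRON", 2, "nsubj"),
    row(2, "chegou", "VERB", 0, "root"),
    row(3, ".", "PUNCT", 2, "punct"),
    "",
    row(1, "Cheguei", "VERB", 0, "root"),
    row(2, "cedo", "ADV", 1, "advmod"),
    row(3, ".", "PUNCT", 1, "punct"),
    "",
])

without, total = count_subjectless_verbs(SAMPLE)
print(f"{without} of {total} verbs have no subject dependent")
```

A count of this kind only gives an upper bound on omitted subjects, so a fuller analysis would still need rules to separate genuinely omitted subjects from verb forms that do not license one.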

Full Text:

PDF (Portuguese (Brazil))
