Modeling for Keyword Extraction : using Natural Language Processing methods for keyword extraction in Portal

This article aims to evaluate the application of two efficient automatic methods for keyword extraction used by Corpus Linguistics and Natural Language Processing communities for generating keywords from literary texts: WordSmith Tools and Latent Dirichlet Allocation (LDA). These tools have their own specificities and are based on different extraction techniques; thus an analysis focused on their performance was required. This article aims to understand how each method works and to evaluate them when applied to extract keywords from literary works. To this end, we used human DOI: 10.17851/2237-2083.23.3.695-726 RELIN, v. 23, n. 3, 2015. 696 analysis, with knowledge of the field of the texts used. The LDA method was used for extracting keywords through its integration with Portal Min@s: Corpora de Fala e Escrita, a general corpora-processing system, designed for different research in corpus linguistics. The experiment outcomes confirm the effectiveness of WordSmith Tools and LDA in extracting keywords from literary corpus. They also show that human analysis of the lists is required at a stage prior to experiments to complement the automatically generated list, crossing WordSmith Tools and LDA results, and that the linguistic intuition of a human analyst about the lists generated separately by the two methods in this study was more favorable to the use of the WordSmith Tools keyword list.

topics, respectively, which are approximations to search words, helping search engines.
Although there are separate analyses of the techniques used in each method, there is no study comparing methods for keyword extraction in literary works, which is the focus of this paper.There is no evaluation of topic extraction methods for this scenario either.The application of these methods to corpora of fictional texts, for example, must take into account some issues, given the nature of prose fiction.Such corpora, if comprised of novels, in addition to being larger than corpora of academic texts, for example, do not allow prior definition of a fixed vocabulary used in the texts, which happens only with prior knowledge of each text.
This article main goal is to compare two efficient automatic methods for keyword extraction used by CL and NLP communities for generating keywords from literary texts: WordSmith Tools keyword extraction and LDA keyword extraction.A secondary objective was to define whether these methods are effective and accurate when applied to the literary genre.
As both tools have their own specificities and are based on different extraction techniques, an analysis focused on their performance was required.To this end, we have applied human analysis, with knowledge of the field of the texts used.The LDA method was used for keyword extraction from its integration with Portal Min@s: Corpora de Fala e Escrita, a general corpora-processing system, designed for different research in corpus linguistics.The Portal has a web interface, which facilitated the efficient analysis of the data presented in this paper.
This text is organized as follows.Section 2 presents the theoretical foundation for this paper, including the tools, methods and corpora used.Section 3 details the experiments.Section 4 presents a discussion of the outcomes.Finally, Section 5 brings the final considerations.

Theoretical Foundation
This section provides details on the aforementioned methods, including a presentation of Portal Min@s, which shows the implementation of one of the extraction methods, LDA.In Section 2.1, we describe the methods for keyword extraction commonly used in the two areas of research: CL and NLP.In Section 2.2, the corpus used for the experiments.In Section 2.3, Portal Min@s, illustrating the use of the LDA method.

Methods for Keyword Extraction
Keywords are defined as a set of terms in a document that provides a summary of its content to readers (LIU et al., 2010).Several areas, such as Information Retrieval, CL and NLP, make use of keyword extraction for various tasks such as document categorization, clustering and summarization.Several methods for keyword extraction have been proposed and different software implemented.Popular methods and software include WordSmith Tools, TF-IDF, Latent Semantic Analysis (LSA), Latent Dirichlet Allocation (LDA) and WMatrix.
WordSmith Tools is a software for Windows that is easy to install and user-friendly.It has three main features: concordancing, keyword extraction, and word listing.To enable extraction, you must first generate the word lists of the texts.WordSmith Tools generates keywords when comparing the list of frequent words of two corpora, being the first corpus called "study corpus" and the second, "reference corpus".The reference corpus is used only to allow comparison, setting an average frequency at which the terms normally occur.Any terms present in the study corpus that show a significant change from the normal frequency set based on the reference corpus, for more and for less, will be highlighted as keywords (SARDINHA, 2000, p. 8).The only requirement of the tool is that the reference corpus is greater than the study corpus.According to Sardinha (2000, p. 12), it is ideal that the reference corpus is at least five times the size of the study corpus because it was observed that smaller reference corpus generates a smaller number of keywords, while those five times greater than the study corpus tend to maintain a reasonable number of keywords.The advantage of using WordSmith Tools is that this method is vastly used by corpus linguists, thus simplifying the process of comparing results on different papers.It is also worth noting that WordSmith Tools uses mature statistical methodology for keyword extraction.
Other popular approach for keyword extraction is TF-IDF (MANNING; SCHÜTZE, 2000) is a statistical method used to show how important a word is for a text within a corpus.It is the result of multiplying two frequencies of words: the term frequency (TF) and the inverse document frequency (IDF).In practical terms, it compares the frequency at which a word occurs in a text with the frequency at which the same word occurs in the corpus as a whole, enabling the creation of a ranking of frequency of words in a text based on their frequency in the larger whole.It is a very practical method of finding the weight and importance of the words of a text in a specific field of study.It also has the advantage of not relying on a stop-words list, since the most frequent words in the corpus texts have less weight and are last in the ranking of importance.The main advantage of TF-IDF is its simplicity and easiness to implement in software, which is ideal for tool developers and experimenters with background in programming languages.
LSA (LANDAUER, et al., 1998) is a method that measures the level of similarity of words and excerpts of texts after analyzing a corpus.It assumes that words with the same meaning, or a similar one, must appear in similar contexts.The method uses a mathematical technique called Singular Value Decomposition (SVD) to build a matrix of word occurrence in the paragraphs and corpus texts, enabling to determine the weight of each word in a text.After creating the occurrence matrix, the LSA must find a minimum version of this matrix to reduce the analysis complexity, minimize noise, and unify the occurrence of words of similar meaning, including the frequency of these words in texts corpus.This method is also commonly known as LSI (Latent Semantic Indexing).It is indicated as suitable for uses in which part of the keywords are already known before the extraction takes place.
LDA, inspired in LSA, is a probabilistic model for generating topics.On this model, Blei (2012) states: "To this end, machine learning researchers have developed probabilistic topic modeling, a suite of algorithms that aim to discover and annotate large archives of documents with thematic information."(p.77).That is, this kind of model aims to analyze a corpus and define, through probabilistic algorithms, the topic addressed in the text.According to Blei (2012, p. 78), the idea behind LDA is that a text deals with multiple subjects, and these subjects can be represented by topics.A topic is formally defined as a distribution of words of a fixed vocabulary, and each document displays its topics in different proportions.
The method has been defined by Blei, Ng and Jordan (2003) as a hierarchical Bayesian model, which seeks to highlight the content of a document by representing its topics, generated through efficient approximation techniques.The document is seen as a mixture of topics following some probability distribution.The method seeks to associate each sentence with at least one topic.The same sentence can be associated with more than one topic.In this case, an estimate is made on the relation of the sentence with each topic in question.All calculations are made based on the bag of words of the document.The LDA method is applied for retrieving information in a first phase that clusters words by topic.In a second phase, the documents are clustered according to their topics (TDK TECHNOLOGIES, 2015).LDA main advantage is its ability to cluster keywords in topics.
Finally, WMatrix is a web platform that works in any system or environment, and which uses the so-called Matrix method.This method makes statistical comparisons in corpora, generating accurate keywords (RAYSON, 2002).It was developed based on a comparison of corpus analysis tools to provide a solution that has all the required and appropriate features for editing and extracting information from a corpus.The idea is not to eliminate the need for further human analysis, but to refine the most the initial analysis and allow a human being to examine a much larger volume of texts with less effort.According to Rayson (2002, p. 153), its main applications are: in the study of vocabulary according to social contexts; contrast native and non-native English speakers; and in the semantic analysis of documents that require software engineering.Wmatrix uses 21 semantic categories 7 to classify a text, including the Portuguese language and others.The categories are comparable to the topics that LSA and LDA raise automatically, but are fixed.Works like McIntyre and Walker (2010) prove the effectiveness of WMatrix in corpora analysis of poetry and dramatic texts, but as far as we know, there are no empirical studies on its application to long literary texts such as novels.Thus, one of WMatrix advantages relies on the fact it has been applied to keywords extraction in less common genres such as poetry and dramatic texts.
Considering the presented methods and tools, the scope of research work was adjusted to compare WordSmith Tools and LDA, so the analysis of other strategies has been left for future works.WordSmith Tools was selected for being widely used in the linguistic community, considered as a classic tool for extracting keywords.LDA was selected for bringing an innovative approach based on topics, which allows a comparative analysis of its strengths and limitations in relation to a classic strategy.

Corpora
The corpora described in this section were the basis for the experiments described in Section 3. Two categories of corpora were used: study (literary) and reference (to support keyword extraction).
The corpus of literary texts (study corpus) used in the experiments described here was compiled with translations of Heart of Darkness, a Joseph Conrad novel published in 1899, with each version in Portuguese produced by a different translator (Mark Santarrita, in 1996; Regina Regis Junqueira and Hamilton Trevisan, in 1984).The corpus was digitalized from their printed versions in books.The numbers of tokens and sentences in these three documents are shown in Table 1.

Table 1 -Tokens and sentences by translation
The reference corpus used was PLN-BR GOLD, which comprises 1,024 texts and 338,441 tokens and was compiled in the PLN-BR project (MUNIZ et al., 2007), having only news and coverages for which Folha de S.Paulo (a Brazilian newspaper) has republishing rights.The size of this corpus represents 1% of the largest corpus in the PLN-BR project, the PLN-BR FULL, in order to proportionally preserve the distribution of this largest corpus.It is a stratified random sample proportional to the distribution of the PLN-BR FULL corpus, which covers the years 1994-2005, related to texts published by newspaper Folha de S.Paulo.

Portal Min@s
Portal Min@s 8 intends to provide a unified and systematized computational base for processing corpus compiled and made available for linguistics research.The system was motivated by the recent expansion of research in corpus linguistics, which brought large amounts of corpus demanding robust processing.The Portal, as well as other efforts, seek to meet this demand.The differential in relation to other tools is its generalist approach, seeking to act in different contexts, based on the type of corpus (for example, annotated or not) and task (for example, translation and lexicography) or area of study in corpus linguistics.The tool is free and publicly available to all institutions interested in its use, requiring only a web-hosting server.The advantages of offering a web system are the possibility of simultaneous remote access by multiple users, without the need of installation in each user's computer.
Several features have been implemented, given the generalist focus of the Portal, treating different types of corpus, different languages and different research in corpus linguistics.Examples of features include generation of concordances, alignments for parallel texts and extraction of keywords, focus of this paper.Unlike corpus processing portals that work with a fixed list of corpus (DAVIES, 2005(DAVIES, , 2009)), the Portal allows the creation of corpus on demand and offers a module to import previously existing corpus, provided that they meet the format requirements of the Portal.
The Portal is based on eight core modules for general features to access corpora, and six support modules, created for managing and importing corpora.The core modules are: concordances, alignments, statistics and frequencies, keywords (focus of this paper), annotations, and multimodal corpus.In addition to the core modules, the Portal also has supporting modules, responsible for managing users, as well as importing and managing corpora, namely: importing module, text manager, corpus manager, subcorpus manager, tag manager and user manager.
In particular, the module used in this work allows automatic keyword generation.In this task, the Portal acts as an interface for the LDA-C tool (BLEI, 2003), an implementation of the LDA method with only command line interface available.During the extraction process, the user must inform the number of topics to be extracted.With a single topic, the LDA is similar to other methods used in keyword extraction software.From two or more topics on, however, the tool clusters keywords into coherent topics using statistical techniques.
The keyword extraction module has a practical interface and does not require any technical knowledge from the user, who can perform automatic extraction of texts with only three parameters: the number of topics, the number of keywords for each topic and the analysis text, as illustrated in Figure 1.An optional fourth parameter is a customized list of stop words that can be sent by the user; although Portal Min@s has a standard list for Portuguese, English and Spanish, the user has the option of using their own list to perform extraction in a specific text.The list of extracted keywords is displayed on the screen, as shown in Figure 2, but it is also available for download in plain text format.

Experiments
The idea of the study was to generate keywords through WordSmith Tools and LDA (Section 3.1), and then make comparisons between the keywords generated by analyzing which of them are common and which are specific to each method, and thus understand how each method works and their effectiveness for literary texts (Section 3.2).
In the experiments described in this article, we used only 800 of the 1,024 texts of the reference corpus, PLN-BR GOLD, enough to obtain a subcorpus approximately 5 times greater than the study corpus, following Sardinha's recommendations (2000).The subcorpus had 226,722 tokens and 17,002 sentences.
To apply the methods and compare their keyword extraction, we carried out studies with the literary corpus described in Section 2.2.

Keyword generation via WordSmith Tools and LDA
WordSmith generates a list of keywords from each corpus' list of word frequencies.The generated list has both positive and negative keywords.Positive keywords are the words of the study corpus that occur more frequently than the average set by the reference corpus, and negative keywords are those below the average rate.For this study, we have considered only positive keywords.Thus, three documents of keywords were generated: one for each translation.The first translation had 61 keywords, the second had 66 and the third had 60.
For the second method, the Latent Dirichlet Allocation (LDA), we have used implementation in language C, provided by one of the authors, called LDA-C (BLEI, 2003).The version also brings a Python script, which helps generating the list.The existing documentation is not complete; therefore, some details of the tool operation are not clear.As a result, this experiment sought to use as parameters the default values recommended by the documentation.
WordSmith and Portal Min@s generate slightly different frequency lists due to variations in the tokenization process.For this reason, we made some changes in the experiments to use the WordSmith frequency list as input to LDA.Two scripts have been developed to adapt the output to the format: where [M] is the number of single terms in the document, and the number associated with each term means how many times the term occurred in the document (BLEI, 2003).These generated files enable to run LDA-C and generate output files with the keywords.In an exploratory approach, two topics were defined for extraction through the LDA method.An additional script was made to post-process the output and generate the files to be submitted for human analysis.
After applying the tools in the three case studies, 9 files were obtained: • 3 files containing lists of keywords generated by WordSmith Tools, one for each translation; and • 6 files containing the topics and their keywords generated by LDA, two files for each translation.
From these 9 files, it was possible to cross data and generate new results to refine the final analysis.Two new processes were performed: the intersection between the lists of keywords generated by each tool separately, highlighting in which case study each keyword occurs (TABLE 2), as well the intersection between the final values of each tool (TABLE 3).The next step was to analyze these results and come up with a hit rate for each tool.To this end, the candidates generated were analyzed by an expert in the respective text.This analysis sought to determine whether the tools studied are accurate enough to extract keywords from a literary corpus.

Analysis
The analysis of keyword candidates started with the lists generated by the Keywords module of WordSmith Tools 6.0.The method comprises the analysis of the combined keywords in all the three translations.The keywords were automatically combined into a single list using a Python script implemented specifically to this task.
In sum, the translations provided 93 unique words, 37 of which were common to all three translations.The keywords were mostly lexical, except for parecia(m) (the verb 'to seem' in Portuguese), forms of address and proper names.This list of words is shown in Table 5

(see APPENDIX A).
The list from the LDA keywords for the two topics of each text contained 249 unique words, 78 of which were common to both topics of all three translations.The keywords were mostly lexical, except for parecia(m) (the verb 'to seem' in Portuguese), forms of address and proper names.This list of words is shown in Table 6 (see APPENDIX A).
For the analysis of candidates for keywords in the lists generated by LDA, we decided to use an extra approach due to the difficulty in naming a topic for each of the two lists generated for each text, given the similarity of the lists and adversity of words listed.The methodology used in the analysis is described below: 1) the words were clustered into microtopics; 2) some of these topics were discarded because they are more related to textual characteristics (for example, the topic description of actions, relations and utterances that integrated verbs; the topic that comprised first-person pronouns, and topics that we could not identify because they are conjunctions, interjections etc.); and 3) microtopics were clustered into macrotopics, resulting in two macrotopics, although we were not sure whether some of the words they integrated were relevant.
The results of the analysis of lists generated by LDA are presented below.
Topics 1, 5 and 9 were disregarded.Microtopics 2, 4, 7 and 8 were clustered into a "human nature" macrotopic (34 words in total) and microtopics 3 and 6 were clustered into a 'physical nature' macrotopic (42 words in total).Based on the number of words, we can say that the 'physical nature' macrotopic is predominant in this list.
Microtopics 1 and 5 were disregarded.The microtopics 2, 4, 7 and 8 (34 words) were reclustered into a macrotopic called 'human nature' and microtopics 3 and 6 (35 words) under the 'physical nature' macrotopic.The two macrotopics have a balanced number of words in this list without predominance of any of them.This list has words not listed in the previous one and the candidates for keywords, according to their frequency, could vary.An example is 'Kurtz', the most representative character of human degradation at the end of the colonization in the 19 th century.
The results show that the words clustered into the "physical nature" macrotopic, comparing the two lists, Topic 1 and Topic 2, have the most occurrences, confirming the dominance of this topic in the text.Words of higher frequency in this list could also be the candidates for keywords.
Topics 10, 14 and 18 were disregarded.Microtopics 11, 13, 16 and 17 (33 words in total) were clustered into a "human nature" macrotopic and microtopics 12 and 15 (36 words in total) were clustered into a 'physical nature' macrotopic.It can be argued that the 'physical nature' macrotopic, with the largest number of words in this list, would be the predominant topic of the list.
The results show that the "human nature" macrotopic, comparing the two lists, Topic 1 and Topic 2, has a word more than the 'physical nature' macrotopic.In this list, we can say that the two topics are balanced, so the predominance of one or the other is not confirmed.Words of higher frequency in this list could also be the candidates for keywords.

HOD_Santarrita96
For HOD_Santarrita96_tópico1, with 98 words, 15  The results show that the "human nature" macrotopic, comparing the two lists, Topic 1 and Topic 2, has a word more than the 'physical nature' macrotopic.In this list, it can be said that the 'physical nature' topic is predominant.Words of higher frequency in this list could also be the candidates for keywords.
Section 4 discusses the results obtained by applying both methods on the literary corpus, and comparing their performance based on a quality-oriented analysis.

Discussion of Results
The experiment of human analysis of the lists generated by the keyword module (WordSmith) selected 37 keywords from the first list of 93 words, and 78 from the second list of 249 words.Most candidates for keywords in these lists include lexical words like nouns and adjectives, among others, with some occurrences of grammatical words such as the verb parecer (to seem).The occurrence of inflected words from this lemma as keywords was expected, since the version of this lemma in English, seem, was also noted by Stubbs (2005) as highly frequent in the word list of the novel in English.Returning to the lists of the experiment of human analysis, among the candidates are the 10 keywords presented in Table 4, illustrating the results obtained when considering the two tools used in the first experiment; for example, Kurtz, protagonist whose search and story is told by the sailor-narrator, and ivory, object of travels during the colonization and exploration of continents such as Africa by some of the empires in the 18 th and 19 th centuries.It also presents words like cabeça (head) and olhos (eyes) that indicate a fragmented representation of the native population found in the description of the narrator (MAGALHÃES; ASSIS, 2009).However, this framework does not provide list words such as parecia (seemed / singular), pareciam (seemed / plural), escuridão (darkness), among others, indicators of key topics in the texts, such as uncertainty (STUBBS, 2005) and difficulty to understand what you see in a very different violent clash of cultures with a very distinct power position.The experiment of human analysis of word lists generated by the LDA method allowed the clustering of these words into very general topics, such as physical and human nature.It also pointed to the fact that the words found in lists in which the physical nature topic predominates would most likely be keyword candidates.On the one hand, the two general topics, physical nature and human nature, permeate the work, which can be interpreted as a representation of the Western man's confrontation with the devastation of human and physical natures in a context of colonization.For example, some words from these lists coincide with some of those presented in Table 4, such as Kurtz, pilgrims, earth and ivory.On the other hand, these general topics leave out words generated by the LDA lists (qualquer [any], nenhuma [none], parecia [seemed], etc.), also constituents of other more specific topics, but fundamental in the novel, such as the uncertainty and difficulty of understanding the landscape in the novel.Other words such as escuridão (dark), trevas (darkness), sombra (shadow) are key indicators of the specific topics are not necessarily inferred from general topics either.ics.

<0}
We can also say that if there are corpus-based research results which show that conjunctions have a low frequency in novels, an investigation of the occurrence of words such as '(no) entanto' ('however'), 'porém' ('but'), 'embora' ('although') (the latter if it is in fact a conjunction according to the context), the LDA topics lists could be justified in order to verify the adversative conjunction function in this novel.
On Section 5, final remarks on the study are made.

Final remarks
The first aim of this study was to compare the extraction of keywords from a literary corpus using two popular tools, with each of them having a specific operation and efficiency proven in other types of texts and documents.In short, we applied the WordSmith Tools and LDA in three different translations of the same literary work.The WordSmith Tools method compares the text word frequency with a reference average frequency and determines keywords according to the words whose frequency is notably larger.LDA, on the other hand, makes a probabilistic analysis of the text word frequency and generates keyword divided into topics.The lists generated by each tool were concatenated, facilitating the analysis of the results.
The second aim was validate the results of the aforementioned experiment, resulting from an experiment of human analysis of keyword lists obtained with the two tools that would indicate the keyword candidates.In this experiment, we can conclude that for long texts, such as novels, the human analysis of lists is necessary at a stage prior to experiments to complement the list automatically generated, crossing the results of the tools.A suggestion for a future experiment would be to anticipate the human analysis procedure, i.e., before the integration of the results obtained automatically with the two tools, or even before setting the parameters that allow running LDA, generating more specific and robust results for larger literary texts or corpora.
The results indicate that both methods can be applied to the literary text.However, we observed that human analysis and refining of the methods are crucial.We should also mention that the linguistic intuition of human analysts on examining lists generated by each of the two methods in this experiment was more favorable to the use of WordSmith Tools keyword lists.
In this paper, two topics were automatically generated by LDA, which tended to cluster keywords into two macrotopics from a linguistic point of view: human and physical nature.Future research work include: (a) use LDA extraction with three computational topics and analyze the macrotopics obtained; (b) use a larger number of computational topics to see if the resulting lists tend to be clustered according to the linguistic microtopics obtained in the human analysis; and (c) include new methods and tools in the analysis, such as WMatrix.

Figure 1 -
Figure 1 -Interface of the keyword extraction tool of Portal Min@s

Figure 2 -
Figure 2 -Example of keyword extraction with display of results

Table 2 -
Number of keywords generated in each translation (case) for each tool

Table 3 -
Keywords generated by each tool, considered as a whole Table4presents the 10 best candidates for keywords, i.e. those words generated by both tools and that occur in the three translations (3 cases).

Table 4 -
The 10 best candidates for keywords