<?xml version="1.0" encoding="UTF-8"?>
<?oxygen RNGSchema="../../common/schema/DHQpublish.rng" type="xml"?>
<DHQarticle xmlns="http://digitalhumanities.org/DHQ/namespace" xmlns:cc="http://web.resource.org/cc/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <DHQheader>
        <title>Vive la Différence! Text Mining Gender Difference in French Literature
        </title>
        <author>
            <name>Shlomo
                <family>Argamon</family>
            </name>
            <affiliation>Linguistic Cognition Lab, Dept. of Computer Science, Illinois Institute of Technology</affiliation>
            <email>argamon@iit.edu</email>
            <bio><p>Shlomo Argamon is an associate professor of computer science at the Illinois Institute of Technology, where he is the director of the Linguistic Cognition Laboratory.  He received his B.Sc. in applied mathematics from Carnegie-Mellon University in 1988, his Ph.D. in computer science from Yale University, where he was a Hertz Foundation Fellow, in 1994, and was a Fulbright Fellow at Bar-Ilan University in Israel from 1994 to 1996.  Dr. Argamon's research focuses on the development of computational text analysis techniques, with applications mainly in computational stylistics, authorship attribution, sentiment analysis, and scientometrics.</p></bio>
        </author>
        <author>
            <name>Jean-Baptiste
                <family>Goulain</family>
            </name>
            <affiliation>Linguistic Cognition Lab, Dept. of Computer Science, Illinois Institute of Technology</affiliation>
            <email>jibai31@gmail.com</email>
            <bio><p>Jean-Baptiste Goulain received his diplôme d'ingénieur (2007) in computer science and applied mathematics from École Nationale Supérieure d'Informatique et de Mathématiques Appliquées in Grenoble, France.  During this time, he spent a semester at the Illinois Institute of Technology where he was a member of the Linguistic Cognition Laboratory.  He is currently a student intern at Société Générale bank in New York.</p></bio>
        </author>
        <author>
            <name>Russell
                <family>Horton</family>
            </name>
            <affiliation>Digital Library Development Center, University of Chicago</affiliation>
            <email>russ@diderot.uchicago.edu</email>
            <bio><p>Russell Horton is a research programmer at The ARTFL Project and the Digital Library Development Center at the University of Chicago, where he received his BA in Linguistics in 2002. He works on machine learning and text analysis software for the humanities.</p></bio>
        </author>
        <author>
            <name>Mark
                <family>Olsen</family>
            </name>
            <affiliation>ARTFL Project, University of Chicago</affiliation>
            <email>markymaypo57@gmail.com</email>
            <bio><p>Mark Olsen is the Assistant Director of the ARTFL Project at the University of Chicago.  Mark received his Ph.D. in French history from the University of Ottawa in 1991 and has been involved in digital humanities and computer-aided text analysis since the early 1980s.  His current ambition is to write a biography of the Marquis de Pastoret in candle-light with a quill.</p></bio>
        </author>
        <publicationStmt>
            <idno type="DHQarticle-id">000042</idno>
            <idno type="volume">003</idno>
            <idno type="issue">2</idno>
            <issueTitle>Spring 2009</issueTitle>
            <articleType>article</articleType>
            <date when="2009-06-18">18 June 2009</date>
            <availability>
                <cc:License rdf:about="http://creativecommons.org/licenses/by-nc-nd/2.5/"/>
            </availability>
        </publicationStmt>
        
        <langUsage>
            <language id="en" role="primary"/>
        </langUsage>
        <history>
            <revisionDesc>
                <change when="11-20-08" who="Alyssa">Began encoding</change>
                <change when="12/12/08" who="Melanie Kohnen">added email addresses</change>
                <change when="12/19/08" who="Melanie Kohnen">added bios, teaser</change>
                <change when="01/13/09" who="Melanie Kohnen">added Horton's email address</change>
            </revisionDesc>
        </history>
        <abstract>
            <p>In this study, a corpus of 300 male-authored and 300 female-authored French literary and historical texts is classified for author gender using the Support Vector Machine (SVM) implementation SVMLight, achieving up to 90% classification accuracy. The sets of words that were most useful in distinguishing male and female writing are extracted from the support vectors. The results reinforce previous findings from statistical analyses of the same corpus, and exhibit remarkable cross-linguistic parallels with the results garnered from SVM models trained in gender classification on selections from the British National Corpus. It is found that female authors use personal pronouns and negative polarity items at a much higher rate than their male counterparts, and male authors demonstrate a strong preference for determiners and numerical quantifiers. Among the words that characterize male or female writing consistently over the time period spanned by the corpus, a number of cohesive semantic groups are identified. Male authors, for example, use religious terminology rooted in the church, while female authors use secular language to discuss spirituality. Such differences would take an enormous human effort to discover by a close reading of such a large corpus, but once identified through text mining, they frame intriguing questions which scholars may address using traditional critical analysis methods. </p>
        </abstract>
        <teaser>Patterns of gender difference in historical French texts parallel those in modern English.</teaser>
    </DHQheader>
    <text>
        <head>Vive la Différence! Text Mining Gender Difference in French Literature</head>
        <div>
            <cit><quote rend="block"><p>Amanda Bonner: What I said was true, there's no difference between the sexes. Men, women, the same.</p>
                <p>Adam Bonner: They are?</p>
                <p>Amanda Bonner: Well, maybe there is a difference, but it's a little difference.</p>
                <p>Adam Bonner: Well, you know as the French say...</p>
                <p>Amanda Bonner: What do they say?</p>
                <p>Adam Bonner: Vive la difference!</p>
                <p>Amanda Bonner: Which means?</p>
                <p>Adam Bonner: Which means hurrah for that little difference.</p>
            </quote>
            <ref>Adam's Rib, 1949</ref></cit>
        </div>
        <div>
            <head>Introduction</head>
            <p>Attempts to identify and characterize differences between male and female discourse have used methods such as close reading, sociolinguistic modeling <ptr target="#Tannen1994"/>, statistical analysis <ptr target="#Olsen2004"/>, <ptr target="#Olsen2005"/>, and, more recently, machine learning <ptr target="#Koppel2002"/>, <ptr target="#Argamon2003"/>. The machine learning approach is closely related to purely statistical analysis methods; both approaches exploit differences in aggregate word frequencies to highlight differences between male and female authors in content or style. One advantage of machine learning over simpler forms of statistical analysis lies in its creation of a predictive model of testable accuracy, which can be used to assign gender labels to samples of unknown category or, as in this study, interrogated to reveal the features most useful in such a classification. The resultant weighted wordlists can be used to support or weaken an existing hypothesis about differences between the corpora, or suggest new directions for investigation, whether by additional machine learning or other, more traditional, critical methods.</p>
            <p>This study was based on the same male and female corpora used by Olsen in previous statistical analyses <ptr target="#Olsen2004"/>, <ptr target="#Olsen2005"/>. The female corpus was assembled first, due to the more limited digital collection of women's writing at our disposal. We chose 300 texts, roughly balanced by genre, collection and time period, from among the texts by French women writers available to us. For each of the 300 texts by 67 female authors (18.5 million words), we selected the chronologically closest male document available in that same genre and, when possible, same collection, leading to a comparison corpus of 300 texts by 170 male authors (27 million words). As noted by Olsen <ptr target="#Olsen2004"/>, although these texts range from the 12th to the 20th century, the samples are largely drawn from the 18th to early 20th centuries, with strongest representation in the 19th century, owing to the predominance of romantic novelists in the available collections of female writing. The sample is also skewed by a disproportionate number of works by several notable authors, in particular George Sand, with 77 works. In an effort to avoid this <called>Sand Effect</called>, we also drew two subsets of 92 documents each, one from the male-authored set and one from the female-authored set.</p>
        </div>
        <div>
            <head>Comparison With Previous Research</head>
            <p>Because we are working with the same corpus previously subjected to a purely statistical analysis <ptr target="#Olsen2004"/>, <ptr target="#Olsen2005"/>, we can bring machine learning tools to bear on the questions posed by that work and directly compare our results. Machine learning allows us the possibility of approaching the issue of male and female authorship from a different angle, with a set of metrics for success fundamentally different from those afforded by traditional text analysis methods and statistical inquiries. We ask an SVM model to learn, to the best of its ability, to discriminate between male- and female-authored documents by feeding it labeled examples of each, and applying an algorithm designed to generate predictive models by exploiting generalizable differences in word frequencies between documents in each set. The models give us quantitative feedback regarding their accuracy in their task, and expose their methods by outputting lists of the words which were their input, ranked and weighted as being predictive of one gender or the other. While these metrics do not assure us of an intellectually satisfying outcome from a literary critical viewpoint, they provide a good test of the validity of our process of analysis. </p>
            <p>Because machine learning algorithms are fundamentally rooted in the exploitation of differential distributions of features (in our case, words), we would expect to see many of the same words appear as highly weighted features in our machine learning results that Olsen found to be significant in his statistical analysis. However, we would not expect the lists to be identical because there are additional factors that influence SVM trained weights that are not captured by differential frequency statistics or other statistical measures such as information gain (IG). Differential frequency and IG are innate properties of an individual word's distribution between sub-corpora, whereas an SVM weight has meaning only within the context of a particular model generated by the learning algorithm, and must be considered in relation to the weights of other features in that model. Differential rates and IG may simply be calculated according to a set formula with unvarying results, whereas SVM weights are heuristically assigned and refined by the learning algorithm in a search for maximum performance on the classification problem. </p>
            <p>Information gain and other statistical measures of distribution are commonly used as heuristics for reducing feature set dimensionality and for setting initial weights for machine learning algorithms, but there is no guarantee that all words with highly differential frequencies in the corpora will be assigned high weights by the machine learner in the final model. SVM produces two weighted sets of words, male and female, which, taken together, are maximally effective (to the extent of the ability of the algorithm to produce an optimal solution) at discriminating between texts from the two corpora. Words which might exhibit interesting distributions but which do not fit well into a particular model will not be assigned high weights and will escape our notice. Therefore, it is useful to perform a variety of machine learning runs, find what works, and search for common threads in the results. Ultimately, results must find support from a knowledgeable reading of the texts and be fitted with a critical hypothesis to be of great interest from the literary scholar's point of view, although predictive models may have practical uses, such as adding guessed metadata to unclassified documents, independent of their critical value or validity. </p>
        </div>
        <div>
            <head>Experimental Design</head>
            <p>The machine learning algorithm chosen for this classification task is an SVM implementation called SVMLight <ptr target="#Joachims1999"/>. SVM has proven to be a model well-suited for text classification, and our initial tests showed that SVMLight achieved the best accuracy in classification among learning algorithm implementations at our disposal, including naive Bayesian and decision tree learners. The SVMLight implementation is freely available and includes key capabilities such as cross-validated accuracy measures via leave-one-out estimation and the ability to extract the weights assigned to each feature. The ability to interrogate the model in this way is essential, because without it we would learn nothing about what word usage patterns distinguish male writing from female writing, merely that such a distinction can be learned with a particular degree of accuracy. A black-box model may be adequate for industrial applications, where the goal is to classify unclassified instances with a certain accuracy, but in this experiment, where the correct classification is already known for all texts, we are far more interested in picking apart the constructed model to determine the orientation and magnitude of the weights of individual words.  </p>
            <p>For our preliminary experiments, we prepared eight sets of vectors, comprising the two collections (the full 600-document corpus and the 184-document subset) in four versions each: the surface form of the words, the lemmas, the parts of speech (POS) of the words as assigned by TreeTagger, and a simplified part of speech grouping, with broader categories (POSgroup). Each matrix consisted of either 600 or 184 vectors, labeled with 1 for male-authored and -1 for female-authored documents. For a look at the generic data preparation process for text classification, see <ptr target="#ARTFL2008"/>. </p>
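            <p>As an illustration only, the following minimal sketch shows how labeled documents might be written out in the sparse text format that SVMLight reads, with one line per document giving the label followed by ascending index:value pairs. It is not the code used in this study. The toy documents, the function names, and the use of relative frequency as the feature value are all assumptions made for the example.</p>
            <eg>
```python
from collections import Counter


def build_vocabulary(documents):
    """Map each distinct token to a 1-based feature index (SVMLight indices start at 1)."""
    vocab = sorted({token for tokens, _ in documents for token in tokens})
    return {token: i for i, token in enumerate(vocab, start=1)}


def to_svmlight_line(tokens, label, vocab):
    """Render one document as 'label index:value ...' with indices in ascending order."""
    counts = Counter(t for t in tokens if t in vocab)
    total = sum(counts.values()) or 1
    # Assumption: the feature value is the word's relative frequency in the document.
    pairs = sorted((vocab[t], n / total) for t, n in counts.items())
    features = " ".join(f"{i}:{v:.6f}" for i, v in pairs)
    return f"{label:+d} {features}".rstrip()


# Hypothetical toy documents: (tokens, label), +1 male-authored, -1 female-authored.
documents = [
    (["le", "roi", "le", "vin"], 1),
    (["je", "ne", "sais", "je"], -1),
]
vocab = build_vocabulary(documents)
lines = [to_svmlight_line(tokens, label, vocab) for tokens, label in documents]
print("\n".join(lines))
```
            </eg>
            <p>The same routine could be applied to lemmas or part-of-speech tags in place of surface forms, since only the token sequence changes.</p>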
        </div>
        <div>
            <head>Machine Learning Runs</head>
            <p>We then trained SVMLight on each matrix and obtained, after cross-validation, the accuracies given in Tables 1 and 2. Surface form and lemma accuracies cluster around 85%, which means that overall, the models generated by SVMLight can correctly predict the gender of the author about 85% of the time. This is a significant result and indicates that the model has indeed found generalizable differences between the texts in the two corpora. The differences in accuracy between the surface and lemma forms of the words are insignificant, and the POS and POSgroup accuracy differences are generally quite slight as well. The most notable distinction is that POS/POSgroup accuracies are consistently much lower than word/lemma accuracies. The former hover around 70% accuracy, which we have adopted as the borderline for a significant result on a binary classification problem. 70% accuracy is not a particularly compelling result on a <called>coin-flip</called> problem, because it is only 20 percentage points better than the 50% agreement expected by random chance. Naturally, the more accurate our model is, the more importance we can attach to the words the model weights toward each author gender. </p>
            
            <table id="table01">
                <label>Preliminary results: 2x300 document sample </label>
                <row>
                    <cell role="label">  </cell>
                    <cell role="label">Word</cell>
                    <cell role="label">Lemma</cell>
                    <cell role="label">PoS</cell>
                    <cell role="label">PoSgroup</cell>     
                </row>
                <row>
                    <cell role="label">Male</cell>
                    <cell>88.3%</cell>
                    <cell>87.3%</cell>
                    <cell>73.0%</cell>
                    <cell>69.7%</cell>
                </row>
                <row>
                    <cell role="label">Female</cell>
                    <cell>83.3%</cell>
                    <cell>84.4%</cell>
                    <cell>75.7%</cell>
                    <cell>78.7%</cell>
                </row>
                <row>
                    <cell role="label">All</cell>
                    <cell>85.7%</cell>
                    <cell>85.9%</cell>
                    <cell>74.4%</cell>
                    <cell>74.2%</cell>
                </row>
            </table>
            <table id="table02">
                <label>Preliminary results: 2x92 document sample</label>
                <row>
                    <cell role="label">  </cell>
                    <cell role="label">Word</cell>
                    <cell role="label">Lemma</cell>
                    <cell role="label">PoS</cell>
                    <cell role="label">PoSgroup</cell>
                </row>
                <row>
                    <cell role="label">Male</cell>
                    <cell>91.3%</cell>
                    <cell>92.4%</cell>
                    <cell>73.9%</cell>
                    <cell>73.9%</cell>
                </row>
                <row>
                    <cell role="label">Female</cell>
                    <cell>81.5%</cell>
                    <cell>81.5%</cell>
                    <cell>78.3%</cell>
                    <cell>69.6%</cell>
                </row>
                <row>
                    <cell role="label">All</cell>
                    <cell>86.4%</cell>
                    <cell>87.0%</cell>
                    <cell>76.1%</cell>
                    <cell>71.7%</cell>
                </row>
            </table>
            <p>In order to test whether our accuracies were an artifact of the classifier used, rather than demonstrative of true differences between our corpora, we performed the same experiment but with each document randomly labeled as male or female, regardless of true author gender. Over multiple runs, the classifier never achieved more than 50% accuracy in this random falsification experiment, so we can be confident that SVMLight cannot reliably distinguish between random groupings of documents in this corpus, and that the accuracies reported above reflect real differences between the male- and female-authored texts. </p>
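            <p>The logic of this falsification test can be sketched as follows. This is a toy illustration under stated assumptions, not the procedure run in this study: a simple nearest-centroid classifier stands in for SVMLight, the two-feature documents are synthetic, and all names are hypothetical. The point is only that a classifier scored by leave-one-out estimation should do well on the true labels and fall to roughly chance level once the labels are shuffled.</p>
            <eg>
```python
import random


def nearest_centroid_predict(train_vectors, train_labels, x):
    """Predict the label whose class centroid is closest to x."""
    best_label, best_dist = None, float("inf")
    for cls in sorted(set(train_labels)):
        rows = [v for v, l in zip(train_vectors, train_labels) if l == cls]
        centroid = [sum(col) / len(rows) for col in zip(*rows)]
        dist = sum((a - b) ** 2 for a, b in zip(x, centroid))
        if best_dist > dist:
            best_label, best_dist = cls, dist
    return best_label


def loo_accuracy(vectors, labels):
    """Leave-one-out accuracy: hold out each document in turn and predict it."""
    correct = 0
    for i, (x, y) in enumerate(zip(vectors, labels)):
        rest_v = vectors[:i] + vectors[i + 1:]
        rest_l = labels[:i] + labels[i + 1:]
        correct += nearest_centroid_predict(rest_v, rest_l, x) == y
    return correct / len(labels)


rng = random.Random(0)
# Hypothetical two-feature documents: +1 clusters near (1, 0), -1 near (0, 1).
vectors = [[rng.gauss(1, 0.2), rng.gauss(0, 0.2)] for _ in range(20)]
vectors += [[rng.gauss(0, 0.2), rng.gauss(1, 0.2)] for _ in range(20)]
labels = [1] * 20 + [-1] * 20

true_accuracy = loo_accuracy(vectors, labels)

# Repeat the experiment with labels shuffled independently of the documents.
shuffled_accuracies = []
for _ in range(20):
    shuffled = labels[:]
    rng.shuffle(shuffled)
    shuffled_accuracies.append(loo_accuracy(vectors, shuffled))
mean_shuffled = sum(shuffled_accuracies) / len(shuffled_accuracies)

print(f"true labels: {true_accuracy:.2f}")
print(f"shuffled labels (mean of 20 runs): {mean_shuffled:.2f}")
```
            </eg>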
            <p>We can try to learn from our failures here. The fact that SVMLight cannot construct a very accurate prediction model based on POS vectors is a kind of weak evidence against any theory of gendered authorship that holds that men and women speak radically different languages. If, in fact, men and women used the basic building blocks of language in substantially different ways, we might expect to see strong mechanical differences between male and female writing reflected in POS usage rates that the model could exploit to make accurate classifications. That such differences do not widely obtain in this corpus is strongly suggested by the inability of SVMLight to construct a very accurate model to distinguish between the gendered corpora on that basis. Of course, this does not rule out mechanical and stylistic differences that aren't reflected in the simple metric of POS frequencies, but it does suggest a base level of linguistic similarity between the two classes. </p>
            <p>Based on these initial results, we decided to proceed with further experiments using the surface forms of the words, that being the simplest method and tied for most accurate with the lemmatized forms. All runs cited henceforth were executed within the PhiloMine data mining extensions to the PhiloLogic text search engine <ptr target="#PhiloMine2007"/> and are based on vectors of surface forms; in all cases we achieved an accuracy greater than 70%, most often between 80 and 90%. Now that we were comfortable that the accuracy of our models was high enough to indicate real differences between our corpora, we investigated the internals of those models to determine where they get their predictive power. We began by extracting the weights assigned to each word in the 2 x 300 surface form features SVMLight model, and sorting them in descending order of magnitude. Words oriented toward male authorship are scored as positive decimals, while those pointing toward female authorship are negative decimals. We obtained the weights of the most influential words in the model, given in <ref target="#table05">Table 5</ref>. </p>
            <p>Our first impulse when examining the feature list was to scan for the presence of <called>shibboleth</called> words that trivially identify some subset of works as definitively male- or female-authored, either because they are explicit markers of author gender (such as metadata tags inadvertently retained in the document), or because they are features that occur in only one or a relative handful of works that are homogeneous for author gender. Such terms are gifts to the machine learner, greedily seized upon by our classification model but unlikely to generate any penetrating insight for the scholar. Proper names are the prime example of such features, and we saw several in <ref target="#table05">Table 5</ref>, <emph>Consuelo</emph> being the highest-ranked of these. We eliminated terms like <emph>Consuelo</emph> (present in a number of works by Sand) from the input our model receives by stipulating that we would use only words that occur in more than a certain percentage of documents in the corpora. Constructing new vectors using only words that occur in at least 5% of the documents in the combined male and female corpora, we ran the analysis again and extracted the weights for the words given in <ref target="#table06">Table 6</ref>. <emph>Consuelo</emph> is gone; a few proper names remain lower on the list, but since they occur in at least 5% of all documents, they may be of broad enough interest to retain.  </p>
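            <p>A document-frequency cutoff of this kind can be sketched in a few lines. The example below is a hedged illustration rather than the PhiloMine implementation: the function name, the toy documents, and the 50% threshold used for the tiny sample are assumptions chosen so that the effect is visible.</p>
            <eg>
```python
from collections import Counter


def frequent_vocabulary(documents, min_doc_fraction=0.05):
    """Keep words that occur in at least min_doc_fraction of the documents."""
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each word once per document
    threshold = min_doc_fraction * len(documents)
    return {word for word, df in doc_freq.items() if df >= threshold}


# Hypothetical toy corpus: "consuelo" and "roi" each occur in one document only.
documents = [
    ["elle", "consuelo", "consuelo"],
    ["elle", "le"],
    ["le", "roi"],
    ["elle", "le"],
]
kept = frequent_vocabulary(documents, min_doc_fraction=0.5)
print(sorted(kept))
```
            </eg>
            <p>Counting each word once per document, rather than by total occurrences, is what removes a name that is very frequent within a handful of works but absent from the rest.</p>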
            <p>The highest-ranked words in each category are common function words — pronouns, articles, quantifiers, adpositions, common verb forms of <foreign lang="fr"><emph>être</emph></foreign> and <foreign lang="fr"><emph>avoir</emph></foreign> — likely to occur frequently in texts of either gender. Several patterns are evident. The female preference for pronouns is quite marked; {<foreign lang="fr"><emph>elle, vous, lui, me, ma, moi, mon, il, m', je, toi, tu, votre</emph></foreign>} all appear in the top 200 features weighted toward female authors. This is not an unexpected finding given the observation of Olsen <ptr target="#Olsen2005"/> of a usage rate for these terms among female authors that is nearly 1.5 times that of male authors. Also of note is the female preference for terms of negative polarity: {<foreign lang="fr"><emph>impossible, ne, ni, pas, personne, sans</emph></foreign>}. On the male side, we note the preference for determiners such as {<foreign lang="fr"><emph>un, le, des, du, les, ce, ces, cette</emph></foreign>} and quantifiers such as {<foreign lang="fr"><emph>un, deux, une, quelque(s), mille</emph></foreign>}.   </p>
            <p>These results are striking in that they replicate almost exactly those of a similar analysis of female- and male-authored texts in the British National Corpus (BNC) <ptr target="#Argamon2003"/>. The primary findings of that study were that females tended to use both more personal pronouns such as {<emph>I, you, she, her, their, myself, yourself, herself</emph>} and negative particles such as {<emph>not, no, never</emph>}, and that males used more determiners such as {<emph>a, the, that, these</emph>} and quantifiers such as {<emph>one, two, more, some</emph>}. Although reflexive pronouns are not expressed by a single word in French as they are in English, and hence do not show up distinctly in our analysis, the rest of the findings match almost exactly. The issue of reflexive pronouns might be investigated in subsequent tests by using word bigrams as features rather than, or in addition to, single words. The strong agreement between these two experiments is all the more remarkable for the very different texts involved in these two studies.  Argamon et al. <ptr target="#Argamon2003"/> analyzed 604 documents from the BNC spanning an array of fiction and non-fiction genres from a variety of sources, all in Modern British English (post-1960), whereas the current study looks at predominantly fictional French works from the 12th to the 20th centuries. This cross-linguistic similarity could be supported with further research in additional languages. </p>
            <p>Somewhat lower down the list than the function words, we start to encounter content words, and some of the same phenomena noted by Olsen in his statistical analyses are apparent. {<foreign lang="fr"><emph>aime, aimer, aimable</emph></foreign>} all show up on the female list, which squares with Olsen's observation of a use rate of <emph>aim*</emph> by females at roughly 1.5 times that of males across all genres. In noting the female preference for personal pronouns and emotional language, Olsen argues <cit><quote rend="inline">[female] space may be characterized by a more personal, emotive and interactive frame that is not explained by differences in genre or period</quote> <ptr target="#Olsen2005"/></cit>, and we can support this hypothesis with our machine learning analysis. </p>
            <p>Having found support for previous findings in Argamon <ptr target="#Argamon2003"/> and Olsen <ptr target="#Olsen2005"/>, <ptr target="#Olsen2004"/>, we looked for additional patterns in the heavily weighted terms for each gender. Our corpus spans a wide time range, and we are most interested in discovering patterns that persist across that span. To that end, we split our 600-document combined male- and female-authored corpus into two time range sub-corpora, one comprising all documents from 1100-1799 (244 documents) and one comprising all other documents, spanning 1800-2000 (356 documents). Separate SVMLight training runs were performed on each time range corpus using those words that appeared in at least 20% of all documents in that corpus, and the 500 highest-weighted features for male- and female-authored documents from each period were extracted. Taking the intersection of the two male lists and of the two female lists, we found 153 male and 192 female features that are among the top 500 features for both time period runs. No single text or group of contemporary texts can force the inclusion of any word into these merged lists because each text occurs in only one time range sub-corpus, so inclusion on both lists indicates a widespread and enduring trend in usage. The relatively common words in Table 3 are consistently useful in distinguishing male and female French writing over a wide time range, and must reflect real differences in style or content between the genders in the corpora. </p>
            <table id="table03"> 
                <label>Features appearing in the top 500 highest-weighted in both time range models </label>
                <row>
                    <cell><emph>153 persistent features in Male-authored documents:</emph> <foreign lang="fr">1, a, abord, action, affaire, ajouta, amie, article, au, aura, auteur, autour, autre, aux, avons, bas, bouche, bras, c, capitaine, cent, chacun, chair, champ, charles, chez, christ, ciel, cinq, comment, comtesse, contre, corps, coup, coups, crime, côté, d', des, deux, diable, dis, docteur, doigts, dont, doute, droite, du, entre, est, face, fait, façon, femme, feu, fin, fit, fois, foule, gens, gros, haut, histoire, homme, hé, hôtel, ils, in, jacques, jean, juge, jusqu', la, laquelle, le, les, leurs, ligne, long, lorsque, main, mains, maîtresse, messieurs, mis, mit, moins, monseigneur, monsieur, montre, mot, même, nez, nom, nombre, nos, oeil, oeuvres, ordre, oreille, ou, oui, où, par, passage, pied, pieds, présente, président, prêtre, quatre, quelqu', quelque, quelques, question, qui, quoi, reprit, reste, rue, récit, saint, saints, salut, sang, second, seconde, selon, ses, seulement, simple, sire, soit, sous, sur, table, tirer, tour, toute, trente, trois, un, v, ventre, vers, vieux, village, vin, vingt, voici, y, yeux, à</foreign></cell>
                </row>
                <row>
                    <cell><emph>192 persistent features in Female-authored documents:</emph> <foreign lang="fr">absence, admiration, afin, agréable, ai, aimable, aime, aimer, aller, amitié, amour, anglais, angleterre, auguste, auprès, aurais, avais, avait, avec, avez, avoir, beaucoup, belle, bien, bonheur, bonne, brillante, but, cacher, car, caractère, celle, chagrin, chercher, chère, coeur, comprendre, compte, comte, confiance, conserver, cour, crois, destinée, disant, donner, douceur, douleur, doux, elle, elles, empêcher, encore, enfance, enfant, enfants, entièrement, envie, esprit, espérance, estime, eût, faisait, fallait, faut, fièvre, fleurs, france, frère, fût, gloire, goût, grande, grandes, généreux, henri, hiver, ici, il, imagination, impossible, inquiétude, inspire, inspirer, instant, intérêt, jamais, jardin, jours, liberté, lui, lumières, m, ma, mais, malgré, manière, manières, me, moi, mon, montrer, mère, ne, ni, nécessaire, opinion, parce, parler, parlez, passion, pauvre, pays, personne, personnes, petite, peut, peuvent, plaire, plaisir, pleurs, plusieurs, possible, pourquoi, pourrais, pouvait, prince, princes, princesse, pu, puisque, puissance, père, quand, que, quitter, regarder, reine, repos, retrouver, revenir, roi, sais, sait, sans, savoir, secret, sentiment, sentir, seule, si, son, souffrir, souvenir, souvent, soyez, suis, supporter, surprise, tant, toi, toujours, tous, toutes, trop, trouva, trouver, très, tu, utile, veux, vie, vit, vivre, voir, vois, vos, votre, voulait, voulut, vous, voyage, voyant, véritable, âme, éducation, égard, égards, émotion, épouser, était, êtes</foreign></cell>
                </row>
            </table>
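            <p>The selection of persistent features amounts to intersecting the top-ranked lists from the two time range models. The sketch below illustrates that step under stated assumptions: the weight dictionaries are invented toy values, the function names are hypothetical, and a cutoff of 2 stands in for the 500 used in the study.</p>
            <eg>
```python
def top_features(weights, n, sign):
    """Return the n features with the largest weights of the given sign (+1 male, -1 female)."""
    signed = [(w, f) for f, w in weights.items() if w * sign > 0]
    signed.sort(key=lambda pair: abs(pair[0]), reverse=True)
    return {f for _, f in signed[:n]}


def persistent_features(early_weights, late_weights, n, sign):
    """Features ranked in the top n for the given gender in BOTH time range models."""
    early = top_features(early_weights, n, sign)
    late = top_features(late_weights, n, sign)
    return early.intersection(late)


# Hypothetical model weights: positive leans male, negative leans female.
early_weights = {"le": 0.9, "deux": 0.5, "roi": 0.4, "elle": -0.8, "ne": -0.6, "coeur": -0.2}
late_weights = {"le": 0.7, "deux": 0.1, "rue": 0.6, "elle": -0.9, "ne": -0.3, "vous": -0.5}

male_persistent = persistent_features(early_weights, late_weights, n=2, sign=1)
female_persistent = persistent_features(early_weights, late_weights, n=2, sign=-1)
print(sorted(male_persistent), sorted(female_persistent))
```
            </eg>
            <p>A word that ranks highly in only one period, such as <emph>deux</emph> or <emph>vous</emph> in this toy data, drops out of the merged list.</p>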
            <p>Within the male and female lists, it is possible to identify a number of interesting semantic groupings of words. Reassuringly, the female pronouns and negative polarity items and male quantifiers discussed earlier are still present. In addition, there are a number of other semantic categories of words that appear to cohere: </p>
            <table id="table04"> 
                <label>Subjective thematic groups among the persistent features</label>
                <row role="label"><cell>Enduring Male Terms</cell>
                    <cell>Enduring Female Terms</cell></row>
                <row>
                    <cell>
                        <list type="simple">
                            <item><emph>Quantifiers:</emph> <foreign lang="fr">quelqu', quelque(s)</foreign></item>
                            <item><emph>Religiosity:</emph> <foreign lang="fr">christ, ciel, corps, diable, saint(s), sang(?)</foreign></item>
                            <item><emph>Numericality:</emph> <foreign lang="fr">1, cinq, cent, deux, nombre, quatre, second(e), trois, trente, un, vingt</foreign></item>
                            <item><emph>Anatomy:</emph> <foreign lang="fr">bouche, bras, chair, corps, doigts, face(?), main, nez, pied(s), oeil, oreille, sang, yeux, ventre</foreign></item>
                            <item><emph>Authority:</emph> <foreign lang="fr">capitaine, docteur, juge, président, sire</foreign></item>
                            <item><emph>Other notables:</emph> <foreign lang="fr">action, amie, femme, feu, histoire, homme, maîtresse, rue, salut, vieux, village, vin</foreign></item>
                        </list>  </cell>
                
                    <cell>
                        <list type="simple"><item><emph>Pronouns:</emph> <foreign lang="fr">me, moi, mon, vos, votre, vous</foreign>   </item>
                            <item><emph>Spirituality:</emph> <foreign lang="fr">âme, chercher, coeur, destinée, espérance, esprit, imagination, inspire, inspirer, passion</foreign></item>
                            <item><emph>Quantifiers:</emph> <foreign lang="fr">tous, toutes, (toujours)</foreign></item>
                            <item><emph>Emotion:</emph> <foreign lang="fr">agréable, aimable, aime, aimer, amitié, amour, bonheur, douceur, douleur, doux, émotion, envie, espérance, plaire, plaisir, pleurs, sentiment, sentir, seule</foreign></item>
                            <item><emph>Family:</emph> <foreign lang="fr">enfant(s), épouser, frère, mère, père</foreign></item>
                            <item><emph>Nobility:</emph> <foreign lang="fr">prince(s), princesse, reine, roi</foreign></item>
                            <item><emph>Negatives:</emph> <foreign lang="fr">impossible, ne, ni, pas, personne, sans</foreign></item>
                            <item><emph>Other notables:</emph> <foreign lang="fr">éducation, impossible, inquiétude, gloire, liberté, lumières, opinion, pauvre, possible, puissance, quitter, sais, sait, savoir, secret, seule, souffrir, souvenir, supporter, surprise, vivre, voyage, voyant, voulait, voulut</foreign>  </item>
                        </list>   
                    </cell>
                </row>
            </table>
            <p>The number of strongly cohesive thematic groupings that can be constructed from the features highly weighted in both time periods suggests that male and female writers in the corpus differ markedly in topic selection. Although the identification of these persistent themes marks the endpoint of this machine learning analysis of the corpus, the themes themselves form a natural starting point for a scholar interested in pursuing the differences between male and female writing from a traditional literary-critical viewpoint. It would be interesting, for example, to explore why male authors favor religious terminology rooted in the church, whereas female authors more often discuss spirituality in personal, more secular language. Similarly, why do so many anatomical terms rank at the very top of the male-weighted features, and are they literal expressions of physicality or rooted in metaphorical usage? These thematic groupings cannot be taken as definitive, universal statements about gendered authorship, but they are clearly identifiable trends that provide a snapshot of some basic differences between male and female authors, while suggesting potentially fruitful areas for further analysis, whether computer-assisted or traditional. Scholars intrigued by these questions could narrow the context for a close reading by refining the text mining analysis, focusing on questions such as which authors and works best exemplify the discovered trends, and which provide exceptions and counter-examples.</p>
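            <p>The notion of a feature that persists across both time periods can be illustrated with a minimal sketch. It assumes that a persistent feature is one ranking among the n most heavily weighted words for the same gender in each period; that cutoff rule, the function names, and the sample weights are hypothetical and are not taken from the analysis itself.</p>

```python
# Hypothetical word weights for two time periods (illustrative values only).
# Sign convention follows the tables in this article: positive = male, negative = female.
early = {"qui": 3.0, "saint": 0.55, "main": 0.5, "tems": 0.6,
         "elle": -4.2, "roi": -0.6, "coeur": -0.3}
late = {"qui": 2.8, "saint": 0.3, "main": 0.1, "francs": 0.35,
        "elle": -4.0, "roi": -0.1, "coeur": -0.4}


def top_features(weights, n):
    """Return (male, female): the n most positive and n most negative words.

    Assumes the model has at least n positive and n negative weights.
    """
    ranked = sorted(weights, key=weights.get)  # ascending by weight
    return set(ranked[-n:]), set(ranked[:n])


def persistent_features(first, second, n):
    """Words in the top n for the same gender in both periods, alphabetically."""
    male_1, female_1 = top_features(first, n)
    male_2, female_2 = top_features(second, n)
    return (sorted(male_1.intersection(male_2)),
            sorted(female_1.intersection(female_2)))


male, female = persistent_features(early, late, 3)
print("male:", male)
print("female:", female)
```

            <p>In this toy input, <q>tems</q> and <q>francs</q> each rank highly in only one period and so drop out, which mirrors how period-specific spellings and vocabulary would be filtered from a list of persistent features.</p>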
        </div>
        <div>
            <head>Conclusion</head>
            <p>Our research demonstrates the utility of support vector machine models for finding contrasting features of male and female writing: the trained models can be interrogated to identify patterns of word usage that distinguish the gendered corpora. We found little advantage to using lemmatized forms of words as our features and a significant disadvantage to using parts of speech, and therefore used the surface forms of the words for the bulk of our research, achieving classification accuracies between 80% and 90%. Among the words found to be most useful in distinguishing male and female writing, several distinct functional and semantic groupings were identified. The more personal and emotional frame of reference found in female authors' writing by Olsen in his statistical analysis of the same corpus was supported by our machine learning models. The marked male preference for determiners and the female preference for personal pronouns and negative polarity items were particularly promising findings, as they closely echo previous work by Argamon et al. <ptr target="#Argamon2003"/> on a different corpus in a different language (excerpts from the English-language British National Corpus). Among the other patterns we identified were a number of cohesive semantic groupings of words that were consistently highly weighted towards males or females across the wide time range of the corpus, such as anatomical and religious terms favored by males, and familial and emotional vocabulary favored by females. A close, contextual reading of a corpus of this magnitude could occupy a dedicated scholar for a lifetime or more, with no guarantee that such trends would be salient enough to be noticed. Through the use of machine learning techniques, we can efficiently analyze vast swathes of text and achieve results that are interesting and enlightening both in and of themselves and as a spur to further research using other critical methods.</p>
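            <p>As a rough illustration of what interrogating a trained linear model can look like, the sketch below ranks a weight vector and pairs the most male-weighted words with the most female-weighted ones, scaling each weight by 10,000 as in the tables that follow. The function name and the tiny vocabulary are hypothetical; the raw weights are simply the first few table values divided by 10,000, and nothing here reproduces the actual toolchain used in the study.</p>

```python
SCALE = 10_000  # the tables report weights multiplied by 10,000

# Hypothetical vocabulary and raw weights; the values are table entries
# divided by 10,000, used here only as sample input.
vocab = ["qui", "un", "elle", "ne", "vous"]
raw_weights = [0.0003032, 0.0002706, -0.0004270, -0.0002768, -0.0002256]


def weight_table(words, weights, k):
    """Pair the k most male (positive) and k most female (negative) words.

    Assumes at least k weights of each sign are present.
    """
    ranked = sorted(zip(words, weights), key=lambda pair: pair[1])
    female = ranked[:k]            # most negative first
    male = ranked[::-1][:k]        # most positive first
    return [(m, round(mw * SCALE, 3), f, round(fw * SCALE, 3))
            for (m, mw), (f, fw) in zip(male, female)]


rows = weight_table(vocab, raw_weights, 2)
for row in rows:
    print(row)
```

            <p>Each resulting row has the same four columns as the tables below: male word, scaled weight, female word, scaled weight. Positive weights indicate male-authored text and negative weights female-authored text.</p>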
            
            <p>
                <!-- <figure id="table03"><graphic url="resources/images/table03.jpg"/><figDesc>Screenshot of the table of male and female features</figDesc><caption>Table 2. Weights have been scaled to 10,000 times their original values for ease of reading</caption></figure> -->
                <table id="table05"><label>Weights have been scaled to 10,000 times their original values for ease of reading.</label>
                    <row role="label"><cell>Male Features</cell><cell></cell><cell>Female Features</cell><cell></cell></row>
<row role="label"><cell>Word</cell><cell>Weight</cell><cell>Word</cell><cell>Weight</cell></row>
<row><cell>qui</cell><cell>3.032</cell><cell>elle</cell><cell>-4.270</cell></row>
<row><cell>un</cell><cell>2.706</cell><cell>ne</cell><cell>-2.768</cell></row>
<row><cell>à</cell><cell>2.568</cell><cell>vous</cell><cell>-2.256</cell></row>
<row><cell>le</cell><cell>2.512</cell><cell>pas</cell><cell>-1.812</cell></row>
<row><cell>des</cell><cell>2.392</cell><cell>et</cell><cell>-1.594</cell></row>
<row><cell>du</cell><cell>1.993</cell><cell>avec</cell><cell>-1.435</cell></row>
<row><cell>les</cell><cell>1.847</cell><cell>mais</cell><cell>-1.433</cell></row>
<row><cell>au</cell><cell>1.598</cell><cell>lui</cell><cell>-1.365</cell></row>
<row><cell>monsieur</cell><cell>1.396</cell><cell>était</cell><cell>-1.346</cell></row>
<row><cell>est</cell><cell>1.302</cell><cell>si</cell><cell>-1.245</cell></row>
<row><cell>deux</cell><cell>1.264</cell><cell>avait</cell><cell>-1.178</cell></row>
<row><cell>de</cell><cell>1.250</cell><cell>me</cell><cell>-1.127</cell></row>
<row><cell>sur</cell><cell>1.033</cell><cell>ma</cell><cell>-1.069</cell></row>
<row><cell>a</cell><cell>0.953</cell><cell>pour</cell><cell>-0.952</cell></row>
<row><cell>homme</cell><cell>0.884</cell><cell>sans</cell><cell>-0.811</cell></row>
<row><cell>par</cell><cell>0.867</cell><cell>moi</cell><cell>-0.794</cell></row>
<row><cell>ce</cell><cell>0.746</cell><cell>consuelo</cell><cell>-0.779</cell></row>
<row><cell>madame</cell><cell>0.690</cell><cell>quand</cell><cell>-0.779</cell></row>
<row><cell>d'</cell><cell>0.656</cell><cell>bien</cell><cell>-0.702</cell></row>
<row><cell>une</cell><cell>0.594</cell><cell>roi</cell><cell>-0.676</cell></row>
<row><cell>ces</cell><cell>0.590</cell><cell>l'</cell><cell>-0.666</cell></row>
<row><cell>ses</cell><cell>0.586</cell><cell>il</cell><cell>-0.614</cell></row>
<row><cell>dont</cell><cell>0.566</cell><cell>beaucoup</cell><cell>-0.570</cell></row>
<row><cell>quelque</cell><cell>0.554</cell><cell>n'</cell><cell>-0.560</cell></row>
<row><cell>femme</cell><cell>0.535</cell><cell>henri</cell><cell>-0.543</cell></row>
<row><cell>ils</cell><cell>0.528</cell><cell>m'</cell><cell>-0.535</cell></row>
<row><cell>où</cell><cell>0.511</cell><cell>jamais</cell><cell>-0.523</cell></row>
<row><cell>tems</cell><cell>0.496</cell><cell>reine</cell><cell>-0.513</cell></row>
<row><cell>charles</cell><cell>0.493</cell><cell>je</cell><cell>-0.482</cell></row>
<row><cell>ou</cell><cell>0.487</cell><cell>princesse</cell><cell>-0.479</cell></row>
<row><cell>autre</cell><cell>0.451</cell><cell>toujours</cell><cell>-0.470</cell></row>
<row><cell>aux</cell><cell>0.449</cell><cell>car</cell><cell>-0.465</cell></row>
<row><cell>yeux</cell><cell>0.429</cell><cell>ai</cell><cell>-0.462</cell></row>
<row><cell>main</cell><cell>0.417</cell><cell>votre</cell><cell>-0.459</cell></row>
<row><cell>fit</cell><cell>0.392</cell><cell>esprit</cell><cell>-0.453</cell></row>
<row><cell>leurs</cell><cell>0.386</cell><cell>avais</cell><cell>-0.447</cell></row>
<row><cell>quelques</cell><cell>0.384</cell><cell>m</cell><cell>-0.444</cell></row>
<row><cell>leur</cell><cell>0.380</cell><cell>personne</cell><cell>-0.430</cell></row>
<row><cell>cette</cell><cell>0.379</cell><cell>albert</cell><cell>-0.419</cell></row>
<row><cell>fait</cell><cell>0.379</cell><cell>temps</cell><cell>-0.400</cell></row>
<row><cell>après</cell><cell>0.374</cell><cell>mon</cell><cell>-0.393</cell></row>
<row><cell>avois</cell><cell>0.374</cell><cell>bonne</cell><cell>-0.383</cell></row>
<row><cell>reste</cell><cell>0.363</cell><cell>être</cell><cell>-0.381</cell></row>
<row><cell>mille</cell><cell>0.355</cell><cell>dans</cell><cell>-0.379</cell></row>
<row><cell>même</cell><cell>0.327</cell><cell>ça</cell><cell>-0.371</cell></row>
<row><cell>saint</cell><cell>0.326</cell><cell>se</cell><cell>-0.365</cell></row>
<row><cell>fille</cell><cell>0.324</cell><cell>liberté</cell><cell>-0.364</cell></row>
<row><cell>francs</cell><cell>0.309</cell><cell>la</cell><cell>-0.360</cell></row>
<row><cell>tout</cell><cell>0.307</cell><cell>âme</cell><cell>-0.356</cell></row>
<row><cell>lettre</cell><cell>0.299</cell><cell>très</cell><cell>-0.356</cell></row>
<row><cell>étoit</cell><cell>0.298</cell><cell>enfants</cell><cell>-0.349</cell></row>
<row><cell>entre</cell><cell>0.287</cell><cell>peut</cell><cell>-0.347</cell></row>

                </table>
                <table id="table06"><label>Weights have been scaled to 10,000 times their original values for ease of reading.</label>
                    <row role="label"><cell>Male Features</cell><cell></cell><cell>Female Features</cell><cell></cell></row>
<row role="label"><cell>Word</cell><cell>Weight</cell><cell>Word</cell><cell>Weight</cell></row>
<row><cell>qui</cell><cell>3.043</cell><cell>elle</cell><cell>-4.291</cell></row>
<row><cell>un</cell><cell>2.716</cell><cell>ne</cell><cell>-2.780</cell></row>
<row><cell>à</cell><cell>2.578</cell><cell>vous</cell><cell>-2.265</cell></row>
<row><cell>le</cell><cell>2.522</cell><cell>pas</cell><cell>-1.820</cell></row>
<row><cell>des</cell><cell>2.400</cell><cell>et</cell><cell>-1.599</cell></row>
<row><cell>du</cell><cell>2.000</cell><cell>avec</cell><cell>-1.441</cell></row>
<row><cell>les</cell><cell>1.856</cell><cell>mais</cell><cell>-1.439</cell></row>
<row><cell>au</cell><cell>1.603</cell><cell>lui</cell><cell>-1.366</cell></row>
<row><cell>monsieur</cell><cell>1.400</cell><cell>était</cell><cell>-1.348</cell></row>
<row><cell>est</cell><cell>1.305</cell><cell>si</cell><cell>-1.250</cell></row>
<row><cell>deux</cell><cell>1.269</cell><cell>avait</cell><cell>-1.179</cell></row>
<row><cell>de</cell><cell>1.252</cell><cell>me</cell><cell>-1.127</cell></row>
<row><cell>sur</cell><cell>1.037</cell><cell>ma</cell><cell>-1.072</cell></row>
<row><cell>a</cell><cell>0.956</cell><cell>pour</cell><cell>-0.956</cell></row>
<row><cell>homme</cell><cell>0.888</cell><cell>sans</cell><cell>-0.814</cell></row>
<row><cell>par</cell><cell>0.870</cell><cell>moi</cell><cell>-0.795</cell></row>
<row><cell>ce</cell><cell>0.749</cell><cell>quand</cell><cell>-0.782</cell></row>
<row><cell>madame</cell><cell>0.690</cell><cell>bien</cell><cell>-0.706</cell></row>
<row><cell>d'</cell><cell>0.657</cell><cell>roi</cell><cell>-0.679</cell></row>
<row><cell>une</cell><cell>0.597</cell><cell>l'</cell><cell>-0.668</cell></row>
<row><cell>ces</cell><cell>0.592</cell><cell>il</cell><cell>-0.621</cell></row>
<row><cell>ses</cell><cell>0.587</cell><cell>beaucoup</cell><cell>-0.572</cell></row>
<row><cell>dont</cell><cell>0.568</cell><cell>n'</cell><cell>-0.564</cell></row>
<row><cell>quelque</cell><cell>0.555</cell><cell>henri</cell><cell>-0.549</cell></row>
<row><cell>femme</cell><cell>0.537</cell><cell>m'</cell><cell>-0.536</cell></row>
<row><cell>ils</cell><cell>0.530</cell><cell>jamais</cell><cell>-0.526</cell></row>
<row><cell>où</cell><cell>0.513</cell><cell>reine</cell><cell>-0.515</cell></row>
<row><cell>tems</cell><cell>0.498</cell><cell>je</cell><cell>-0.483</cell></row>
<row><cell>charles</cell><cell>0.495</cell><cell>princesse</cell><cell>-0.481</cell></row>
<row><cell>ou</cell><cell>0.488</cell><cell>toujours</cell><cell>-0.471</cell></row>
<row><cell>autre</cell><cell>0.452</cell><cell>car</cell><cell>-0.466</cell></row>
<row><cell>aux</cell><cell>0.450</cell><cell>ai</cell><cell>-0.462</cell></row>
<row><cell>yeux</cell><cell>0.430</cell><cell>votre</cell><cell>-0.460</cell></row>
<row><cell>main</cell><cell>0.418</cell><cell>esprit</cell><cell>-0.455</cell></row>
<row><cell>fit</cell><cell>0.394</cell><cell>avais</cell><cell>-0.447</cell></row>
<row><cell>leurs</cell><cell>0.387</cell><cell>m</cell><cell>-0.445</cell></row>
<row><cell>quelques</cell><cell>0.386</cell><cell>personne</cell><cell>-0.431</cell></row>
<row><cell>cette</cell><cell>0.381</cell><cell>albert</cell><cell>-0.420</cell></row>
<row><cell>leur</cell><cell>0.381</cell><cell>temps</cell><cell>-0.402</cell></row>
<row><cell>fait</cell><cell>0.380</cell><cell>mon</cell><cell>-0.392</cell></row>
<row><cell>après</cell><cell>0.375</cell><cell>bonne</cell><cell>-0.385</cell></row>
<row><cell>avois</cell><cell>0.375</cell><cell>être</cell><cell>-0.380</cell></row>
<row><cell>reste</cell><cell>0.364</cell><cell>dans</cell><cell>-0.378</cell></row>
<row><cell>mille</cell><cell>0.356</cell><cell>ça</cell><cell>-0.375</cell></row>
<row><cell>même</cell><cell>0.329</cell><cell>se</cell><cell>-0.366</cell></row>
<row><cell>saint</cell><cell>0.327</cell><cell>liberté</cell><cell>-0.365</cell></row>
<row><cell>fille</cell><cell>0.325</cell><cell>la</cell><cell>-0.358</cell></row>
<row><cell>francs</cell><cell>0.311</cell><cell>très</cell><cell>-0.358</cell></row>

                </table>
                <!-- <figure id="table04"><graphic url="resources/images/table04.jpg"/><figDesc>Screenshot of the table of male and female features</figDesc><caption>Table 3. Weights have been scaled to 10,000 times their original values for ease of reading.</caption></figure> -->
            </p>
        </div>
    </text>
    <listBibl>
        <bibl id="Argamon2003"><label>Argamon 2003</label>Argamon, S., Koppel, M., Fine, J. and Shimoni, A. <title rend="quotes">Gender, Genre, and Writing Style in Formal Written Texts</title>. <title rend="italic">Text</title> 23(3), August 2003.</bibl>
        <bibl id="ARTFL2008"><label>ARTFL 2008</label>ARTFL Technical Report: <title rend="quotes">Creating Vectors for Text Classification Machine Learning,</title> <ref target="http://artfl.uchicago.edu/TechReports/VectorsForTextClassification">http://artfl.uchicago.edu/TechReports/VectorsForTextClassification</ref></bibl>  
        <bibl id="Joachims1999"><label>Joachims 1999</label>Joachims, T., <title rend="quotes">Making Large-Scale SVM Learning Practical</title>. <title rend="italic">Advances in Kernel Methods - Support Vector Learning</title>, B. Schölkopf, C. Burges, and A. Smola (eds.). MIT Press, 1999.</bibl>
        <bibl id="Koppel2002"><label>Koppel 2002</label>Koppel, M., Argamon, S., and Shimoni, A., <title rend="quotes">Automatically Categorizing Written Texts by Author Gender</title>. <title rend="italic">Literary and Linguistic Computing</title> 17:4 (2002): 401-12.</bibl>
        <bibl id="Olsen2004"><label>Olsen 2004</label>Olsen, Mark. <title rend="quotes">Making Space: Women's Writing in France, 1600-1950</title>, ALLC/ACH 2004 Conference, Göteborg, Sweden. A slightly earlier version of this talk was presented to COCH/COSH 2004, Annual Congress of the Social Sciences and Humanities, Winnipeg, Manitoba.</bibl>
        <bibl id="Olsen2005"><label>Olsen 2005</label>Olsen, Mark. <title rend="quotes">Écriture féminine: Searching for an Indefinable Practice?</title>. <title rend="italic">Literary and Linguistic Computing</title> 20, 2005, pp. 147-164.</bibl>
        <bibl id="PhiloMine2007"><label>PhiloMine 2007</label>The ARTFL Project, <ref target="http://philologic.uchicago.edu/philomine/rationale.html">http://philologic.uchicago.edu/philomine/rationale.html</ref></bibl>
        <bibl id="Stein"><label>Stein</label>Achim Stein, The University of Stuttgart. <ref target="http://www.uni-stuttgart.de/lingrom/stein/forschung/resource.html">http://www.uni-stuttgart.de/lingrom/stein/forschung/resource.html</ref></bibl>
        <bibl id="SVMLight"><label>SVMLight</label>Joachims, T., Cornell University. <ref target="http://svmlight.joachims.org/">http://svmlight.joachims.org/</ref></bibl>
        <bibl id="Tannen1994"><label>Tannen 1994</label>Tannen, Deborah. <title rend="quotes">Gender and Discourse</title>. New York: Oxford University Press, 1994.</bibl>
        <bibl id="TreeTagger"><label>TreeTagger</label>Institute for Computational Linguistics of the University of Stuttgart. <ref target="http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/">http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/</ref></bibl>
    </listBibl>
</DHQarticle>

