Digitala Vetenskapliga Arkivet

AUTOMATED GENDER CLASSIFICATION IN WIKIPEDIA BIOGRAPHIES: a cross-lingual comparison
Umeå University, Faculty of Science and Technology, Department of Computing Science.
2019 (English). Independent thesis, Basic level (degree of Bachelor), 10 credits / 15 HE credits. Student thesis
Abstract [en]

The written word plays an important role in the reinforcement of gender stereotypes, especially in texts of a more formal character. Wikipedia biographies contain a lot of information about famous people, but do they describe men and women with different kinds of words? This thesis aims to evaluate and explore a method for gender classification of text. In this study, two machine learning classifiers, Random Forest (RF) and Support Vector Machine (SVM), are applied to the gender classification of Wikipedia biographies in two languages, English and French. Their performance is evaluated and compared. The 500 most important words (features) are listed for each of the classifiers.

A short review is given of the theoretical foundations of text classification, along with a detailed description of how the datasets are built, what tools are used, and why. The datasets are built from the first 5 paragraphs of each biography, with only nouns, verbs, adjectives and adverbs remaining. Feature ranking is also applied, where the top tenth of the features is kept.

Performance is measured using the F0.5-score. The comparison shows that the RF and SVM classifiers perform similarly, but that both perform worse on the French set than on the English one. Initial performance scores range from 0.82 to 0.86, but they drop drastically when the most important features are removed from the set. A majority of the most important features are nouns related to career and family roles, in both languages.

The results show that there are indeed some semantic differences in language depending on the gender of the person described. Whether these depend on the writers' biased views, on an unequal gender distribution in real-world contexts such as careers, or on how the datasets were built is not clear.
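
The abstract reports performance as an F0.5-score (the F-beta score with beta = 0.5) without defining it. As a minimal sketch, assuming the standard definition F_beta = (1 + beta^2) * P * R / (beta^2 * P + R), the score can be computed as below. The function names, the toy labels and the choice of "f" as the positive class are illustrative assumptions, not taken from the thesis.

```python
def precision_recall(y_true, y_pred, positive):
    """Precision and recall for one class, counted from paired labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def f_beta(precision, recall, beta=0.5):
    """Weighted harmonic mean of precision and recall.

    beta < 1 weights precision more heavily than recall;
    beta = 1 gives the ordinary F1-score.
    """
    b2 = beta * beta
    denom = b2 * precision + recall
    if denom == 0:
        return 0.0
    return (1 + b2) * precision * recall / denom


# Toy labels (hypothetical): "f" and "m" stand for the two classes.
y_true = ["f", "f", "f", "f", "m", "m"]
y_pred = ["f", "f", "m", "m", "m", "m"]

p, r = precision_recall(y_true, y_pred, positive="f")
print(f"precision={p:.2f} recall={r:.2f}")
print(f"F0.5={f_beta(p, r, 0.5):.3f}  F1={f_beta(p, r, 1.0):.3f}")
```

In this toy case precision is perfect while recall is only one half, so the F0.5 value comes out higher than F1, which illustrates how the measure rewards precision over recall.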

Place, publisher, year, edition, pages
2019. p. 60
Series
UMNAD ; 1191
Identifiers
URN: urn:nbn:se:umu:diva-163371, OAI: oai:DiVA.org:umu-163371, DiVA id: diva2:1351984
Educational program
Bachelor of Science Programme in Computing Science
Available from: 2019-09-17. Created: 2019-09-17. Last updated: 2019-09-17. Bibliographically approved

Open Access in DiVA

fulltext (444 kB), 384 downloads
File information
File name: FULLTEXT01.pdf. File size: 444 kB. Checksum SHA-512:
2d578fa1964e993d9cba7bdd61c773ba27ade5cc0dc04f93a40cb558db0a7814c1deebd547fc0e6ae53c5462fc9a91eece26c170284d59549bbf8b824a59c93e
Type: fulltext. Mimetype: application/pdf

Total: 384 downloads
The number of downloads is the sum of all downloads of all full texts. It may include, for example, earlier versions that are no longer available.

Total: 330 hits