Digitala Vetenskapliga Arkivet

AUTOMATED GENDER CLASSIFICATION IN WIKIPEDIA BIOGRAPHIES: a cross-lingual comparison
Umeå University, Faculty of Science and Technology, Department of Computing Science.
2019 (English). Independent thesis, basic level (degree of Bachelor), 10 credits / 15 HE credits. Student thesis (degree project)
Abstract [en]

The written word plays an important role in the reinforcement of gender stereotypes, especially in texts of a more formal character. Wikipedia biographies contain a great deal of information about famous people, but do they describe men and women with different kinds of words? This thesis aims to evaluate and explore a method for gender classification of text. In this study, two machine learning classifiers, Random Forest (RF) and Support Vector Machine (SVM), are applied to the gender classification of Wikipedia biographies in two languages, English and French. Their performance is evaluated and compared. The 500 most important words (features) are listed for each of the classifiers.

A short review is given of the theoretical foundations of text classification, along with a detailed description of how the datasets are built, what tools are used, and why. The datasets are built from the first 5 paragraphs of each biography, with only nouns, verbs, adjectives and adverbs remaining. Feature ranking is also applied, where the top tenth of the features are kept.

Performance is measured using the F0.5-score. The comparison shows that the RF and SVM classifiers perform similarly, but that both perform worse on the French set than on the English one. Initial performance scores range from 0.82 to 0.86, but they drop drastically when the most important features are removed from the set. A majority of the most important features are nouns related to career and family roles, in both languages.

The results show that there are indeed some semantic differences in language depending on the gender of the person described. Whether these depend on the writers' biased views, on an unequal gender distribution in real-world contexts such as careers, or on how the datasets were built, is not clear.
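The F0.5-score named in the abstract is the F-beta measure with beta = 0.5, which weights precision more heavily than recall. The sketch below shows the standard formula computed from confusion counts. The function name `f_beta` and the example counts are illustrative assumptions; they are not taken from the thesis, and the thesis may have computed the score with a library routine instead.

```python
def f_beta(tp: int, fp: int, fn: int, beta: float = 0.5) -> float:
    """Return the F-beta score from true positives, false positives and
    false negatives. With beta < 1, precision counts more than recall."""
    if tp == 0:
        # No true positives: precision and recall are both zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Made-up counts for illustration only (not results from the thesis).
score = f_beta(tp=80, fp=10, fn=30)
print(f"F0.5 = {score:.3f}")
```

Setting `beta=1.0` in the same function gives the more familiar F1-score, so the two measures can be compared on identical counts.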

Place, publisher, year, edition, pages
2019, p. 60
Series
UMNAD ; 1191
National Category
Engineering and Technology
Identifiers
URN: urn:nbn:se:umu:diva-163371
OAI: oai:DiVA.org:umu-163371
DiVA, id: diva2:1351984
Educational program
Bachelor of Science Programme in Computing Science
Available from: 2019-09-17. Created: 2019-09-17. Last updated: 2019-09-17. Bibliographically approved

Open Access in DiVA

fulltext (444 kB), 384 downloads
File information
File name: FULLTEXT01.pdf
File size: 444 kB
Checksum (SHA-512): 2d578fa1964e993d9cba7bdd61c773ba27ade5cc0dc04f93a40cb558db0a7814c1deebd547fc0e6ae53c5462fc9a91eece26c170284d59549bbf8b824a59c93e
Type: fulltext
MIME type: application/pdf
