Extracting Knowledge for Cultural Heritage Knowledge Base Population
2013 (English). Doctoral thesis, monograph (Other academic)
The entity-oriented description of the world is a major current trend, motivated by the need for semantic services that support people in finding information, learning and discovering new knowledge, and broadening their existing knowledge horizons. Entities, managed in semantic knowledge bases, have the potential to be the backbone of these new and innovative services. Automatically extracting facts from various data sources and populating knowledge bases is therefore the challenge studied in this work.
This thesis proposes methods for knowledge extraction in the cultural heritage domain. Extracting knowledge from cultural heritage metadata is by no means a trivial task, and there are often problems with missing or ambiguous information. Therefore, an inherent part of this work is dedicated to developing pattern-based techniques that extract knowledge from natural language documents to complement and supplement the knowledge extracted from metadata. However, the proposed framework is not limited to working in conjunction with metadata extraction; it also supports an independent, continuous mode of operation, i.e. patterns learned during extraction are subsequently used to mine new knowledge.
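The continuous mode described above, in which learned patterns feed further extraction, can be illustrated with a minimal bootstrapping sketch. This is not the thesis implementation: the function name `bootstrap`, the use of the text between two entity mentions as a "pattern", and the capitalised-phrase heuristic for spotting entities are all assumptions made for illustration.

```python
import re

# Hypothetical heuristic: an entity mention is a run of capitalised words.
NAME = r"([A-Z]\w+(?: [A-Z]\w+)*)"

def bootstrap(sentences, seeds, rounds=2):
    """Alternate between learning patterns from known facts and applying them."""
    facts = set(seeds)
    patterns = set()
    for _ in range(rounds):
        # Learn: the text between a known (subject, object) pair becomes a pattern.
        for subj, obj in list(facts):
            for sentence in sentences:
                m = re.search(re.escape(subj) + r"(.+?)" + re.escape(obj), sentence)
                if m:
                    patterns.add(m.group(1))
        # Apply: every learned pattern is used to mine new (subject, object) pairs.
        for pattern in patterns:
            rx = re.compile(NAME + re.escape(pattern) + NAME)
            for sentence in sentences:
                for m in rx.finditer(sentence):
                    facts.add((m.group(1), m.group(2)))
    return facts, patterns

sentences = [
    "Henrik Ibsen wrote Peer Gynt in 1867.",
    "Knut Hamsun wrote Hunger in 1890.",
]
facts, patterns = bootstrap(sentences, {("Henrik Ibsen", "Peer Gynt")})
```

Starting from a single seed fact, the sketch learns the pattern between the two mentions and reuses it to find a second fact in the other sentence. A real system would also need to score patterns and filter noisy matches.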
In summary, the main contributions of this thesis are:
- FRBR-ML: a generic framework for exploiting metadata which includes: (i) a method to extract entities, attributes and relationships from existing legacy metadata, (ii) novel techniques for correction, enhancement and semantic enrichment of the metadata, and (iii) metrics to assess the quality of extraction.
- SPIDER: a prototype that supports extraction of relational facts at Web-scale. Contrary to most knowledge extraction approaches, we tackle the problem of uniquely identifying entities both to extend their list of spelling forms and to facilitate the matching to LOD entities. Furthermore, in addition to the flexible pattern definition scheme, SPIDER enables a provenance-aware extraction method, which prudently refines extracted facts by considering the PageRank and SpamScore as well as the relevance score of the source document.
- KIEV: a prototype that takes the development of SPIDER to the next stage by enabling verification of facts using two evidence-based techniques: (i) classification, which checks the type of relationship with a machine learning approach, and (ii) linking, which uses existing semantic knowledge bases to discover a local entity’s correspondence in another data source.
- FRBRpedia: a prototype developed to utilize attribute-oriented linking of local entities to the corresponding entities in external semantic knowledge bases. As one of the most basic tasks of knowledge base population, linking demonstrates the power of Linked Data applications. Finally, linking is commonly seen as a required step for publishing data as LOD.
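The provenance-aware refinement mentioned for SPIDER weighs extracted facts by the PageRank, SpamScore and relevance score of their source documents. The abstract does not say how these signals are combined, so the following is only a sketch under stated assumptions: all three scores are taken to be normalised to [0, 1], the weights are arbitrary, evidence from several sources is combined with a noisy-or rule, and the names `Source`, `source_trust`, `fact_confidence` and `refine` are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Source:
    pagerank: float    # assumed normalised to [0, 1]; higher is better
    spam_score: float  # assumed normalised to [0, 1]; higher is worse
    relevance: float   # assumed normalised to [0, 1]; higher is better

def source_trust(src, weights=(0.4, 0.3, 0.3)):
    """Weighted combination of the three signals (weights are illustrative)."""
    w_pr, w_spam, w_rel = weights
    return w_pr * src.pagerank + w_spam * (1 - src.spam_score) + w_rel * src.relevance

def fact_confidence(sources):
    """Noisy-or: the fact is rejected only if every supporting source is distrusted."""
    p_wrong = 1.0
    for src in sources:
        p_wrong *= 1 - source_trust(src)
    return 1 - p_wrong

def refine(facts, threshold=0.5):
    """Keep only facts whose combined confidence reaches the threshold."""
    kept = {}
    for fact, sources in facts.items():
        confidence = fact_confidence(sources)
        if confidence >= threshold:
            kept[fact] = confidence
    return kept

good = Source(pagerank=0.9, spam_score=0.1, relevance=0.8)
bad = Source(pagerank=0.1, spam_score=0.9, relevance=0.2)
kept = refine({"fact_a": [good], "fact_b": [bad, bad]})
```

With these illustrative numbers, a fact backed by one well-ranked, low-spam source is kept, while a fact backed only by two poorly ranked, spam-like sources falls below the threshold.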
The methods and solutions proposed in this thesis provide a solid foundation for automatically populating knowledge bases from a wide range of sources. The feasibility of the presented approaches has been tested through experimental evaluation using real-world datasets. A general conclusion is that complementing knowledge extraction from metadata with external sources results in less missing and ambiguous information and in a more complete knowledge base.
Place, publisher, year, edition, pages
Doctoral theses at NTNU, ISSN 1503-8181 ; 2013:289
Knowledge bases, Knowledge extraction, Metadata, FRBR, Entity matching
Information and communication systems
Identifiers
URN: urn:nbn:no:ntnu:diva-23381
ISBN: 978-82-471-4709-2 (printed ver.)
ISBN: 978-82-471-4710-8 (electronic ver.)
OAI: oai:DiVA.org:ntnu-23381
DiVA: diva2:662303