Link Extraction for Crawling Flash on the Web
Linköping University, Department of Computer and Information Science. Linköping University, The Institute of Technology.
2015 (English). Independent thesis, Advanced level (professional degree), 20 credits / 30 HE credits. Student thesis.
Abstract [en]

The set of web pages not reachable using conventional web search engines is usually called the hidden or deep web. Flash files are one client-side hurdle to crawling the hidden web.

This thesis presents a tool for extracting links from Flash files up to version 8 to enable web crawling. The files are both parsed and selectively interpreted to extract links. The purpose of the interpretation is to simulate the normal execution of Flash in the Flash runtime of a web browser. The interpretation is a low-level approach that allows the extraction to occur offline, without automating a web browser. A virtual machine is implemented, and a set of limitations is chosen to reduce development time and maximize the coverage of interpreted byte code.
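As a rough illustration of what "selectively interpreted" can mean, the sketch below runs a tiny stack machine over a stream of Flash-style actions and records the URL arguments it sees. It is not the thesis's implementation. The opcode values and record layout follow the SWF action model as understood here and should be treated as assumptions, and the names `extract_links`, `read_cstring`, and `action` are hypothetical helpers invented for this example.

```python
import struct

# Assumed opcode values, modelled on the SWF action format.
ACTION_END = 0x00
ACTION_STRING_ADD = 0x21
ACTION_GET_URL = 0x83
ACTION_PUSH = 0x96
ACTION_GET_URL2 = 0x9A


def read_cstring(data, pos):
    """Return (string, next_pos) for a null-terminated string at pos."""
    end = data.index(b"\x00", pos)
    return data[pos:end].decode("utf-8", errors="replace"), end + 1


def extract_links(code):
    """Interpret a small subset of actions and collect URL strings."""
    stack, links, pos = [], [], 0
    while pos < len(code):
        op = code[pos]
        pos += 1
        if op == ACTION_END:
            break
        payload = b""
        if op >= 0x80:  # long actions carry a 2-byte little-endian length
            (length,) = struct.unpack_from("<H", code, pos)
            payload = code[pos + 2 : pos + 2 + length]
            pos += 2 + length
        if op == ACTION_PUSH:
            p = 0
            while p < len(payload):
                if payload[p] != 0:  # only string literals (type 0) are modelled
                    stack.append(None)  # unknown value
                    break
                value, p = read_cstring(payload, p + 1)
                stack.append(value)
        elif op == ACTION_STRING_ADD:
            right = stack.pop() if stack else None
            left = stack.pop() if stack else None
            known = left is not None and right is not None
            stack.append(left + right if known else None)
        elif op == ACTION_GET_URL:  # URL is stored inline in the action
            url, _ = read_cstring(payload, 0)
            links.append(url)
        elif op == ACTION_GET_URL2:  # target and URL come from the stack
            _target = stack.pop() if stack else None
            url = stack.pop() if stack else None
            if url is not None:
                links.append(url)
        # All other actions are skipped: a deliberate limitation of this sketch.
    return links


def action(op, payload=b""):
    """Encode one action record (hypothetical helper for building test input)."""
    if op >= 0x80:
        return bytes([op]) + struct.pack("<H", len(payload)) + payload
    return bytes([op])


if __name__ == "__main__":
    demo = (
        action(ACTION_PUSH, b"\x00http://example.com/\x00")
        + action(ACTION_PUSH, b"\x00page.html\x00")
        + action(ACTION_STRING_ADD)
        + action(ACTION_PUSH, b"\x00_blank\x00")
        + action(ACTION_GET_URL2, b"\x00")
        + action(ACTION_GET_URL, b"http://example.org/a\x00_self\x00")
        + action(ACTION_END)
    )
    print(extract_links(demo))
```

The demo shows why interpretation can find links that plain string scanning would miss: the first URL exists only after two pushed strings are concatenated at run time. Skipping unmodelled actions keeps the machine small, at the cost of possibly losing track of the stack, which is the kind of trade-off between development time and coverage that the abstract mentions.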

Out of a test set of about 3500 randomly sampled Flash files, the link extractor found links in 34% of the files. The resulting estimated improvement in web search engine coverage is almost 10%.

Place, publisher, year, edition, pages
2015. 54 p.
Keyword [en]
Flash, crawling, spidering, deep web, hidden web, virtual machine, interpretation
National Category
Computer Science
URN: urn:nbn:se:liu:diva-117604
ISRN: LIU-IDA/LITH-EX-A--15/11--SE
OAI: diva2:810000
External cooperation
Picsearch AB (publ)
Subject / course
Computer science
Available from: 2015-05-11. Created: 2015-05-05. Last updated: 2015-05-11. Bibliographically approved.

Open Access in DiVA

fulltext (613 kB), 74 downloads
File information
File name: FULLTEXT01.pdf
File size: 613 kB
Checksum: SHA-512
Type: fulltext
Mimetype: application/pdf

Search in DiVA

By author/editor
Antelius, Daniel
By organisation
Department of Computer and Information Science
The Institute of Technology
Computer Science
