Course contents *
The Internet contains a huge amount of information, which is growing at an ever-increasing pace. People, organizations and corporations around the world continuously add different types of information to the web in various languages. The web therefore contains potentially very interesting and valuable information. This course investigates various techniques for processing the web in order to extract such information, refine it and make it more structured, thus making it both more valuable and more accessible. These techniques are often referred to as web mining techniques.
The Internet domains that we will study are databases, e-commerce web sites, wikis, virtual communities and blogs. The Semantic Web and Web 2.0 are two other concepts relevant to the course. Web mining is usually divided into three main areas: web content mining, web structure mining and web usage mining. Web structure mining is closely related to information search techniques, and web usage mining to opinion mining or sentiment analysis. Also related is the automatic construction of sociograms. Web content mining can, for example, be used to find the cheapest airline tickets by monitoring the web-based databases of all airlines and attempting to find the lowest common denominator of these databases.
The web mining techniques and related fields explored in the course include human language technology, machine learning, statistics, information retrieval and extraction, text mining, text summarization, automatic classification, clustering, wrapper induction, normalization of data, match cardinality of data in different databases, interface matching, schema matching, sentiment analysis, opinion mining, extraction of comparatives and forensic linguistics.
Intended learning outcomes *
The course aims to give insight into data mining techniques applied to Internet-related data and into what they can be used for. After completing the course, the student should be able to:
1. Identify and differentiate between application areas for web content mining, web structure mining and web usage mining.
2. Describe key concepts such as deep web, surface web, semantic web, web log, hypertext, social network, information synthesis, corpora and evaluation measures such as precision and recall.
3. Discuss the use of methods and techniques such as word frequency and co-occurrence statistics, normalization of data, machine learning, clustering, vector space models and lexical semantics.
4. Explain in detail the architecture and main algorithms commonly used by web mining applications.
5. Appropriately select between different approaches and techniques of web mining for e.g. sentiment analysis, targeted marketing, linguistic forensics, topic and trend detection and tracking, and multi-document summarization (information aggregation).
6. Apply human language technology tools such as tokenizers, stemmers, part-of-speech taggers, noun phrase chunkers and shallow parsers to different types of web content gathered from, for instance, e-commerce sites.
7. Perform analysis of linguistically processed data using a suitable statistical classifier.
8. Set requirements for, compare and assess the quality of existing web mining tools.
9. Analyze and explain which web mining problems are satisfactorily solved, what is being worked on at the research frontier and what still lies beyond the current state of the art.
10. Independently solve a well-defined practical web mining problem using tools and techniques introduced in the course, or analyze it through theoretical studies seeking information beyond the course literature.
11. Convey the outcome of their own work on web mining orally and in written form to peers using relevant and appropriate terminology.
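As a small illustration of the evaluation measures named in outcome 2, precision and recall could be computed roughly as follows. This is only a sketch in Python and not part of the course material; the function name `precision_recall` and the page identifiers are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) of the retrieved items against the relevant ones."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    # Precision: share of the retrieved items that are actually relevant.
    precision = true_positives / len(retrieved) if retrieved else 0.0
    # Recall: share of the relevant items that were actually retrieved.
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: a tool flags four pages as product reviews,
# three of which really are reviews; six review pages exist in total.
retrieved = {"p1", "p2", "p3", "p4"}
relevant = {"p1", "p2", "p3", "p5", "p6", "p7"}
precision, recall = precision_recall(retrieved, relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=0.75 recall=0.50
```

Returning 0.0 for an empty set is an assumption made here to avoid division by zero; other conventions exist.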
Credits: 7.5 hp
Lectures: approx. 12 lectures x 2 hours
Lab exercises: approx. 5 occasions x 3 hours
Project and seminar task: approx. 1 lecture x 2 hours