Multilingual Logfile Analysis (LogCLEF)

The LogCLEF multilingual log analysis evaluation initiative has created the first long-term standard collection for evaluation purposes in the area of log analysis.
The LogCLEF 2011 lab is the continuation of the past two editions: a pilot task in CLEF 2009 and a workshop in CLEF 2010.

LogCLEF 2011 is one of the benchmarking activities of CLEF 2011, which will take place in Amsterdam in September 2011.

Deadline for registration is May 2011. Registration details will follow.

---> Goal <---

The research goal of LogCLEF is the analysis and classification of queries and the definition of search success, in order to understand search behaviour in multilingual contexts.

Another important goal of LogCLEF is the creation of a community which includes groups from the information retrieval community and related areas such as data mining. The creation of a standard evaluation resource is a first step towards fostering communication between these groups, which have so far been largely separate.

The exchange of systems and components as well as the sharing of heterogeneous annotation of the logs needs to follow in order to advance the state of the art in this research area.

---> Tasks <---

We present three tasks, based on the exchange of ideas and proposals among the participants during the last LogCLEF 2010 workshop:

1. Language identification task: participants are required to recognize the actual language of the submitted query. Resources manually annotated by participants of previous editions will be made available as a basic set of ground-truth data (further manually generated resources will be created during the first months of 2011). This ground truth will be used, for example, to evaluate automatic language recognition algorithms.

2. Query classification: participants are required to annotate each query with a label which represents a category of interest. A proposed initial set of categories of interest is:
•    Person (e.g. Leonardo da Vinci)
•    Geographic Location (e.g. Mont Saint-Michel)
•    Event (e.g. Revolución francesa)
•    Work title (e.g. Divina Commedia)
•    Domain Specific (e.g. Panthera pardus)
•    Other (e.g. ISBN)

3. Success of a query: participants are required to study the trend of the success of a search. Success can be defined in terms of time spent on a page, number of clicked items, and actions performed while browsing the result list.
A common definition of user session will be given to participants. Participants are also encouraged to carry out two subtasks: a) query re-finding, when a user clicks an item following a search and later clicks on the same item via another search; b) query refinement, when a user starts with a query and the following queries in the same session are a generalization, specification, or shift of the original one.
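
As an illustration of subtask a), the following is a minimal sketch of how re-finding events could be detected in a chronologically ordered click log. The Click record and its field names are hypothetical and not part of the LogCLEF data format, and "another search" is approximated here as a click reached through a different query string.

```python
from collections import defaultdict
from typing import NamedTuple


class Click(NamedTuple):
    """One click on a result item (hypothetical, simplified log record)."""
    user: str    # user identifier
    query: str   # query that produced the result list
    item: str    # identifier of the clicked item


def find_refinding(clicks):
    """Return (user, item, first_query, later_query) tuples.

    `clicks` must be in chronological order. An event is reported when a
    user clicks an item that the same user already clicked earlier after
    a different query.
    """
    seen = defaultdict(dict)  # user -> {item: query of the first click}
    events = []
    for c in clicks:
        first_query = seen[c.user].get(c.item)
        if first_query is None:
            seen[c.user][c.item] = c.query
        elif first_query != c.query:
            events.append((c.user, c.item, first_query, c.query))
    return events


# Made-up example data, for illustration only.
log = [
    Click("u1", "divina commedia", "doc42"),
    Click("u1", "dante", "doc42"),
    Click("u2", "dante", "doc42"),
]
print(find_refinding(log))
```

Only user u1 is reported in this example, because u2 clicks the item for the first time. A real analysis would also have to decide whether resubmitting the same query later counts as another search.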

---> Log Data collection <---

The European Library (TEL) dataset

The TEL search/action logs are stored in a relational table and contain different types of user actions and choices. Each record represents a user action; the most significant fields are:

•	A numeric id identifying registered users, or “guest” otherwise;
•	User’s IP address;
•	An automatically generated alphanumeric id identifying sequential actions of the same user (sessions);
•	Query contents;
•	Name of the action that a user performed;
•	The corresponding collection’s alphanumeric id;
•	Date and time of the action’s occurrence.
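
The record layout listed above could be modelled roughly as follows. This is only a sketch: the class name, the field names, the action names and the sample values are assumptions made for illustration, not the actual column names of the TEL table.

```python
from dataclasses import dataclass
from datetime import datetime
from itertools import groupby


@dataclass(frozen=True)
class TelRecord:
    """One user action (field names are hypothetical)."""
    user_id: str        # numeric id of a registered user, or "guest"
    ip: str             # user's IP address
    session_id: str     # alphanumeric id shared by actions of one session
    query: str          # query contents
    action: str         # name of the action performed
    collection_id: str  # alphanumeric id of the collection
    timestamp: datetime # date and time of the action


def group_by_session(records):
    """Map each session id to its records, ordered by time."""
    ordered = sorted(records, key=lambda r: (r.session_id, r.timestamp))
    return {sid: list(group)
            for sid, group in groupby(ordered, key=lambda r: r.session_id)}


# Made-up records, deliberately out of chronological order.
records = [
    TelRecord("guest", "192.0.2.1", "s1", "divina commedia", "view_full",
              "c01", datetime(2010, 1, 5, 10, 1)),
    TelRecord("guest", "192.0.2.1", "s1", "divina commedia", "search",
              "c01", datetime(2010, 1, 5, 10, 0)),
    TelRecord("1234", "192.0.2.7", "s2", "dante", "search",
              "c02", datetime(2010, 1, 5, 11, 0)),
]
sessions = group_by_session(records)
for sid, recs in sessions.items():
    print(sid, [r.action for r in recs])
```

Grouping on the session id rather than on the IP address follows the description of that field as the one identifying sequential actions of the same user.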

Three and a half years of log data will be released:

- January 2007-June 2008, 1,900,000 records (distributed at LogCLEF 2009)
- January 2009-December 2009, 760,000 records (distributed at LogCLEF 2010)
- January 2010-December 2010, 950,000 records (to be distributed at LogCLEF 2011)

Sogou dataset

The Sogou query logs contain queries to the Chinese Sogou search engine. The data contains:

•	a user ID,
•	the query terms,
•	URL in the result ranking, and
•	user click information.   

Deutscher Bildungsserver (DBS) dataset

The quality-controlled "Deutscher Bildungsserver" is a clearinghouse for educational resources on the Web. It also contains content provided by the DIPF as well as descriptions and reviews of Web sites on education. The Internet resources (web sites) are described, checked for their quality, manually indexed and classified.

The logs were collected between September and November 2009. They are server logs in standard format in which the searches and the results viewed can be observed. An excerpt is shown in Table 2. The logs have been anonymized by partially obscuring the IP addresses of users.

The two upper levels of server names or IP addresses have been hashed. This allows the reconstruction of sessions within the data. Note that accesses by search engine bots are still present in the logs. The logs make it possible to observe two types of user queries:

•	queries in search engines (in the referrer when DBS files were found using a search engine)
•	queries within the DBS (see query parameters in metasuche/qsuche)
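
For the first type of query, the search terms have to be recovered from the referrer URL. The following is a minimal sketch using only the Python standard library; the parameter name "q" and the example URL are assumptions, since the parameter that carries the query differs between search engines and is not specified here.

```python
from urllib.parse import urlparse, parse_qs


def referrer_query(referrer, param="q"):
    """Return the query string found in a referrer URL, or None.

    `param` is the name of the URL parameter assumed to carry the
    user's query ("q" is a common choice, not a guarantee).
    """
    parsed = urlparse(referrer)
    values = parse_qs(parsed.query).get(param)
    return values[0] if values else None


# Hypothetical referrer values as they might appear in a server log.
print(referrer_query("http://www.example.com/search?q=lehrplan+mathematik&hl=de"))
print(referrer_query("-"))  # a lone dash commonly marks a missing referrer
```

The same function could be pointed at DBS-internal request URLs by passing the relevant parameter name, once the actual parameters used by metasuche/qsuche are known.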

---> Timeline <---

* March-April 2011: data release (guidelines, corpus, training topics)

* April-May 2011: participants will be required to manually annotate a set of queries to produce a test set of topics

* June 2011: submission deadline for all the tasks

* July 2011: evaluation of submissions, results made available

* August 2011: submission of Notebook Papers to CLEF 2011

---> Organizers <---

Giorgio Maria Di Nunzio (University of Padua, IT)

Johannes Leveling (Dublin City University, IE)

Thomas Mandl (University of Hildesheim, DE)

Steering Committee

Jim Jansen (Pennsylvania State University, US)

Jaap Kamps (University of Amsterdam, NL)

Inderjeet Mani (Mitre Corp, US)

