Institute for German Language

Tutorial day

On November 8th, 2016, several tutorials will take place on the premises of the Institute for the German Language. Participation in the tutorials is free and open both to regular participants of GaC 2016 and to all other interested researchers and students, provided places are still available. To register for a tutorial, please click on the "Register" button.

Overview of the Programme

TIME | TITLE | TUTOR(S) | LANGUAGE | ROOM
09:00 - 10:30 | Working with Web Corpora (Part 1) | Felix Bildhauer and Roland Schäfer (Mannheim / Berlin) | English | Auditorium
10:30 - 10:45 | COFFEE BREAK
10:45 - 12:15 | Working with Web Corpora (Part 2) | Felix Bildhauer and Roland Schäfer (Mannheim / Berlin) | English | Auditorium
12:15 - 14:00 | LUNCH BREAK
14:00 - 15:30 | InterCorp: Exploring a Multilingual Parallel Corpus | Alexandr Rosen (Prague) | English | Auditorium
14:00 - 15:30 (parallel) | Introduction to Corpus Analysis with KorAP | Nils Diewald and Eliza Margaretha (Mannheim) | English / German | Room 1.28
15:30 - 15:45 | COFFEE BREAK
15:45 - 17:15 | Visualizing Linguistic Data with the Free Graphics and Statistics Environment R (Part 1) | Sandra Hansen-Morath and Sascha Wolfer (Mannheim) | English / German | Auditorium
17:15 - 17:30 | COFFEE BREAK
17:30 - 19:00 | Visualizing Linguistic Data with the Free Graphics and Statistics Environment R (Part 2) | Sandra Hansen-Morath and Sascha Wolfer (Mannheim) | English / German | Auditorium
20:00 | GET-TOGETHER at Wirtshaus UHLAND


Working with Web Corpora

Felix Bildhauer and Roland Schäfer (Mannheim / Berlin)

Web corpora (huge, post-processed collections of web pages) provide an increasingly important source of data for linguistic research, thanks to their size, content, and availability. The last decade has seen important developments in the construction of web corpora, and the current generation surpasses its predecessors in cleanliness, level and quality of linguistic annotation and enrichment with meta data. At the same time, web corpora have peculiarities (such as sampling biases, duplication, non-standard orthography and language, lack of some meta data) that may discourage linguists from using them. Linguists working with web corpora should at all times be aware of these limitations.

This workshop will start with a brief introduction to the making of web corpora, discussing some of the most important questions of design and processing, including linguistic annotation. The main focus of the workshop, however, is on practical questions that frequently arise from a linguist's perspective. In particular, we will discuss what web corpora can (and cannot) do for linguists in their daily corpus linguistic work, regarding such issues as reliability of annotation, availability of meta data, data integrity and representativeness and practical limitations of typical query engines. Much of the workshop will be hands-on examples and exercises, and we will introduce practical solutions and workarounds for a number of frequently encountered problems. For maximal benefit, participants should bring their own laptop computer.

Roland Schäfer and Felix Bildhauer have been involved in building corpora from the web since 2011. They have created some of the world's largest web corpora for a variety of languages, including German.


InterCorp: Exploring a Multilingual Parallel Corpus

Alexandr Rosen (Prague)

After a brief introduction of parallel corpora, focusing on their specifics in comparison to standard monolingual corpora, and an overview of those publicly available, we take a closer look at InterCorp, a part of the Czech National Corpus. InterCorp has been on-line since 2008, growing steadily to its present size of 1.7 billion words in 40 languages, with a focus on Czech, but also a substantial share of English, Spanish, German, French, Croatian, Polish, Dutch and a number of other languages. Its core part includes mainly fiction, complemented by legal and journalistic texts, parliament proceedings and film subtitles. The texts are sentence-aligned, tagged (in 23 languages) and lemmatized (in 20 languages). In the practical, hands-on part of the tutorial, we learn how to:

  • Select languages and texts
  • Make a query including forms, lemmas and tags
  • Specify view options, sorting, sampling, filtering, saving results
  • Build customized subcorpora
  • Use frequency statistics, identify collocations
  • Ask for translations of a lexeme, based on the whole corpus or a part of it

Finally, we will discuss some challenges and prospects of the ongoing project. Experience with corpus search tools will be useful, as will registration as a user of the Czech National Corpus.



Introduction to Corpus Analysis with KorAP

Nils Diewald and Eliza Margaretha (Mannheim)

In recent years, technical advances and the accessibility of resources through the World Wide Web have brought new attention to the field of corpus analysis and to tools that can deal with very large corpora. DeReKo, the German Reference Corpus, for example, has by itself grown beyond 25 billion words (Kupietz and Lüngen, 2014). Additional layers of linguistic annotation increase the amount of data even further, pushing popular corpus analysis applications such as IMS Corpus Workbench (Evert and Hardie, 2011), Annis (Zeldes et al., 2009) or COSMAS II (Bodmer, 1996) to their limits.

KorAP is a web-based corpus analysis platform, developed with a focus on scalability, flexibility, and sustainability, and with the intention of replacing COSMAS II as the main access point to DeReKo in the future. KorAP is capable of dealing with very large, multiply annotated, and heterogeneously licensed text collections. It supports researchers by providing a wide range of query constructs and the ad-hoc creation of virtual corpora. In this tutorial, the developers will introduce KorAP for corpus analysis. The tutorial starts with a brief description of the current state of development and the architecture of the system; participants will then be able to do their own research using KorAP in a hands-on session.

Following a short getting-started guide, all participants will be able to search for linguistic phenomena using KorAP, from simple sequences of words up to complex linguistic structures across multiple annotation layers. They will also be able to construct complex virtual corpora by means of metadata constraints, and to make use of the built-in assistance tools. As KorAP supports multiple query languages such as COSMAS II, ANNIS QL (Rosenfeld, 2010), or Poliqarp (Przepiórkowski et al., 2004; a variant of the popular CQP language), users familiar with these languages will easily be able to work with the new system. However, previous knowledge of corpus analysis platforms or corpus query languages is not necessary. To close the session, the developers would like to gather feedback on the current version of the software and discuss further improvements. For those interested in technical details of the KorAP system, the developers will be available for questions afterwards.

The tutorial welcomes anyone interested in corpus analysis and corpus analysis software. Participants are requested to bring their own laptops for use in the hands-on session. A common browser in a current version should be pre-installed (e.g. Mozilla Firefox, Google Chrome).


Literature

  • Bodmer, Franck (1996): Aspekte der Abfragekomponente von COSMAS II. LDV-INFO, 8:142, pp. 112-122.

  • Evert, Stefan / Hardie, Andrew (2011): Twenty-first century corpus workbench: Updating a query architecture for the new millennium. In: Proceedings of the Corpus Linguistics 2011 Conference, Birmingham, UK.

  • Kupietz, Marc / Lüngen, Harald (2014): Recent Developments in DeReKo. In: Calzolari, Nicoletta et al. (eds.): Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14). Reykjavik: ELRA, 2378-2385.

  • Przepiórkowski, Adam / Krynicki, Zygmunt / Dębowski, Łukasz / Woliński, Marcin / Janus, Daniel / Bański, Piotr (2004): A search tool for corpora with positional tagsets and ambiguities. In: Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC 2004).

  • Rosenfeld, Viktor (2010): An implementation of the Annis 2 query language. Technical report, Humboldt-Universität zu Berlin.

  • Zeldes, Amir / Ritz, Julia / Lüdeling, Anke / Chiarcos, Christian (2009): ANNIS: A Search Tool for Multi-Layer Annotated Corpora. In: Proceedings of Corpus Linguistics 2009, Liverpool, UK.


Visualizing Linguistic Data with the Free Graphics and Statistics Environment R

Sandra Hansen-Morath and Sascha Wolfer (Mannheim)

R is a flexible, free development environment for carrying out statistical analyses. It offers numerous options for data visualization and is very well suited to large data sets. Our workshop provides a strongly application-oriented introduction to the program and, through many practical exercises and linguistic examples, lays the foundations for participants to keep developing their skills with the software on their own. We will present elementary exploratory visualizations and introduce the logic of R's base graphics system. In addition, we will present inferential and multivariate statistics and show how the results of these methods can be displayed visually. We will also show how interactive graphics can be created in R.


Important Dates

  • 31.05.2016: Deadline for abstract submission
  • 17.06.2016: Extended deadline for submissions
  • 15.07.2016: Notification of acceptance
  • 08.11.2016: Tutorial day
  • 09-11.11.2016: Conference
