Workshop & Thematic Session – new deadline: April 22nd

Two events devoted to the use of corpora in investigations of mediated discourse (including, but not limited to, translation, interpreting and non-native language use) are coming up soon! Join us in Poznań or Cardiff, or both!



 Thematic session at the 49th Poznań Linguistic Meeting (PLM2019)

16-18th September 2019 at Adam Mickiewicz University in Poznań, Poland



Convenors: Marta Kajzer-Wietrzny[1], Magdalena Perdek, Faculty of English, Adam Mickiewicz University



Ever since its inception, Corpus-based Translation Studies has been preoccupied with systematic and rigorous investigations of translations in search of patterns that set translations apart and shed more light on the nature of translation. These investigations have taken various points of reference: source texts, native non-translated texts, texts translated in a different mode, edited or even paraphrased texts. In light of recent developments in TS, it seems that adding non-translated, non-native texts to this set might bring an even more illuminating perspective on different forms of bilingual communication involving independent or dependent text production (Halverson 2003; Chesterman 2004; Lanstyák and Heltai 2012; Kruger and Rooy 2016).

Undeniably, data from the European Parliament present a great opportunity for such research. Not only does the institution provide a sizeable sample of documents translated into many languages, but the debates held at the EP are also available online with simultaneous interpretation. At the same time, verbatim reports of the original speeches can be consulted online, and until 2011 these reports were also translated. As the speakers use either their native languages or English, this data source provides a unique opportunity to compare a variety of forms of mediated discourse in both the spoken and the written mode (Shlesinger 2008). The European Parliament website provides information about the speakers and the topics of the debates, and the translations and interpretations are performed by experienced professionals. From the methodological perspective, the EP material also guarantees a great degree of homogeneity, as the speeches in various modes are delivered in the same institutional setting (Monti et al. 2005). This is particularly valuable in corpus studies, where data comparability is frequently a challenge.

In this thematic session we welcome papers which report on empirical explorations of diverse forms of mediated discourse (translation, interpreting, non-native language use), including, but not limited to, comparative studies of two modes based on corpora comprising EP debates, e.g. EPIC (Bendazzoli and Sandrelli 2005), EPTIC (Ferraresi and Bernardini 2019), EPICG (Defrancq et al. 2015), EUROPARL (Koehn 2005) or others. Comparative studies of EP data and other registers are also encouraged.

Proposed speakers: Silvia Bernardini (University of Bologna), Adriano Ferraresi (University of Bologna), Ilmari Ivaska (University of Turku), Marie-Aude Lefer (UCLouvain), Tamara Mikolič Južnič (University of Ljubljana), Neža Pisanski Peterlin (University of Ljubljana)


Abstracts for this session, formatted in line with the guidelines on the PLM website, may be submitted until April 22nd.





Bendazzoli, C., and A. Sandrelli. 2005. An approach to corpus-based interpreting studies: developing EPIC (European Parliament Interpreting Corpus). In EU High Level Scientific Conference Series, 149.

Chesterman, Andrew. 2004. Beyond the particular. In Translation universals: Do they exist, ed. Anna Mauranen and Pekka Kujamäki, 33–49. Amsterdam: John Benjamins.

Defrancq, Bart, Koen Plevoets, and Cédric Magnifico. 2015. Connective items in interpreting and translation: Where do they come from? In Yearbook of Corpus Linguistics and Pragmatics 2015, 195–222. Springer.

Ferraresi, Adriano, and Silvia Bernardini. 2019. Building EPTIC: A many-sided, multi-purpose corpus of EU parliament proceedings. In Parallel Corpora for Contrastive and Translation Studies: New resources and applications, ed. Irene Doval and M. Teresa Sánchez Nieto. Accessed January 13.

Halverson, Sandra. 2003. The cognitive basis of translation universals. Target 15: 197–241.

Koehn, Philipp. 2005. Europarl: A parallel corpus for statistical machine translation. In MT summit, 5:79–86.

Kruger, Haidee, and Bertus van Rooy. 2016. Constrained language: A multidimensional analysis of translated English and a non-native indigenised variety of English. English World-Wide 37: 26–57. doi:10.1075/eww.37.1.02kru.

Lanstyák, István, and Pál Heltai. 2012. Universals in language contact and translation. Across Languages and Cultures 13: 99–121. doi:10.1556/Acr.13.2012.1.6.

Monti, Cristina, Claudio Bendazzoli, Annalisa Sandrelli, and Mariachiara Russo. 2005. Studying directionality in simultaneous interpreting through an electronic corpus: EPIC (European Parliament Interpreting Corpus). Meta: Journal des traducteurs/Meta: Translators’ Journal 50.

Shlesinger, Miriam. 2008. Towards a Definition of Interpretese. Amsterdam/Philadelphia: John Benjamins.



 Workshop at the 10th International Corpus Linguistics Conference (CL2019)

22-26th July 2019 at Cardiff University in Cardiff, Wales, UK


Silvia Bernardini¹, Adriano Ferraresi¹, Ilmari Ivaska¹,², Marta Kajzer-Wietrzny¹,³ [1]
¹University of Bologna, ²University of Naples “L’Orientale”, ³Adam Mickiewicz University

In the changing landscape of corpus linguistics, massive reference corpora have ceased to be the favoured tools for linguistic investigations. Linguists can and often do resort to the more accessible alternative of smaller but highly specialized corpora, as “in fact the majority of potential questions a linguist could ask, (…) fall outside the types of research (…) general corpora can facilitate” (Ross 2018, 18). This holds true for a number of research areas, e.g. historical linguistics, language acquisition or translation studies, where, for example, the size of the interpreting corpora analysed can frequently even be considered ‘microscopic’ (Bernardini et al. 2018, 22). At the same time, the smaller, specialized corpora are becoming increasingly complex and inclusive in that they may also account for different language varieties, languages or communication channels.

Creating even small corpora of such a complex structure poses challenges, which will be discussed in this half-day workshop. One may have texts translated into different language versions, transcribed spoken data and/or audio or video recordings, but only when these data are made available and easily accessible from (parallel) concordance lines do they become fully usable. In such a scenario, corpus compilation becomes a formidable task, since “recording, aligning and transcribing […] different streams of data is naturally more time consuming and technically difficult than when dealing with a single stream” (Knight 2011, 397). On top of that, specialized corpora are usually designed for contrastive analyses of linguistic features, which require specific annotation produced under specific conditions. This can only be accomplished by ensuring a rigorous treatment of metadata.

These and many more challenges have been addressed while creating the small but highly specialized European Parliament Translation and Interpreting Corpus (Bernardini et al. 2016). EPTIC is an intermodal parallel corpus comprised of multilingual speeches delivered at the European Parliament, as well as their official interpretations and translations, aligned to each other at sentence level, with transcripts time-aligned with the corresponding videos. The EP has features that make it almost unique as a source of intermodal corpora: first and foremost, videos of the speeches interpreted in all the working languages, verbatim reports and translations (up to 2011) are made available for easy download from the Parliament website (Plenary session recordings at the European Parliament 2019). Furthermore, a wide range of languages are available, used both as source and target languages, and speeches are delivered both in impromptu and prepared delivery modes, in the case of English and a few other languages by native as well as non-native speakers, offering a wealth of dimensions of comparison. The current version of the corpus features speeches delivered mostly in the first months of 2011 in five languages (English, French, Italian, Polish, Slovene). In the selection of corpus materials, speeches were sampled from a limited number of plenary sittings, so that texts are as comparable as possible in terms of topics covered. EPTIC is annotated with rich speaker and event-related metadata, as well as with part-of-speech and lemma information using TreeTagger (Schmid 1995). The corpus is made available to the public through a NoSketch Engine platform (Rychlý 2007) hosted by the University of Bologna.
Using the EPTIC experiences as a laboratory, the workshop will illustrate the range of issues pertaining to aligning data in multiple modes and at multiple levels. Its aim is to interactively present these challenges in a user-friendly way, and share the designed solutions with the participants so that they can immediately afterwards apply them to their own monolingual or multilingual corpus projects.

The workshop will be structured around the following three subtopics:
● aligning text to text and text to video;
● tagging and indexing parallel and multi-modal corpora;
● consulting such corpora.

The expected concrete learning outcomes of the workshop will be structured around these subtopics, and activities will be based on preprocessed data that could eventually be included in the EPTIC corpus. For this reason, participants are encouraged to get in touch with the organizers ahead of time for specific languages or language-pairs of interest.
In line with the idea that open source software may “enhance accessibility for all and will promote the cross fertilization of corpus based methods” (Knight 2011, 408), the workshop will be largely based on the use of popular, user-friendly software freely available to the public. The participants will thus gain hands-on experience in sentence-level alignment of multiple texts in InterText (Vondricka 2014) and sentence-level text-to-video alignment in Aegisub (Aegisub Advanced Subtitle Editor 2019). The processed parallel and multi-modal corpus files will then be tagged and indexed with Sketch Engine (Kilgarriff et al. 2014), with a special focus on how to handle metadata in the corpus indexing phase. Finally, the participants will be shown how to fully exploit the potential of tagged, monolingual and bilingual, parallel and multi-modal searches using the NoSketch Engine platform (Rychlý 2007).

The last part of the workshop will be devoted to discussing participants’ needs and ideas in terms of corpus building and/or consultation during one-to-one debriefing sessions with the workshop instructors.

TECHNICAL REQUIREMENTS: The workshop participants will need laptops equipped with Intertext Editor (Vondricka 2014) and Aegisub software (Aegisub Advanced Subtitle Editor 2019). Both programmes are freely available. Please install them prior to the workshop.

Aegisub Advanced Subtitle Editor. 2019. Accessed January 13.
Bernardini, Silvia, Adriano Ferraresi, and Maja Miličević. 2016. From EPIC to EPTIC—Exploring simplification in interpreting and translation from an intermodal perspective. Target 28: 61–86.
Bernardini, Silvia, Adriano Ferraresi, Mariachiara Russo, Camille Collard, and Bart Defrancq. 2018. Building interpreting and intermodal corpora: a how-to for a formidable task. In Making Way in Corpus-based Interpreting Studies, 21–42. Springer.
Ferraresi, Adriano, and Silvia Bernardini. 2019. Building EPTIC: A many-sided, multi-purpose corpus of EU parliament proceedings. In Parallel Corpora for Contrastive and Translation Studies: New resources and applications, ed. Irene Doval and M. Teresa Sánchez Nieto. Accessed January 13.
Kilgarriff, Adam, Vít Baisa, Jan Bušta, Miloš Jakubíček, Vojtěch Kovář, Jan Michelfeit, Pavel Rychlý, and Vít Suchomel. 2014. The Sketch Engine: ten years on. Lexicography 1: 7–36.
Knight, Dawn. 2011. The future of multimodal corpora / O futuro dos corpora modais. Revista Brasileira de Linguística Aplicada 11: 391–415.
Plenary session recordings at the European Parliament. 2019. January 13.
Ross, Daniel. 2018. Small corpora and low-frequency phenomena: try and beyond contemporary, standard English. Corpus.
Rychlý, Pavel. 2007. Manatee/bonito-a modular corpus manager. In 1st Workshop on Recent Advances in Slavonic Natural Language Processing, 65–70.
Schmid, Helmut. 1995. Treetagger: a language independent part-of-speech tagger. Institut für Maschinelle Sprachverarbeitung, Universität Stuttgart 43: 28.
Vondricka, Pavel. 2014. Aligning parallel texts with InterText. In LREC, 1875–1879.



[1] Marta Kajzer-Wietrzny is supported by the Polish Ministry of Science and Higher Education, Mobilność Plus (Mobility Plus) programme, grant number: 1610/MOB/V/2017/0

