7 November 2014

Corpus extension

In a machine translation MOOC I'm taking, we got a task to do word alignment (a preprocessing step for statistical machine translation) on a parallel corpus from the Bible. I don't know why I didn't think of it before, but since the Bible is the most translated book in the world, there obviously exists an Albanian translation as well. A quick search led me to a parallel Bible corpus with 100 languages [1]. The texts are available as XML files, so I wrote another script to extract the plain-text sentences. Note that it doesn't preserve the order of the books, but that is not important for my purposes. The Albanian Bible has about 750,000 words.

use strict;
use warnings;
use XML::Simple;

my $bible = XMLin('Bible.xml');
my $body  = $bible->{text}->{body}->{div};

open(my $out, '>:utf8', 'bible.txt') or die "Cannot open bible.txt: $!";
foreach my $book (values %{$body}) {
    # Iterate over chapters in numerical order, skipping attribute
    # keys such as id, seg and type
    foreach my $chapter_key (sort { getNum($a) <=> getNum($b) }
                             grep { !/id|seg|type/ }
                             keys %{$book->{div}}) {
        my $chapter = $book->{div}->{$chapter_key};
        foreach my $verse_key (sort { getNum($a) <=> getNum($b) }
                               keys %{$chapter->{seg}}) {
            my $content = $chapter->{seg}->{$verse_key}->{content};
            $content =~ s/\t+//g;
            $content =~ s/\n//g;
            print $out $content . "\n";
        }
    }
}
close($out);

# Helper function for sorting by chapter/verse numbers:
# extracts the number after the last dot in the element key
sub getNum {
    my $in = shift @_;
    my ($out) = $in =~ m/\.(\d+)$/;
    return $out;
}

This motivated me to look for other parallel corpora, and I discovered OPUS (Open Parallel Corpus) [2], a website compiling parallel corpora from around the web. They have a few corpora that include Albanian, namely two OpenSubtitles [3] corpora, the SETimes corpus [4] and a Quran translation [5]. I applied preprocessing steps similar to those for the Wikipedia corpus, although to a lesser degree because these corpora were already of higher quality; most importantly, the sentences were already split. After preprocessing, the corpora had around 8 million (OpenSubtitles 2012), 12 million (OpenSubtitles 2013), 5 million (SETimes) and 450,000 (Quran) words, respectively.

Combining all these corpora with the Wikipedia corpus I already have, I get a new large corpus just short of 37 million words! This should vastly improve the quality of my results, not least because I now have very diverse texts: a lot of colloquial and spoken language from the subtitles, news articles, and somewhat old-literary language from the religious texts, in addition to the encyclopaedia entries. Now I should also be able to find out more about the verbal system, which would have been hard with Wikipedia articles alone.

[1] http://homepages.inf.ed.ac.uk/s0787820/bible/, available under the CC-BY-NC-SA license

[2] Jörg Tiedemann (2012). Parallel Data, Tools and Interfaces in OPUS. In Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC 2012)

[3] http://www.opensubtitles.org/

[4] http://nlp.ffzg.hr/resources/corpora/setimes/, available under the CC-BY-SA license

[5] http://tanzil.net/trans/

1 November 2014

Most frequent words and n-grams

Frequency lists are widely used in language learning. There are many books and Anki decks available that contain the 5,000 most common words or the 1,000 most frequent verbs, which allow you to learn words that are actually used. Obviously, frequency lists vary a lot depending on what kind of corpus they are compiled from (literary, scientific, speech...), and a Wikipedia corpus is very biased as well.

However, having all the data myself allows me not only to make a word list sorted by frequency, but also to obtain n-gram frequencies. N-grams are sequences of n words: unigrams are single words, bigrams are sequences of two words, and so on. Google offers an Ngram Viewer which can show n-gram frequencies in literary corpora for a number of languages. With n-grams I can also identify common constructions and short phrases, or find out which words usually occur next to any given word.
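To make the idea concrete, here is a minimal Python sketch of n-gram extraction with a sliding window (the toy English sentence is just for illustration; the real pipeline is the Perl script below):

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all n-grams (as tuples) of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the cat sat on the mat and the cat slept".split()
bigram_counts = Counter(ngrams(tokens, 2))
print(bigram_counts.most_common(1))  # [(('the', 'cat'), 2)]
```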

The following Perl script will print frequency-sorted n-grams of an input text to a file:

use utf8;
use strict;
use warnings;
use IO::Handle;
use open qw(:std :utf8);

my $xgram = shift(@ARGV); # 1-, 2-, ... -grams

my %ngrams;
my $ngram_separator = " ";
my $min_frequency = 5;

while (my $line = <>) {
    chomp $line;

    # Split sentence into words, stripping punctuation
    # (but keeping % and . attached to tokens such as numbers)
    my @tokens = split(/ |(?![%.])\p{P} | (?![%.])\p{P}|\p{P}$/, lc($line));
    my @ngram_window;

    foreach my $token (@tokens) {

        # Move sliding ngram window
        if ($token =~ /\S/) {
            push(@ngram_window, $token);
            if (scalar(@ngram_window) == $xgram) {
                my $ngram = join($ngram_separator, @ngram_window);
                $ngrams{$ngram}++;
                shift(@ngram_window); # slide the window by one token
            }
        }
    }
}

# Write ngram frequencies to file, most frequent first
open(my $fh_ngram, '>:encoding(utf8)', $xgram.'grams.csv')
    or die "Cannot open output file: $!";
foreach my $ngram (sort { $ngrams{$b} <=> $ngrams{$a} } keys %ngrams) {
    if ($ngrams{$ngram} >= $min_frequency) {
        print $fh_ngram "$ngrams{$ngram}\t$ngram\n";
    }
}
close $fh_ngram;

With the resulting frequency lists I can already do a lot of cool things. For example, I know that shekulli means "century". Searching through the unigram list, I can obtain all its attested forms, i.e. with the different case endings, sorted by frequency (grep is the Unix command-line search tool and head restricts the output to the first 10 results):
$ grep "shekull" 1grams.csv | head
2886    shekullit
1605    shekullin
609     shekulli
165     shekull
146     shekullore
51      shekullor
31      shekullare
28      shekullar
17      shekullorë
14      shekullarë

I can then search through the bigram list to find all prepositions that occur with the genitive form shekullit:
$ grep " shekullit" 2grams.csv | head
1439    të shekullit
394     i shekullit
391     e shekullit
125     gjatë shekullit
104     te shekullit
48      rreth shekullit
42      prej shekullit
30      së shekullit
17      para shekullit
17      përket shekullit

On the other hand, the list is much shorter for the accusative form shekullin:
$ grep " shekullin" 2grams.csv
1155    në shekullin
86      ne shekullin
12      për shekullin

27 October 2014

Quality of the Albanian Wikipedia

While skimming a bit through my 10-million-word corpus from the Albanian Wikipedia, I noticed that quite often there were entire paragraphs in English and sometimes German. So I started systematically searching for common English and German words to find all such paragraphs and delete them from the corpus. This way I deleted 7,499 sentences with 144,309 words, or 1.35% of all words. That is only a very small portion of the corpus, but still surprisingly large, and removing it should definitely have improved the quality of the corpus: some short words are spelled the same in Albanian and in other languages (false friends) but have very different frequencies in a normal corpus of the respective language.
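A minimal sketch of such a filter in Python (the marker-word lists and the threshold here are only illustrative, not the ones I actually used):

```python
# Flag sentences that contain several common English/German function words.
# The marker lists and threshold are illustrative only.
ENGLISH_MARKERS = {"the", "and", "of", "with", "which"}
GERMAN_MARKERS = {"der", "die", "das", "und", "nicht"}
MARKERS = ENGLISH_MARKERS | GERMAN_MARKERS

def looks_foreign(sentence, threshold=2):
    """Return True if the sentence has at least `threshold` marker words."""
    hits = sum(1 for token in sentence.lower().split() if token in MARKERS)
    return hits >= threshold

sentences = [
    "Shqipëria është një shtet në Evropën Juglindore .",
    "The ammunition was produced during the war .",
]
kept = [s for s in sentences if not looks_foreign(s)]
print(len(kept))  # 1
```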

It seems that often articles (e.g. about ammunition) have been copied in their entirety from other editions of Wikipedia to the Albanian one, while other articles (e.g. about Kim Jong Il) appear in a strange mix of Albanian and English, sometimes switching in the middle of the sentence. This might also be a result of bad machine translation, so even the quality of Albanian sentences in such articles has to be questioned. However, I've left all Albanian sentences in the corpus because I'm not able to judge their quality yet and I just assume their proportion to be statistically insignificant for now. Interestingly, articles from foreign Wikipedias appear mostly in the following domains: history, military, technology and sport (especially from the German Wikipedia).

The updated corpus can be found on the Resources page. I'll probably make other slight adjustments in the future when I detect other quality issues and always update that page accordingly.

22 October 2014

My Wikipedia corpus

For my studies of Albanian I'm mainly using the Albanian Wikipedia corpus of about 54,000 articles. Wikimedia regularly publishes XML dumps of all editions of Wikipedia (and also Wiktionary and their other projects).

I don't need any of the markup or links included in Wikipedia articles, just the raw text. Therefore I used a Python script by bwbaugh (GitHub) to remove the wikicode. After that, I applied the following regular-expression-based preprocessing steps with some simple Perl scripts (that's what we're currently using in class) to remove most remaining special characters, markup and any parts that are likely not proper sentences, and to split all sentences onto separate lines:
  1. Original Albanian Wikipedia corpus: 12,394,464 words in 54,522 articles (dump from 23/09/2014).
  2. Lines without punctuation removed (/^[^<>.,;:!?]*$/): 12,086,474 words.
  3. Remove text between any kind of brackets (s/(\[[^][]*\]|\([^)(]*\)|\{[^}{]*\}|<[^><]*>)//g) and then stray brackets and quotation marks (s/([\p{Ps}\p{Pe}\p{Pi}\p{Pf}"]|&\w+;)//g): 10,876,531 words.
  4. Trim spaces (s/\s\s+/ /g), remove unnecessary spaces before punctuation (s/ (?=(?!['&])\p{Po})//g), trim double punctuation (s/(\p{Po})\1+/\1/g) and add missing spaces after punctuation (s/(\d(?!['&])\p{Po}\K(?=[\p{L}\p{S}]+ )|[ \p{L}\p{S}]+(?!['&])\p{Po}\K(?=\d)| [\p{L}\p{S}]+(?!['&])\p{Po}\K(?=[\p{L}\p{S}]+ ))/ /g): 111,813 spaces and 18,184 punctuation characters removed.
  5. Split sentences onto separate lines (s/[^\p{Lu}. ]{2,}(?<! (Prof|prof|shek))(?<! (Fig|fig|psh))(?<!  (dr|nr|fq|sh|dt|st|gj|mr|xh|ef|dh|pp|no|kl|zv|ft))[.:;!?]\K (?=([\p{Lu}\p{N}]))/\n/g): 619,931 sentences. I came up with that regex after some experiments; it splits under the following criteria and should split correctly for almost all sentences:
    • Punctuation (one of [.:;!?]),
    • preceded by at least two consecutive characters which are not capital letters or a full stop and ignoring a list of 2-4 letter abbreviations after which capital letters or numbers are used (I gathered these abbreviations in separate experiments),
    • followed by a space and either a capital letter or a number.
  6. Removed non-alphanumeric characters (s/^[^\p{L}\p{N}]+//g) and numbering of lists (s/^\p{N}+\. //g) at the beginning of a line: 22,980 characters removed.
  7. Lines with fewer than 5 words removed (/^([^ ]* ){4}.*$/): 10,653,790 words in 525,291 sentences.
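As an illustration, steps 3 and 7 could be sketched in Python as follows (Python's re module has no \p{Ps}-style Unicode classes, so the stray-character class is simplified here):

```python
import re

BRACKETED = r'\[[^][]*\]|\([^)(]*\)|\{[^}{]*\}|<[^><]*>'

def remove_bracketed(line):
    """Step 3: drop text between brackets, repeating for nested brackets,
    then remove stray brackets, quotation marks and HTML entities."""
    while re.search(BRACKETED, line):
        line = re.sub(BRACKETED, '', line)
    return re.sub(r'[][(){}<>"\u00ab\u00bb\u201c\u201d]|&\w+;', '', line)

def keep_line(line, min_words=5):
    """Step 7: keep only lines with at least min_words words."""
    return len(line.split()) >= min_words

line = 'Shqipëria (in English: Albania) është një shtet në Evropën Juglindore .'
cleaned = remove_bracketed(line)
```

The leftover double space after bracket removal is exactly what step 4 (trim spaces) takes care of, which is why the steps run in this order.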
Thus my final version of the corpus has more than 10 million words in more than 0.5 million sentences, with about 20 words per sentence on average. Feel free to download it from the Resources page and use it for your own projects!

Obviously Wikipedia is, despite its enormous size, a somewhat limited corpus. It encompasses articles on a very wide range of topics and therefore also contains a large number of words, but on the other hand, many words that are very common in conversation or literary texts, especially adjectives and certain types of verbs, will be quite rare. Furthermore, Wikipedia articles are rather technical and also favour specific types of grammatical constructions. But I think it's nevertheless the best and definitely the largest corpus I can find for this project and I'll see what I can do with it.

I've also started a Google spreadsheet where I'm noting words and bits of grammar whenever I learn them. Click here to watch what I'm learning!

19 October 2014

Experiment: Corpus-based language learning

Corpora (large collections of texts) are used everywhere in linguistics and also have many applications for language teaching and learning (see for example McEnery & Xiao, 2010: What corpora can offer in language teaching and learning). But a typical language learner never uses a corpus directly, only tools based on them, and I was wondering whether it would be possible to learn a language only from a large corpus by squeezing out all the information that millions of words can offer.

So in a sort of self-experiment I will try to learn as much Albanian as possible from only* the Albanian Wikipedia. (*) I might also listen to Albanian music, watch TV or read newspapers and books, but the main tool for learning vocabulary and grammar will be Wikipedia; I will not consult any grammars or dictionaries, nor translate Albanian text in any other way. Essentially this is a very extreme version of "learning a language like a baby" (although adults learn languages a lot faster than kids anyway), by input only, but getting the most out of this input with some "computational tricks".

Why do this?
Because it's fun and I like learning new languages. Also, especially since I started my Erasmus in France, we've been using corpora in several of my classes and learning natural language processing techniques to extract often fascinating information from them. Now I want to see what I can find out by applying these techniques to a completely foreign language.

Beautiful Albanian coast (Wikimedia Image)

Why Albanian?
I wanted to pick a language that is sufficiently distinct from any language I've ever studied, spoken in an area that I can relatively easily visit (i.e. in Europe), but still Indo-European (otherwise Basque could have been a good choice) and obviously with a decent-sized free corpus available. Being Indo-European will probably make this project slightly easier and allow me to make some very basic assumptions about the language, but nonetheless there should be more than enough challenges because Albanian split off from the other Indo-European subfamilies a long time ago and is now quite different.

Disclaimer: I know a tiny bit more than nothing in Albanian. The three words shqip (Albanian), faleminderit (thank you) and mirë (good). I can also say How are you? and I'm hungry, but I've no idea how to spell either. Anyway, I don't think this gives me a very big head start. This is what I can understand (green) or guess (yellow) from a random article from the Albanian newspaper Shekulli about a topic I'm completely unfamiliar with:
PRISHTINË - Senatori amerikan Christopher S. Murphy vazhdon vizitën e tij  (in) Kosovë, me takimet e zyrtarëve të qeverisë dhe shoqërisë (shock?) civile. Më pas senatori Murphy pritet të mbajë një fjalim me studentët e (of) Universitet Amerikan  (in) Kosovë.
Ditën e djeshme senatori amerikan Murphy u prit nga Presidentja e Kosovës Atifete Jahjaga, me të cilën u (with the aim of?) informua edhe për (him about?) situatën e (of) vështirë politike  (in) vend.
Gjithashtu Murphy ishte i ftuar edhe  (in) Ambasadën Amerike  (in) Kosovë, e cila (aim?) e mirëpriti (improve?) duke festuar  (in) një nga lokalet e (of) Prishtinës.
While this might already seem like a good portion of the text, I only found out that the senator visited Kosovo and whom he met there, but nothing at all about the reasons of this visit. If any Albanian speakers should be following this: Please don't give any hints about the meaning of the text or the correctness of my guesses. I shall find out myself soon enough.

18 October 2014



I guess I've finally become a blogger as well. I'm Enno from Germany and I'm studying Computational Linguistics and French at Trinity College Dublin, although I'm currently an Erasmus student at Université Stendhal in Grenoble until June 2015.

This blog will be about anything that interests me, mainly my language learning projects (I'm comfortable in five languages and have learned a few others to various degrees), natural language processing (NLP) or, as my next project, a combination of the two. Stay tuned!