Anyone who gets sufficiently far on in learning a language is going to want to start reading, watching or listening to materials intended for natives. This is what a lot of language teachers refer to as "authentic materials". (Now, an alarming amount of so-called authentic materials in the classroom are actually very heavily doctored, but that's not the sort of authentic materials I'm going to talk about today.)
I have often claimed that it is better for the learner to use longer materials than short materials (e.g. when discussing films vs TV series in a previous post). This wasn't an idea I came up with myself, but advice I'd been given when I was at high school, although I can't remember who first suggested it to me.
Anyway, I was told at the time that the first one or two hundred pages of a novel contain most of the language (in terms of grammar, vocabulary and turn-of-phrase) that will occur in the entire book. It follows that the first two hundred pages of any work are the most difficult, and that the longer the book is, the easier the ending will be, because you won't be confused by the language. This also means that the book is acting as active revision, and that by the end of the book, you will have learnt most of the major vocabulary in it.
A 50 page short story would intuitively sound easier to read than a novel, but this isn't really the case, because you're dealing with something that is going to be littered with new words on every single page.
And what about a piece of flash fiction? Realistically, we're not going to expect much repetition at all. Compare with the short extracts of authentic works printed in many classroom language textbooks -- none of the "content words" that are specific to the story are likely to be repeated at all, so they will be looked up by the reader, then promptly forgotten about.
Although I was told this about reading books (as opposed to short stories), I believe this holds for any form of literature, fiction or non-fiction, regardless of medium.
A half-hour documentary will be self-reinforcing in a way that a 4 minute news report on the same topic won't be (on TV or radio). Similarly, an 8 hour long TV series will reinforce its language more than an 80 minute feature film, and particularly more than a 10 minute short film.
Anyway, I've been repeating this advice for years, and I've always said that my experience backs it up. Well, in the little gap between finishing my Gaelic course and starting my next job, I was wanting to do a little work with corpus analysis software and it occurred to me that this would be a great little exercise to get me back into the swing of things, so I downloaded several resources: TextSTAT, a concordancer package written in Python at the Free University of Berlin; AntConc, a Linux/Mac/Windows concordancer by Laurence Anthony at Waseda University in Japan; and the Natural Language Toolkit for Python, which will allow me to write more flexible, custom queries on my data.
One of the most basic statistical measures of diversity in a text is the so-called "type:token ratio". The number of "tokens" in a text is the number of individual words, the number of "types" is the number of different word forms.
For example, the phrase "the cat chased the dog" has 5 tokens, but only 4 types, because "the" is only counted once when determining the number of types.
Or again, "I told the woman to tell the man to tell you" has 11 tokens (11 words in the sentence), but as "the", "to" and "tell" occur twice each, there are only 8 types in the sentence.
The type:token ratio is exactly what you'd expect if you're at all familiar with statistics: the number of types divided by the number of tokens. In the first example, we have 4:5 = 4/5 = 0.8 (or 80%) and in the second we have 8:11 = 0.727272... (roughly 73%).
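To make the arithmetic concrete, here is a minimal Python sketch of the calculation. The function names are my own, and the tokenizer is deliberately naive (lowercase, split on whitespace, no punctuation handling or lemmatisation), so treat it as an illustration of the idea, not as how TextSTAT, AntConc or NLTK would count.

```python
def tokenize(text):
    """Naive tokenizer (an assumption): lowercase and split on whitespace."""
    return text.lower().split()


def type_token_ratio(text):
    """Number of distinct word forms divided by the total number of words."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0  # avoid dividing by zero on empty input
    return len(set(tokens)) / len(tokens)


for sentence in (
    "the cat chased the dog",
    "I told the woman to tell the man to tell you",
):
    tokens = tokenize(sentence)
    # Prints the token count, the type count and the ratio for each sentence.
    print(len(tokens), len(set(tokens)), round(type_token_ratio(sentence), 3))
```

On the two example sentences this should give 5 tokens and 4 types, then 11 tokens and 8 types, matching the figures worked out by hand.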
Notice how the type:token ratio on the longer sentence is lower than that on the shorter sentence -- in this case it's a matter of my choice of words, but as a general rule, type:token ratio decreases with the length of text examined, which only goes to justify the advice of favouring long-form over short-form materials for the learner.
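One rough way to watch that trend is to recompute the ratio at regular intervals as more of the text is read. The sketch below does this with a hypothetical helper, running_ttr, on a toy token list I made up for illustration; it assumes the text has already been split into tokens. On real text the ratio can wobble upwards briefly when a run of new words appears, so the decrease is a tendency and not a guarantee.

```python
def running_ttr(tokens, step):
    """Return (tokens_read, ratio) pairs, measured every `step` tokens."""
    seen = set()    # word forms encountered so far
    points = []
    for count, token in enumerate(tokens, start=1):
        seen.add(token)
        if count % step == 0:
            points.append((count, len(seen) / count))
    return points


# Toy data: the second half reuses only words from the first half.
sample = "the cat chased the dog the dog chased the cat".split()
for count, ratio in running_ttr(sample, step=5):
    print(count, ratio)
```

Here the ratio is 0.8 after the first five tokens and 0.4 after all ten, because the second half adds no new word forms.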
However, that's still to be proven in practice.
So over the next few weeks, I'll be experimenting with a bunch of public domain texts from Project Gutenberg. I'll be trying to investigate the basic premise of whether long-form fiction is intrinsically easier than short-form, then investigating whether this extends to reading several books by the same author as opposed to books by different authors, and how much of a difference it makes whether these books are part of a series or individual stories.
The size of this study is going to be very small, as the main goal for me is simply to gain a better understanding of the technology and to reason through the process of designing logically sound research in a corpus, so the conclusions won't be scientific proof of anything, but they will hopefully be interesting (to me at least).
If you're aware of any research that covers the areas I'm looking at, please feel free to drop a reference in the comments, and if you have anything to add or suggest, I'm all ears.