I was talking about the type:token ratio last time, and that seemed as good a place as any to start. I managed to skip the logical first step, which would have been to compare a novel and a short story, but I'll have to come back to that later.
What I started with was an Italian novel, but I got figures that were too high to be useful. One of the problems with languages such as Italian is that they write some of their clitics as part of the same written word as the main word (e.g. "to know (someone)" -> conoscere; "to know me" -> conoscermi), increasing the type:token ratio significantly. You've also got the problem that Italian has verb conjugations and drops subject pronouns in most situations. Overall Italian (and Spanish and Catalan, among others) would be a bad choice for a demonstration language. Today, I'm using English as it's a very isolating language -- the only common inflections are past-tense -ed, third-person-present -s and plural -s. This makes it easy to get a reasonably accurate measure of the lexical variety without any clever parsing. I will most likely use French at some point too, because while it is not as straightforward as English in that sense (it's got a lot of verb conjugation going on), it doesn't have the same clitics problem as Italian, and the French don't drop their pronouns.
Today's findings: 1 - running ratios
I decided to look at how the type:token ratio changes as a text proceeds. I wanted to measure this chapter by chapter, counting the types and tokens in chapter 1, then loading the second chapter into the concordance and checking the type:token ratio for chapters 1 & 2 combined, then 1, 2 & 3 etc. I realised, however, that it would be more efficient to load all chapters into memory at the same time and work down from the other end: all chapters, then close the last chapter and take the figures again, then close the second-last chapter and take the figures again.
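For anyone who would rather script this than click through a concordancer, here is a minimal sketch of the same running measurement. It is not what I actually ran: the tokeniser is a plain regular expression that lower-cases everything, and the names `tokenize` and `running_ratios` are just ones I've made up for illustration.

```python
import re

def tokenize(text):
    # Assumption: a token is a run of letters (with internal apostrophes),
    # lower-cased so that "That" and "that" count as one type.
    return re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())

def running_ratios(chapters):
    """Return the cumulative type:token ratio after each chapter."""
    seen = set()   # every type met so far
    tokens = 0     # every token met so far
    ratios = []
    for chapter in chapters:
        words = tokenize(chapter)
        seen.update(words)
        tokens += len(words)
        # Guard against an empty text so we never divide by zero.
        ratios.append(len(seen) / tokens if tokens else 0.0)
    return ratios

# Toy "chapters" standing in for real ones.
chapters = ["the cat sat on the mat", "the dog sat on the cat"]
print(running_ratios(chapters))  # 5/6 after chapter one, 6/12 after chapter two
```

Working forwards and keeping a single growing set gives the same figures as loading everything and closing chapters from the end; it just reads each chapter once.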
In the end, I got a nice little graph (using LibreOffice) that showed a marked tendency to decreasing type:token ratio as the books progressed:
So by one measure, the longer the novel is, the easier it would appear to be.
Today's findings: 2 - introduction rates
I figured I could go a bit deeper into this without generating any new data. What I wanted to look at now was how much new material was introduced in each chapter -- i.e. a ratio of new types to tokens. It's easy enough to do -- I could obtain the number of new types in any given chapter by subtracting the running total of types at the previous chapter from the running total at the current chapter.
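The subtraction can be sketched in a few lines. The assumptions here are mine: the denominator is the token count of the chapter itself (which is what makes the chapter-one figure match the running ratio), the function name `introduction_rates` is hypothetical, and the numbers are invented rather than taken from any of the books.

```python
def introduction_rates(running_types, running_tokens):
    """New types per token for each chapter, from cumulative totals."""
    rates = []
    prev_types = prev_tokens = 0
    for types, tokens in zip(running_types, running_tokens):
        new_types = types - prev_types        # types first seen in this chapter
        chapter_tokens = tokens - prev_tokens  # length of this chapter
        rates.append(new_types / chapter_tokens if chapter_tokens else 0.0)
        prev_types, prev_tokens = types, tokens
    return rates

# Invented running totals for a three-chapter text.
running_types = [500, 800, 950]
running_tokens = [2000, 4500, 6500]
print(introduction_rates(running_types, running_tokens))
```

With these made-up totals the rate falls from 500/2000 in the first chapter to 300/2500 in the second and 150/2000 in the third.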
The graph I got was even more interesting than the last:
One curious feature is the large uptick at the end of the children's novel Laddie (green). This illustrates one quirk that the learner should always bear in mind: kids' books are often actually more complicated linguistically than adults' books, as the author on some level seeks to educate or improve the person reading. The author of this book seems to have kept the language consistently simple through most of the book but then, on realising the end was coming, crammed in as much complexity as possible.
Another curious feature is that the figures claim no new vocabulary is introduced in the fourth chapter of The 39 Steps (yellow). While this is theoretically possible, it's more likely that it's ...ahem... experimenter error, which a quick look at the actual figures verifies: chapters 3 and 4 are listed in my output as being exactly the same length, which is more than a little unlikely. It looks like I loaded the same chapter twice...
Notice that in both graphs, the figures are the same at chapter one. This is to be expected, as every type encountered in the first chapter is encountered for the first time in the book (by definition).
So what happens if we stick the running ratio of type:token against the introduction rate of new types?
Perhaps the measure of efficiency is related to the difference between the running ratio and the introduction rate, and once that gap starts to narrow, there is no advantage?
Problems with today's findings
This was a first exploratory experiment, so I didn't conduct it with a whole lot of rigour. Here are the main factors affecting today's results:
- I didn't eliminate common words -- it is impossible to see from the figures I have how many of the types introduced at any stage are ones we would expect learners to know already and how many will be genuinely new to them.
- When examining Pride and Prejudice and The 39 Steps, I hadn't told the concordancer to ignore case, so anything appearing both at the start of a sentence and in the middle would be counted as two types -- e.g. that and That. (It was the first time I'd used TextSTAT and I hadn't realised it defaulted to case-sensitive -- I won't make that mistake again.)
- The length of chapters varies significantly from book to book and even from chapter to chapter within books, so the lines are not to scale with each other, and each individual line is not on a continuous scale with itself. The graphs, though presented as lines, are arguably not true line graphs, as the points come from samples taken at arbitrary intervals.
None of these should be hard to fix:
- There are plenty of frequency lists on the net, so I'll be able to eliminate common words without any real difficulty.
- The case sensitivity issue, now that I'm aware of it, will not be a problem.
- When I ran the initial data, I was using TextSTAT as my installation of Python and NLTK was playing up (I had too many different versions of Python installed, and some of the shared libraries were conflicting). I've now got Python to load NLTK without problems, so I can do almost any query I want. Future queries will be sampled regularly after a specific number of words.
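To show how much the first two fixes can move the numbers, here is a rough sketch that counts types three ways. `COMMON_WORDS` is a tiny hypothetical stand-in for a real frequency list, `count_types` is a name I've invented, and the tokeniser is a plain regular expression rather than anything from TextSTAT or NLTK.

```python
import re

# Hypothetical stand-in for a real frequency list of words learners know.
COMMON_WORDS = frozenset({"the", "that", "a", "of", "and", "is"})

def count_types(text, ignore_case=True, exclude=frozenset()):
    """Count distinct word forms, optionally folding case and
    dropping any word found in the `exclude` list."""
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)*", text)
    if ignore_case:
        words = [w.lower() for w in words]
    # The exclusion test always ignores case, whatever ignore_case says.
    return len({w for w in words if w.lower() not in exclude})

sample = "That is the house. I know that the house is old."
print(count_types(sample, ignore_case=False))     # 8: "That" and "that" both counted
print(count_types(sample))                        # 7: case folded
print(count_types(sample, exclude=COMMON_WORDS))  # 4: house, i, know, old
```

Even in two short sentences the type count halves once case is ignored and the commonest words are dropped, so the figures for Pride and Prejudice and The 39 Steps are probably inflated by a fair margin.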
At some point I'm going to want to go back and compare short stories with novels, but for now I'm going to head a little further down the path I'm on.
My first task is to work out a decent sampling interval: every 1000 words? 5000? 10,000? 50,000? I'll run a few trials and see what my gut reaction is -- that should be the next post. (It might even prove that the chapter is the logical division anyway -- after all, it divides subjects, which would indicate different semantic domains...)
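As a rough idea of what those trials might look like, here is a sketch that takes the running type:token ratio every so many words instead of at chapter breaks. The function name `sampled_ratios` and the regular-expression tokeniser are my own assumptions, and the toy text uses intervals of 2 and 5 only so the output is easy to check by eye.

```python
import re

def sampled_ratios(text, interval):
    """Return (token_count, running type:token ratio) pairs,
    sampled every `interval` tokens."""
    words = re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())
    seen = set()
    samples = []
    for i, word in enumerate(words, start=1):
        seen.add(word)
        if i % interval == 0:
            samples.append((i, len(seen) / i))
    return samples

text = "one two three one two four one five six one"
for interval in (2, 5):
    print(interval, sampled_ratios(text, interval))
```

Any words left over after the last full interval are ignored here, so a real run would need to decide whether to add a final sample at the end of the book.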
I also want to look at what happens when we look at sequels after each other. Those of you familiar with John Buchan will notice that I've included such a pair as individual novels here -- The 39 Steps and Greenmantle. I might include initial findings from this next time, as they'll determine my next step.
After this I'll either move on to looking at more pairs of original book + sequel (to look for a generalisable pattern), looking at longer series of books (to see if they get continually easier) or comparing book-and-sequel to two different books from the same author (to see if any perceived benefits from reading a book and its sequel are just coincidence and really only because of the author).
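One simple way to put a number on the sequel question is to ask what share of the second book's types the reader has already met in the first. This is only a sketch of one possible measure, not a method I've settled on; `types_of` and `sequel_overlap` are invented names and the two "books" are toy sentences.

```python
import re

def types_of(text):
    # Assumption: types are lower-cased runs of letters.
    return set(re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower()))

def sequel_overlap(first_book, second_book):
    """Share of the second book's types already seen in the first."""
    first, second = types_of(first_book), types_of(second_book)
    if not second:
        return 0.0
    return len(second & first) / len(second)

first_book = "the spy ran across the moor"
sequel = "the spy crossed the desert"
print(sequel_overlap(first_book, sequel))  # 0.5: "the" and "spy" out of four types
```

The same function would do for the control comparison: feed it two unrelated books by the same author and see whether the overlap is any lower than for a book and its sequel.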
Remember, though, that this little study is never going to be scientifically rigorous, as I don't really currently have the time to deal with the volume of data required to make it truly representative. However, it's nice to think how big a job this would have been before computers made this sort of research accessible to the hobbyist. Many thanks to the guys who wrote the various tools I'm using -- your work is genuinely appreciated.