28 May 2012

Authentics: long v short - pt 4

Well, I had a nice weekend and visited some friends in Edinburgh for a wedding.  The weather's too good to spend too much time inside, so I'll just write up a few more tests, then go and enjoy the sunshine.

Multiple books in a series
Today's figures come from two of the books I've already mentioned -- John Buchan's The 39 Steps and Greenmantle, and the next book in the series: Mr Standfast.

Again, the graphs at different sample sizes show different parts of the dataset more clearly than others:
The graph at 1000 words is too unstable to clearly identify the end of the first book, and the start of the second book is only identifiable because the introduction rate is greater than the running ratio for the first time in any of my tests.

The 2500 and 5000 word samples give us a clear spike for the end of the first book, but the end of the second book is obscured slightly by noise.  It becomes clearer again at the 10000 and 25000 word sample sizes, although the end of the first book is completely lost by the time we reach the 25000 word sample graph.

Having done all that, I went back and verified the peaks matched the word counts -- The 39 Steps is 44625 words long, and Greenmantle is 107409 words long, so it ends with the 152034th word.  The peaks on the graphs all occur shortly after 45000 and 150000.

It was the first graph, from the 1000 word sample set, that piqued my curiosity.  Having spotted the two lines intersecting at the start of the third book, I decided to check the difference between the running type:token ratio and the introduction rate, and I graphed that.  At 1000 word samples, there was still too much noise:
However, given that I already knew what I was looking for, I could tell that it showed useful trends, and even just moving up to 2500 word samples made the trends pretty clear:
Going forward, I need to compare the difference between books in a series, books by the same author (but not in a series) and unrelated books, and I believe that the difference between the running type:token ratio and the introduction rate may be the best metric to use in comparing the three classes.
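For anyone wanting to play along at home, the metric can be sketched in a few lines of Python.  This is my own reconstruction from the description above, assuming simple whitespace/regex tokenisation and case-folding; the function and variable names are made up for illustration:

```python
# Sketch of the metric: for each consecutive fixed-size sample, compute
# the running type:token ratio (unique types so far / tokens so far),
# the introduction rate (types new to this sample / sample size), and
# the difference between the two.
def ttr_vs_introduction(tokens, sample_size=2500):
    seen = set()
    results = []
    for start in range(0, len(tokens), sample_size):
        sample = [t.lower() for t in tokens[start:start + sample_size]]
        new_types = set(sample) - seen
        seen |= set(sample)
        running_ttr = len(seen) / (start + len(sample))
        intro_rate = len(new_types) / len(sample)
        results.append((running_ttr, intro_rate, running_ttr - intro_rate))
    return results
```

Graphing the third element of each tuple against position in the text gives the "difference" line described above; a sharp dip towards (or below) zero marks a burst of new vocabulary, such as the start of a new book.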

Problems with today's results
I can't rule out that the spikes in new language at the start of each book are heavily influenced by the volume of proper nouns, as Thrissel suggested, so I'm probably going to have to make an attempt at finding a quick way of identifying them.  The best way of doing this would be to write a script that identifies all words with initial caps in the original stream, then asks me if these are proper nouns or not.
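A rough sketch of what that helper script might look like, assuming Python: collect words that only ever appear with an initial capital (a word that also occurs in lower case elsewhere is probably just sentence-initial), then ask the user to confirm each candidate.  The function names and the filtering heuristic are my own assumptions, not the author's actual script:

```python
import re

def candidate_proper_nouns(text):
    """Return words that only ever appear with an initial capital."""
    words = re.findall(r"[A-Za-z']+", text)
    capped = {w for w in words if w[0].isupper()}
    lowered = {w.lower() for w in words if w[0].islower()}
    # Keep only words never seen in lower case anywhere in the text.
    return sorted(w for w in capped if w.lower() not in lowered)

def confirm_proper_nouns(text):
    """Interactively ask the user to confirm each candidate."""
    confirmed = []
    for word in candidate_proper_nouns(text):
        if input(f"Is '{word}' a proper noun? [y/n] ").strip().lower() == "y":
            confirmed.append(word)
    return confirmed
```

Sentence-initial words that never recur in lower case (like "The" in a very short text) would still slip through, which is exactly why the interactive confirmation step is needed.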

By treating the three books as one continuous text in the analysis, it looks like I've inadvertently smoothed out the spike somewhat at the start of each book.  In future I should make sure individual samples are taken from one book at a time so that the distinction is preserved.


Bob Blackburn said...

This is a very interesting series. Thank you for the analysis.

Sometimes my math and computer background gets in the way of language learning. So it is nice to see a direct application to language learning.

Nìall Beag said...

Never let yourself believe it gets in the way! There's nothing I find more useful than analytical thought when I'm trying to work out what something means and why. If logic seems to get in the way, just take it to another level of abstraction...

Anyway, I'm glad you're finding it interesting. I'm really enjoying it myself, as it's making me ask questions I never would have really thought of otherwise.

Ironically, I started this when I finally finished university for what is almost definitely the last time, and suddenly it reminded me of what I really liked about studying the first time round -- exploring ideas; seeing how things work; tinkering with settings until something useful or interesting or even unexpected happens.

Speaking of interesting things, I have another post to write....

Bob Blackburn said...

Thanks Niall.

I think part of it is trying to be a perfectionist. Maybe that is why I don't mind studying grammar. It breaks things down to basic components and builds them back up to form the whole. Much like programming.