Well, I had a nice weekend and visited some friends in Edinburgh for a wedding. The weather's too good to spend too much time inside, so I'll just write up a few more tests then go and enjoy the sunshine.
Multiple books in a series
Today's figures come from two of the books I've already mentioned -- John Buchan's The 39 Steps and Greenmantle, and the next book in the series: Mr Standfast.
Again, each sample size shows some parts of the dataset more clearly than others:
The 2500 and 5000 word samples give a clear spike at the end of the first book, but the end of the second book is slightly obscured by noise. It becomes clearer again at the 10000 and 25000 word sample sizes, although by the time we reach the 25000 word sample graph the end of the first book is completely lost.
Having done all that, I went back and verified that the peaks matched the word counts -- The 39 Steps is 44625 words long, and Greenmantle is 107409 words long, so it ends at the 152034th word of the combined text. The peaks on the graphs all occur shortly after 45000 and 150000.
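As a quick check of that arithmetic, here is a minimal Python sketch. The word counts are the ones quoted above; the helper sample_span is a hypothetical name, and it assumes fixed, non-overlapping samples counted from the first word of the combined text, which may not be exactly how the graphs were produced.

```python
from itertools import accumulate

# Word counts as quoted in the post, in series order.
word_counts = {"The 39 Steps": 44625, "Greenmantle": 107409}

# Running totals: position of each book's last word in the combined text.
boundaries = dict(zip(word_counts, accumulate(word_counts.values())))
print(boundaries)  # {'The 39 Steps': 44625, 'Greenmantle': 152034}


def sample_span(position, sample_size):
    """Return the 1-based (first, last) word positions of the sample
    that contains `position` (hypothetical helper, assumed sampling scheme)."""
    index = (position - 1) // sample_size
    return index * sample_size + 1, (index + 1) * sample_size


# Show which sample each book ending falls into at the graphed sample sizes.
for size in (2500, 5000, 10000, 25000):
    for title, end in boundaries.items():
        print(size, title, sample_span(end, size))
```

Under that assumption, the 2500 word sample containing the end of The 39 Steps runs up to word 45000, so the first sample drawn entirely from Greenmantle starts just after it.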
It was the first graph, from the 1000 word sample set, that piqued my curiosity. Having spotted the two lines intersecting at the start of the third book, I decided to check the difference between the running type:token ratio and the introduction rate, and I graphed that. At 1000 word samples, there was still too much noise:
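For anyone wanting to reproduce that difference, the calculation can be sketched roughly as below. This is only an illustration under assumed definitions: the running type:token ratio is taken to be distinct words seen so far divided by words read so far, and the introduction rate is taken to be the number of previously unseen words in a sample divided by the sample size. The function name ratio_gap is hypothetical, and words is assumed to be an already tokenised, case-normalised list.

```python
def ratio_gap(words, sample_size):
    """Return (end_position, running_ttr - introduction_rate) for each
    full sample. Definitions are assumptions, not the post's exact method."""
    seen = set()
    gaps = []
    # Step through the text in fixed, non-overlapping samples,
    # ignoring any partial sample left over at the end.
    for start in range(0, len(words) - sample_size + 1, sample_size):
        sample = words[start:start + sample_size]
        new_types = set(sample) - seen      # words not met before this sample
        seen |= new_types
        end = start + sample_size
        running_ttr = len(seen) / end       # cumulative types / cumulative tokens
        introduction_rate = len(new_types) / sample_size
        gaps.append((end, running_ttr - introduction_rate))
    return gaps
```

With equal-sized samples, the running ratio under these definitions is just the average of all introduction rates so far, so the two lines meet whenever a sample introduces new words at the average rate for the text up to that point. The first sample always gives a difference of zero.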
Problems with today's results
I can't rule out that the spikes in new language at the start of each book are heavily influenced by the volume of proper nouns, as Thrissel suggested, so I'm probably going to have to find a quick way of identifying them. The best way of doing this would be to write a script that finds all words with initial caps in the original stream, then asks me whether each one is a proper noun.
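A rough sketch of what such a script might look like, in Python. Everything here is hypothetical: the function names, the simple regular expression used to split words, and the choice to ask about the most frequent candidates first. It also makes no attempt to skip ordinary words that are only capitalised because they start a sentence, so those would still need a "no" at the prompt.

```python
import re
from collections import Counter

# Assumed tokeniser: runs of letters, with an optional apostrophe part.
WORD = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?")


def capitalised_words(text):
    """Count every word that appears with an initial capital letter."""
    return Counter(w for w in WORD.findall(text) if w[0].isupper())


def confirm_proper_nouns(candidates, ask=input):
    """Ask about each candidate, most frequent first, and return the set
    of words the user confirmed as proper nouns.

    `ask` defaults to the built-in input(); passing another callable
    makes the function usable without a keyboard, e.g. for testing.
    """
    proper = set()
    for word, count in candidates.most_common():
        answer = ask(f"{word} ({count} occurrences) - proper noun? [y/N] ")
        if answer.strip().lower().startswith("y"):
            proper.add(word)
    return proper
```

Calling confirm_proper_nouns(capitalised_words(text)) on the raw text of a book would then prompt once per distinct capitalised word, and the resulting set could be filtered out of the word stream before counting new vocabulary.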
By treating the three books as one continuous text in the analysis, it looks like I've inadvertently smoothed out the spike somewhat at the start of each book, since a sample that straddles a boundary mixes the end of one book with the start of the next. In future I should make sure each individual sample is taken from a single book so that the distinction is preserved.
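One way of doing that, sketched in Python: cut each book into samples separately, so that no sample contains words from two books. The name samples_per_book is hypothetical, books is assumed to be a list of (title, list of words) pairs in series order, and dropping the partial sample at the end of each book is just one possible choice.

```python
def samples_per_book(books, sample_size):
    """Yield (title, sample) pairs where every sample comes from one book.

    Any words left over at the end of a book that do not fill a whole
    sample are dropped (an assumption; they could also be kept as a
    short final sample).
    """
    for title, words in books:
        for start in range(0, len(words) - sample_size + 1, sample_size):
            yield title, words[start:start + sample_size]
```

With 2500 word samples, for example, The 39 Steps would give 17 full samples and leave its last 2125 words out, so the first Greenmantle sample would start cleanly at that book's first word.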