Ah well, it looks like it wasn't meant to be.
So I started working towards custom code to eliminate the proper nouns manually, something which would be handy in the future anyway. The first step was to identify some candidates for further inspection, and seeing as I'm working with English, that's pretty easy: if a token isn't all in lower case, there's something funny about it. I wrote the code to identify all the tokens (words) that contained capitals. Yes, at this point I could have checked whether each one was at the start of a sentence or not, but that wouldn't have really helped, because proper nouns occur at the start of sentences too, so I'd still need to check.
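My actual code isn't shown here, but the candidate-finding step can be sketched in a few lines. This is a minimal version: it assumes tokens are simple runs of letters and apostrophes (cruder than a real tokeniser) and flags any token containing an upper-case letter.

```python
import re

def capitalised_candidates(text):
    """Collect every distinct token that contains a capital letter.

    A sketch, not the post's actual code: tokens are taken to be runs
    of letters/apostrophes, which is rough but enough to flag candidates.
    """
    tokens = re.findall(r"[A-Za-z']+", text)
    return sorted({t for t in tokens if any(c.isupper() for c in t)})

print(capitalised_candidates("Richard Hannay left London. He took the train."))
# → ['Hannay', 'He', 'London', 'Richard']
```

Note that "He" gets flagged alongside the genuine proper nouns, which is exactly the sentence-initial problem described above.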
When I generated my set of candidates, though, it was a little long. For The 39 Steps, I was looking at 919 tokens to check manually, and that's a fairly short book. As I'm doing this for fun, it seemed like checking that many would be a little bit boring, particularly in longer books. (I later checked the candidate set for the 3 books in total, and it turned out to be over 3000 words, which is more than my time is worth.)
My first quick test, then, was to have a look at the difference in figures: before properly addressing the proper nouns, I wanted to see how big a difference this crude adjustment makes. Eliminating every single item with any capitals in it drops the type:token ratio in The 39 Steps from 14.48% to 13.14% -- that's almost a 10% drop (it's 1.35 percentage points, but it's 9.27 percent). That seemed just a little too high to realistically be led by proper nouns alone. But can that be? I mean, how many words are likely to occur only at the start of sentences?
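The percentage-point versus percent distinction above is worth making concrete. Here's a small sketch of the arithmetic, using the figures quoted for The 39 Steps (the tiny discrepancies from the numbers in the text come from starting with the already-rounded ratios):

```python
def type_token_ratio(tokens):
    """Type:token ratio as a percentage: distinct words over total words."""
    return 100 * len(set(tokens)) / len(tokens)

# Quick sanity check of the ratio itself on a toy sentence:
toy = "the cat sat on the mat".split()   # 5 types, 6 tokens
print(round(type_token_ratio(toy), 2))    # → 83.33

# The drop quoted above for The 39 Steps, from rounded inputs:
before, after = 14.48, 13.14
absolute_drop = before - after                    # percentage points
relative_drop = 100 * (before - after) / before   # percent of the original
print(round(absolute_drop, 2), round(relative_drop, 2))
# → 1.34 9.25  (the post's 1.35 / 9.27 come from the unrounded ratios)
```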
So on I went, hoping that the data I could generate at this stage would start to shed some light on this figure.
The first graph I produced showed me the running type:token ratios and introduction rates for both the full token set, and the token set with non-lowercase words eliminated:
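The graphs themselves aren't reproduced here, but the underlying computation can be sketched. This is one plausible scheme, not necessarily the exact one used: walk the text in fixed-size samples (500 words, per the graphs), tracking the cumulative type:token ratio at each step and the introduction rate, i.e. how many previously unseen types each sample brings in.

```python
def running_stats(tokens, sample_size=500):
    """Cumulative type:token ratio and per-sample introduction rate.

    A sketch under assumed definitions: 'running ratio' is cumulative
    types over cumulative tokens, and 'introduction rate' is the share
    of each sample's tokens that are brand-new types.
    """
    seen = set()
    ratios, intro_rates = [], []
    for start in range(0, len(tokens) - sample_size + 1, sample_size):
        sample = tokens[start:start + sample_size]
        new_types = set(sample) - seen
        seen |= set(sample)
        ratios.append(100 * len(seen) / (start + sample_size))
        intro_rates.append(100 * len(new_types) / sample_size)
    return ratios, intro_rates
```

Running this once on the full token set and once on the lowercase-only set gives the two pairs of lines being compared.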
another time). Here are the same graphs, but with 2000-word samples instead of 500-word samples:
Books as a series
But I had all the infrastructure in place now, so I figured I might as well rerun the analysis on the 3 books as a single body and see what came out. Let's just go straight to the relative difference between the lines for all words and eliminating all words not entirely in lower case:
Bingo: we've got decent support for Thrissel's suggestion that a lot of proper nouns are introduced early on in... at least some novels.
Not the sort of information I was originally looking for, but actually quite interesting. It's kind of turning the project in a slightly different direction than I had planned. I'll just have to go with the flow.
What I did wrong today
One of the minor irritations of the day came when I started writing up my results: after having done the coding, data generation and analysis, I realised there was a fairly simple refinement I could have made. It was a real *palmface* moment: I could have simply taken my first list of candidate proper nouns and eliminated any candidates that also appeared completely in lower case. Having done that, I would have been left with a much shorter list of candidates, and it may well have been worth my time manually checking the results.
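For what it's worth, the missed refinement is only a couple of lines. This sketch keeps a capitalised token as a proper-noun candidate only if its lower-case form never occurs on its own in the text (the function name is my own, not from the original code):

```python
def likely_proper_nouns(tokens):
    """Drop any capitalised candidate whose lower-case form also occurs.

    A sketch of the refinement described above: 'The' disappears because
    'the' is in the text, while a name like 'Hannay' survives.
    """
    lowercase_forms = {t for t in tokens if t.islower()}
    candidates = {t for t in tokens if any(c.isupper() for c in t)}
    return sorted(c for c in candidates if c.lower() not in lowercase_forms)

tokens = "The wind blew Hannay across the moor The moor was dark".split()
print(likely_proper_nouns(tokens))  # → ['Hannay']
```

It isn't perfect, of course: a proper noun that happens to share a spelling with a common word (say, a character named Hope) would be wrongly discarded, which is why the shortened list would still want a manual check.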
But of course, that's as much the point of the exercise as anything: to work through the process and the problems and to start thinking about what can be done better.
It also occurs to me now that I managed to eliminate every single occurrence of the word "I" from the books! Quite a fundamental error, even if it only made a minute difference to the final ratios.
Perhaps I'm being a little too "hacky" in all this. I'll have to pick up my game a bit soon...