Anyway, as I said last time, I wanted to start comparing fixed-length samples of text, rather than variable-length chapters, as my benchmark. I was looking for a sampling length that would give a clear picture of the overall progression without too much interference from little local fluctuations. My first set of results suggests that this is a fool's errand. The following set of images shows the graphs for the novel Greenmantle by John Buchan, with samples taken every 1000, 2500, 5000 and 10000 words.
Alternatively, I could move away from linear sampling/projections and start charting the data logarithmically or exponentially. Now would be a good time to start refreshing my memory on that sort of statistical analysis, but it also risks diverting me from the task at hand. I'm currently following the Coursera.org machine learning course, so I should be able to get the computer to do the work itself in a few weeks anyway. Besides, I've still not got myself a high-frequency word list, and the pattern might be completely different once I've eliminated the common words of English from the equation.
So for now I'll stick to working with multiple sample sizes. I'll admit to being a bit simplistic in my approach to this so far, as I ran my little Python program once for every sample size, rather than just running it once with the smallest sample size then resampling the data.
The program I'm using at the moment is pretty straightforward:
def collect_stats (token_list, step_size):
    ...
    i += step_size
    running_types_total = number_of_types (token_list[:i])
    ...

This takes an NLTK token list (it would work with any simple list of strings too, though) and the size of samples to be taken, then builds up a list of lists [[a1,b1],[a2,b2],...] where each a is the number of the last word included in the sample, and each b is the number of unique tokens from the beginning of the text to the a-th word.
The number_of_types function just returns len(set(w.lower() for w in token_list)).
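For anyone wanting to try this, here is a minimal sketch of how the whole thing might fit together. Only the function names, the two loop lines and the body of number_of_types come from my program as quoted above. The rest is an assumption made for illustration: the while loop, clamping the last sample to the end of the text, and the step_size check.

```python
def number_of_types(token_list):
    """Count the distinct tokens in token_list, ignoring case."""
    return len(set(w.lower() for w in token_list))


def collect_stats(token_list, step_size):
    """Return [[a1, b1], [a2, b2], ...], where a is the number of the last
    word in the sample and b is the number of types in the first a words."""
    if step_size < 1:
        raise ValueError("step_size must be at least 1")
    stats = []
    i = 0
    while i < len(token_list):
        # Assumption: the final, shorter sample is clamped to the text length.
        i = min(i + step_size, len(token_list))
        running_types_total = number_of_types(token_list[:i])
        stats.append([i, running_types_total])
    return stats


# Toy input; a real run would pass an NLTK token list instead.
tokens = "The cat sat on the mat and the dog sat too".split()
sample_stats = collect_stats(tokens, 4)
print(sample_stats)  # [[4, 4], [8, 6], [11, 8]]
```

Rebuilding the set from the start of the text at every step is wasteful on a long novel. Keeping one set and adding each new slice to it would give the same totals with far less work.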
This means that at every stage I have a running total of unique tokens (types), and it's only when I want to produce a graph that I calculate the number of new types in a given slice (= b(n) - b(n-1)). There's therefore no reason why I can't skip several slices to decrease my sampling rate (e.g. b(n) - b(n-3)).
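A rough sketch of that resampling idea, assuming the stats are in the [[a, b], ...] form described above. The function name new_types_per_slice and its skip parameter are hypothetical, and the numbers in the example are invented purely for illustration; they are not results from any of the novels.

```python
def new_types_per_slice(stats, skip=1):
    """Turn running totals [[a, b], ...] into [[a, new_types], ...],
    keeping only every skip-th sample (skip=3 gives b(n) - b(n-3))."""
    if skip < 1:
        raise ValueError("skip must be at least 1")
    kept = stats[skip - 1::skip]  # every skip-th running total
    result = []
    previous_total = 0  # no types have been seen before the first sample
    for last_word, running_total in kept:
        result.append([last_word, running_total - previous_total])
        previous_total = running_total
    return result


# Invented running totals at a step size of 1000 words.
stats_1000 = [[1000, 400], [2000, 650], [3000, 820],
              [4000, 950], [5000, 1050], [6000, 1130]]

print(new_types_per_slice(stats_1000))     # new types per 1000 words
print(new_types_per_slice(stats_1000, 3))  # [[3000, 820], [6000, 310]]
```

Any samples left over at the end that don't make up a full group of skip slices are simply dropped here, which may or may not be what you want for the last point on the graph.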
I've taken a running sample of three books from the same series -- The 39 Steps, Greenmantle and Mr Standfast, and run them through as one text, so I'll look at the output of that next, but I don't think it'll be much use until I've got something to compare with -- either/both of: a selection of novels by one author that aren't a series; and a selection of novels by different authors.