30 May 2012

Authentics: long and short pt 6

Well it seems like I spend as much time writing up my figures as generating them, which is a good thing as it's when I'm writing that I'm most actively considering the implications of what I've done so far.
The next set of figures took very little time to generate, and gave me something to think about.  First I copied the code that identifies all types containing uppercase characters, and wrote a revised version that checks if the same type occurs elsewhere in all lowercase.  This gives me two sets of types -- those "not" in lowercase and those "never" in lowercase.  I wanted to examine the difference to see how this affects my previous results.
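As a rough illustration of the two sets described above (a minimal sketch, not the code actually used for these figures), the check might look like the following. The function name `uppercase_candidates` and the whitespace-split sample text are assumptions made for the example; the real figures came from NLTK-tokenised novels.

```python
def uppercase_candidates(tokens):
    """Return the "not lowercase" and "never lowercase" sets of types."""
    types = set(tokens)
    # "not lowercase": every type containing at least one uppercase character
    not_lower = {t for t in types if any(c.isupper() for c in t)}
    # "never lowercase": those whose all-lowercase form is not a type in the text
    never_lower = {t for t in not_lower if t.lower() not in types}
    return not_lower, never_lower


# Hypothetical sample; a plain split() stands in for a real tokeniser.
tokens = "The dog saw Hannay and the man saw London".split()
not_lower, never_lower = uppercase_candidates(tokens)
print(sorted(not_lower))    # ['Hannay', 'London', 'The']
print(sorted(never_lower))  # ['Hannay', 'London']
```

"The" drops out of the second set because "the" also occurs, which is exactly the sentence-initial case the revised version is meant to rescue.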
I then generated the following table for John Buchan's first three Richard Hannay novels (The 39 Steps, Greenmantle and Mr. Standfast):
                           The 39 Steps  Greenmantle  Mr Standfast
No of tokens                      44625       107409        140425
Tokens not in lowercase             919         1708          2262
Tokens never in lowercase           604         1193          1565

Excellent -- roughly two thirds (66% to 70%) of the types I've been ignoring are still ruled in as candidates, so my previous figures and their conclusions aren't necessarily invalid.

So, with the lists shrinking, I could get a closer look at them.  I've already pointed out the problem with "I" -- it's always in uppercase, so always a candidate for elimination.  I can hard-code it as an exception, but for now I want the code to be as general and language-agnostic as possible -- and besides, it's one word in thousands.

One other language-specific problem that comes up is the fact that nationalities and languages are capitalised in English, but not in other languages, and it's fair to say that most learners of English would be expected to know/learn words like "English", "Scottish" and "Irish", so it's not necessarily right to rule them out.  And of course they would also know "England", "Scotland" and "Ireland", so I'm no longer even sure that ruling out proper nouns leaves us with a more valid measure of difficulty for language learners.
Looking at the list also indicated a problem technology-wise: NLTK's word_tokenize leaves punctuation in with its tokens, so "end." is a different type from "end", and any word at the start of a passage of direct speech is likely to be considered a new type.  This skews my type:token ratios slightly, but I can't be sure whether it's statistically relevant.  But it's the question of direct speech that is making me think most.  Consider the real examples (from The 39 Steps) of "'Let" and "'Then".  You can be pretty sure that "let" and "then" will have occurred before this.  But notice that the words with the leading quotemark start with capitals, as direct speech in prose tends to do.  This means they're being eliminated.  So do I really need to account for this?  Is it statistically significant enough to worry about?
If I was doing this as real, serious research, I would need to write something to strip out appropriate punctuation (or, better, find something that someone else has written to do the same thing).
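For what it's worth, a first stab at that stripping step could be as small as the sketch below. It assumes plain ASCII punctuation (Python's `string.punctuation`), so curly quotes would need adding, and the helper names `strip_punctuation` and `normalise` are invented for the example. Internal apostrophes are left alone on purpose.

```python
import string


def strip_punctuation(token):
    """Remove leading and trailing ASCII punctuation; keep internal characters."""
    return token.strip(string.punctuation)


def normalise(tokens):
    """Strip every token and drop any that were nothing but punctuation."""
    stripped = (strip_punctuation(t) for t in tokens)
    return [t for t in stripped if t]


sample = ["'Let", "me", "see", "the", "end.", "'Then", "end", ")"]
print(normalise(sample))
# ['Let', 'me', 'see', 'the', 'end', 'Then', 'end']
```

This only tidies the tokens; whether "Let" and "Then" should then be lowercased before counting types is a separate decision.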

Anyway, so I decided to throw in another book and see how it came out.

In Pride and Prejudice by Jane Austen, only 49% of the "not lowercase" tokens were present in the "never lowercase" set.  Curious, I decided to expand my figures a bit...
                           The 39 Steps  Greenmantle  Mr Standfast  Pride & Prejudice
No of tokens                      44625       107409        140425             138348
No of types                        6461        11381         13904               8171
Tokens not in lowercase             919         1708          2262                669
Tokens never in lowercase           604         1193          1565                300

OK, that last row's something new: the "never:not" ratio was something I calculated to work out how inaccurate I'd been previously -- it's not something that offers any meaningful results as such -- so I wanted a measure that constitutes a goal in itself: "never:types" is the number of types in the "never lowercase" category as a proportion of all types.  It's intended as a rough estimate of the density of proper nouns in a text (still incorporating all the inaccuracies previously discussed). It's notable how both the never:not and never:types ratios are so consistent within the three Buchan novels, and yet so different from Austen's writing. You'd probably expect that, though -- personal style and genre affect this greatly (the Hannay novels involve a lot of travel, Pride and Prejudice is restricted in terms of characters and locations).
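The two ratios are simple enough to recompute from the counts in the table above. The sketch below does just that; the `books` dictionary and the `ratios` helper are names made up for the example, and the percentages it prints are recalculated from the table as given, so they may not match figures quoted elsewhere in the post exactly.

```python
# Counts copied from the table: (types, "not lowercase", "never lowercase")
books = {
    "The 39 Steps": (6461, 919, 604),
    "Greenmantle": (11381, 1708, 1193),
    "Mr Standfast": (13904, 2262, 1565),
    "Pride & Prejudice": (8171, 669, 300),
}


def ratios(types, not_lower, never_lower):
    """Return (never:not, never:types) as fractions between 0 and 1."""
    return never_lower / not_lower, never_lower / types


for title, counts in books.items():
    never_not, never_types = ratios(*counts)
    print(f"{title:<18} never:not {never_not:6.1%}   never:types {never_types:6.1%}")
```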

All this umming and ahhing over proper nouns is getting distracting.  For now, I'll proceed with the "never lowercase" types as a rough estimate.  I'm not really looking for accurate numbers anyway, just proportions, and this should be good enough for now.

I really need to move on and look at the question that was biggest in my mind when I started out: what's the difference between reading a series of books and reading multiple books by the same author?

So it's back to Project Gutenberg to find some likely candidates.  Émile Zola vs Victor Hugo, perhaps....


Anonymous said...

What exactly happens with the punctuation - do you (potentially) get two or six tokens for "end", "end.", "end,", "end;", "(end" and "end)"?

Nìall Beag said...

Hmmm... good question.

I ran a test with the following:
(End end) (end End) end. end; end: end,
It gave me 8 distinct tokens, but not quite what I was expecting:
['end', ')', '(', 'End', ';', ':', 'end,', 'end.']
So the full-stop and comma are incorporated into the token, but most other punctuation isn't.

So having done that, I took a look at quote-marks.
A single-quote at the start is included in the token, but one at the end is a separate token, eg:
'end end' -> 'end + end + '

So let's have a look at apostrophes in the middle... and it splits "it's" into "it" + "'s", and it even knows to split "don't" into "do" and "n't". (It seems to be switched to English by default.)

I can understand that there's some ambiguity in the full stop (eg Mr.) but even then, the end of a sentence is more common....