In 2004, I signed up for a degree with the Open University in Modern Language Studies. As the OU issues degrees in line with the English 3-year system, and as I was studying part-time, it was always going to take me at least 6 years to complete it. At the time, I was planning on becoming a high school language teacher, so I had a look at the guidelines for teaching in Scotland, which said you had to live and work in a country where your languages were spoken for a year (your second language could be 6 months, but you'd have to make it up to a year to gain full registration).
This means that from the time I started, I really had at least 8 years until I could hope to enter teacher training. In that time, I reckon the rules have changed significantly 4 times, meaning that they would have changed at least once for anyone in a standard Scottish 4-year degree. For someone trying to plan out a career, this is pretty disappointing.
As I recall it, the rules when I started mandated a full year's worth of learning in your given subject (120 SCOTCAT points), with at least half of that being at degree level (Scottish 3rd and 4th year, English 2nd and 3rd). Quite soon after I started, the requirements were reduced to two-thirds of a year (80 points) with at least half of it being Scottish 2nd year or above. The next change was halving the foreign residency requirements for language teachers.
These changes were all made to increase the number of applicants for teacher training, as the Scottish Government was worried about a crisis when a large number of current teachers retired.
But the numbers are up, and there's been a surplus of teachers in many subjects (the older guys are holding off retiring to build a bigger pension pot) so they've changed the rules again.
Before entering a PGDE, you now need to have 120 points again. At least 40 of them should be 3rd year undergrad or above, and another 40 should be 2nd year undergrad or above. 80 of those points must be in the subject and 40 can be on a "related" subject, whatever that means.
But...
I just spent a year at the Sabhal Mòr on Skye, doing the second year of the Gaelic-medium degree scheme so that I would be able to go into Gaelic teaching, and now, all of a sudden, it's no longer valid. If there'd been some kind of advance warning, I could have studied a mixture of second and third year courses, and brought myself up to the required standard in the same timeframe (and I would have probably enjoyed it more than the course I did, to boot!)
I've still got to get the French residency requirement covered (which my forthcoming year in Corsica will deal with) before I can go back and do a PGDE. But the plan was to do a PGDE in Modern Foreign Languages (French) and Gaelic, and as it stands, I'm not going to be eligible. Even if I switch to just MFL (French and Spanish), I still need to spend an additional 3 months working in a Spanish-speaking country in order to get teacher registration for Spanish.
Would it be too much to ask for changes to be announced a year or two in advance so that people like myself could attempt to plan their careers? As it stands, I'm left wondering whether to abandon the Gaelic or to start taking distance modules in order to make up the shortfall in degree-level credits.
But I really wanted to be finished with study. Since I started primary school, there's only been three-and-a-half days of my life where I haven't been in formal education. Any time I like, I can claim my third degree.
It's a bit of a disappointment.
28 July 2012
27 July 2012
Sorry for going quiet
For the last month or so, I've been living on Islay, working on a summer placement at the Gaelic college. I've just found out that I've been accepted as an English teacher at the University of Corsica. It's a pretty junior post, but I'm looking forward to getting out there, and walking, climbing, swimming and cycling... not to mention learning one of the lesser-studied Romance languages (Corsican) while improving my command of one of the more widely studied ones (French).
It's also not really a full-time post, so it'll give me time to start trying to build my Skype client base, and maybe even to break into translation. It should also leave me with time to work on a wee language-related programming project I've started on.
And whatever else happens, I'll be in Corsica.
It's going to be a good year.
09 June 2012
The Names.... language change in action.
I just read an article on the BBC website about the use of the definite article in the name Ukraine/The Ukraine.
It was quite interesting and raised several good points. (Although it listed a lot of "the" places that have recently lost the article in common speech.)
One thing that did bug me, though, was a little S-shaped oversight: they completely missed the point that all explicitly plural proper nouns need "the". "The Netherlands", "The Philippines", "The Bahamas".
Consider how you would refer to someone by his surname only... for example the western Alias Smith and Jones. Now if you want to use the family name in the plural, you need the definite article, e.g. "keeping up with the Joneses".
The argument in the article about the fact that "the Netherlands" is made up of readily-understandable elements doesn't really hold up; there are many English-language placenames in the UK that are made of generic elements but don't take the article, like the multiple places called "Bridgend", or places like "Holyhead". The archetype, though, has to be "Land's End". It is the single most meaningful placename in the whole of the archipelago -- it's iconic and valuable because its meaning is abundantly clear. Yet we do not use the article. We used to, certainly (see the Wikipedia entry for Land's End for evidence). The historical origins are interesting, definitely, and certain classes of placenames do preserve old patterns, but language change is a subtle beast, and sometimes it isn't the form that changes, but the reasons speakers have internally for using that form.
Also, looking at the Bahamas and the Philippines, it should be noted that historically we didn't always name island groups in the plural, particularly in Scotland. Conservative natives of Uist, Orkney and Shetland will still refer to their homes as such, while outsiders are likely to call them the Uists, the Orkneys and the Shetlands. (I would personally be very surprised if the English kings bent on conquering Scotland and Ireland said they wanted complete control of "the Britains" rather than simply "Britain"; this looks like a pretty new feature to me, and I suspect it may be to do with the borrowing of French and Spanish names for new island groups in colonial times.)
The modern speaker of a modern language has no internal knowledge of the language change -- if a language encoded all its history, languages would be so "big" that they would be impossible to learn. Instead, every generation observes what the generation before says and tries to work out for themselves why they say it.
Now I know someone's going to mention shops and companies as a counter-example, because the supermarket chain is Morrisons, but that's actually a possessive. It's Morrison's supermarket, after all. Heck, when I was a child, I used to append 's to practically every single-word shop name, as did my parents. Tesco's, Bejam's, M&M's, etc. Somehow Comet got an exception, and Fine Fayre was left as-is because it was two words. Oh, and look at that: a two-word generic with no definite article -- another disproof of the BBC's claim.
But you're right that Morrisons has no apostrophe these days, and neither does Greggs. This is another example of language change. As more and more shops drop the apostrophe in order to have their brand match their website domain name (Sainsbury's are a rare case of defiance -- how long that'll last now that the Sainsbury family aren't the main shareholders is anyone's guess), and as we get more exposed to Tesco as the official name rather than Tesco's, the next generation will grow up without the cues that it's a possessive name. Even though I will hear "Greggs" as "Gregg's", they won't -- in fact, it's entirely possible that everyone under 16 already doesn't recognise it as a possessive form... and yet we speak the "same" language.
When that generation reaches their thirties and is editing and delivering our daily news, things might change, because a generation that is entirely comfortable and happy with an S at the end of its proper nouns will happily say "Netherlands" instead of "the Netherlands".
07 June 2012
Talent Schmalent
Those of you based in the UK might remember the Channel 4 series Faking It, where a member of the public would be intensively trained over the course of 4 weeks to try to "fake it" as a professional in a sphere they had never been in, but that was loosely related to their day-to-day lives. A burger-van operator turned cordon bleu chef, a punk singer made into a classical conductor. OK, so their skills weren't always entirely generalisable -- the conductor would struggle to conduct anything other than the two pieces he'd practised, for instance -- but it was still an amazing demonstration of what an average member of the public could achieve... albeit with a more expensive regime than an average member of the public could normally afford: 24-hour-a-day company from people within the field.
Six years after the programme stopped filming, the format has been resurrected in a slightly altered form by another studio, as Hidden Talents. Instead of finding interesting individuals and training them up, this series starts off with the potentially pseudo-scientific notion of taking hundreds of applicants, putting them through a series of exams to discover their "hidden talents", and then picking the best ones to show on the programme.
I came across this as one episode has been mentioned on a couple of language websites, what with it being based on a "hidden talent" for languages. Now I'm not convinced that they stuck to the exam results, because the guy they finally chose had a particularly telly-friendly back story -- he left high school without doing any A-levels and was living in a homeless shelter. Of course I'm not saying that he wasn't capable or that he didn't get high marks on the exam (he probably did), but I just find it hard to believe that they didn't play a little fast and loose with the figures to get the guy they wanted on screen.
Now I've seen a couple of language aptitude tests in the past, and I'm not particularly impressed. As with all tests, they can only test your current level of knowledge, not really your ability to be taught. The most thorough language tests will try to get you to deal with concepts like conjugation, declension and word order without explanation. So a good result says you'll pick up the initial concepts pretty quickly. So what? Does saving half an hour at the start of the course make that much difference in the long term? Is the language test putting off people who would actually do just as well in the long run as those who pass? It's impossible to say, because for the most part, the people who run these tests only have data for those who passed in the first place. (If anyone knows of any blind study that's given empirical evidence for language aptitude batteries, I'd be very interested, but I doubt any exist.)
This is OK if you're running the US Defense Language Institute, where the number of applicants vastly outstrips the number of places available as they can afford to turn lots of possibly good candidates away -- heck, they really have to turn lots of possibly good candidates away.
It's also OK if you're producing a programme like Hidden Talents, because you only need one.
However, it's a horrible message to be sending out -- that people have specific talents. On the surface it seems like a positive message (when everything goes wrong, it's just that you haven't found your talent) but it's actually pretty corrosive. How many people give up on languages and say "I haven't got the head for languages" or "I'm no good at languages"? People genuinely believe that they are inherently incapable of learning languages, with no real evidence, and it gives them an excuse to give up.
If talents exist, we still have no way of genuinely identifying them. Furthermore, these talent characteristics are minuscule compared to the potential of education. A 16-year-old coming out of a 21st-century school knows almost as much about the world around them as some of the top scholars of the ancient world, and that's all down to education. As the old phrase has it, most of us are nothing more than "dwarfs standing on the shoulders of giants".
Isaac Newton did not invent this phrase -- here's the original citation for the phrase (taken from Wikipedia):

Bernard of Chartres used to say that we are like dwarfs on the shoulders of giants, so that we can see more than they, and things at a greater distance, not by virtue of any sharpness of sight on our part, or any physical distinction, but because we are carried high and raised up by their giant size.
Now, whenever I say there's no such thing as talent, someone always mentions sportsmen and physical pursuits. I could claim it's a bad analogy, but actually, I don't think it is.
International competition sports can only be won by one individual (or team) out of the billions in the world, so yes, the most physically gifted generally win... if the training and equipment is equal.
But what if the training and equipment isn't equal?
I apologise for repeating myself, but I've used the example of marathons in a previous post. The reason I'm repeating myself, though, is that it was in the comments section, so people may well have missed it.
The marathon: one of the great challenges of distance running. And yet there are now people who run the length of six marathons in six days... in the Sahara desert! And that's not to mention finishing times. In the first modern Olympic Games (ooh, topical!) in 1896, the marathon was won by Spiridon Louis in 2 hours 58 minutes and 50 seconds. As I said previously, in the 2011 London marathon, 939 people beat his time. The world record for the marathon currently stands at 2 hours 3 minutes and 38 seconds, set by Patrick Makau in 2011. How much of this improvement is down to better training regimes and better race-day nutrition? How much is down to choice of footwear?
Overall, the lion's share of skill in any field appears to be teachable. The talented will always be "best", but only by a whisker.
We can all be "good at" anything, as long as we don't expect to be the best. After all, out of 7 billion people, it's pretty much impossible to be "best" at anything.
PS. Sorry I haven't progressed with the study of novels. I was with family at the weekend, and I never got my momentum back afterwards. I'm travelling at the weekend as I'm starting a new job on Monday. If the weather's bad where I am, I might make some progress in the evenings. Otherwise, I'll be out exploring.
30 May 2012
Authentics: long and short pt 6
Well it seems like I spend as much time writing up my figures as generating them, which is a good thing as it's when I'm writing that I'm most actively considering the implications of what I've done so far.
The next set of figures took very little time to generate, and gave me something to think about. First I copied the code that identifies all types containing uppercase characters, and wrote a revised version that checks if the same type occurs elsewhere in all lowercase. This gives me two sets of types -- those "not" in lowercase and those "never" in lowercase. I wanted to examine the difference to see how this affects my previous results.
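The two sets described above can be sketched roughly like this (the function and variable names are mine, not taken from the original code):

```python
# Given a list of tokens (e.g. from nltk.word_tokenize), build the two
# sets of types discussed above:
#   "not lowercase"   -- types containing at least one uppercase character
#   "never lowercase" -- those whose all-lowercase form never occurs in the text
def uppercase_type_sets(tokens):
    types = set(tokens)
    not_lowercase = {t for t in types if any(c.isupper() for c in t)}
    never_lowercase = {t for t in not_lowercase if t.lower() not in types}
    return not_lowercase, never_lowercase

tokens = ["The", "steps", "Steps", "Hannay", "said", "Said"]
not_lc, never_lc = uppercase_type_sets(tokens)
# "Steps" and "Said" have lowercase counterparts in the text, so they are
# "not lowercase" but drop out of "never lowercase"; "Hannay" stays in both.
```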
I then generated the following table for John Buchan's first three Richard Hannay novels (The 39 Steps, Greenmantle and Mr. Standfast):
|                           | The 39 Steps | Greenmantle | Mr Standfast |
| No of tokens              | 44625        | 107409      | 140425       |
| Type:token                | 14%          | 11%         | 10%          |
| Tokens not in lowercase   | 919          | 1708        | 2262         |
| Tokens never in lowercase | 604          | 1193        | 1565         |
| never:not                 | 66%          | 70%         | 69%          |
Excellent -- at least two thirds of the types I've been ignoring are still ruled in as candidates, so my previous figures and their conclusions aren't necessarily invalid.
So, with the lists shrinking, I could get a closer look at them. I've already pointed out the problem with "I" -- it's always in uppercase, so always a candidate for elimination. I can hard-code it as an exception, but for now I want the code to be as general and language-agnostic as possible -- and besides, it's one word in thousands.
One other language-specific problem that comes up is the fact that nationalities and languages are capitalised in English, but not in other languages, and it's fair to say that most learners of English would be expected to know/learn words like "English", "Scottish" and "Irish", so it's not necessarily right to rule them out. And of course they would also know "England", "Scotland" and "Ireland", so I'm no longer even sure that ruling out proper nouns leaves us with a more valid measure of difficulty for language learners.
Looking at the list also indicated a problem technology-wise: NLTK's word_tokenize leaves punctuation attached to its tokens, so "end." is a different type from "end", and any word at the start of a passage of direct speech is likely to be counted as a new type. This skews my type:token ratios slightly, but I can't be sure whether it's statistically relevant. But it's the question of direct speech that is making me think most. Consider the real examples (from The 39 Steps) of "'Let" and "'Then". You can be pretty sure that "let" and "then" will have occurred before this. But notice that the words with the leading quotemark start with capitals, as direct speech in prose tends to do. This means they're being eliminated. So do I really need to account for this? Is it statistically significant enough to worry about?
If I was doing this as real, serious research, I would need to write something to strip out appropriate punctuation (or, better, find something that someone else has written to do the same thing).
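As a rough sketch of what that stripping might look like (a hypothetical clean_tokens helper, not part of NLTK -- it just trims punctuation and quotemarks off either end of each token):

```python
import string

# Straight and curly quotemarks, on top of standard ASCII punctuation
PUNCT = string.punctuation + "\u2018\u2019\u201c\u201d"

def clean_tokens(tokens):
    """Strip leading/trailing punctuation (including quotemarks) from
    each token, and drop any token that was punctuation only."""
    cleaned = []
    for tok in tokens:
        stripped = tok.strip(PUNCT)
        if stripped:
            cleaned.append(stripped)
    return cleaned

print(clean_tokens(["'Let", "end.", "--", "then"]))  # → ['Let', 'end', 'then']
```

This wouldn't fix the capitalisation of sentence-initial and speech-initial words, but it would at least stop "end." and "'Let" being counted as separate types.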
Anyway, so I decided to throw in another book and see how it came out.
In Pride and Prejudice by Jane Austen, only 49% of the "not lowercase" tokens were present in the "never lowercase" set. Curious, I decided to expand my figures a bit...
| The 39 Steps | Greenmantle | Mr Standfast | Pride & Prejudice |
No of tokens | 44625 | 107409 | 140425 | 138348 |
No of types | 6461 | 11381 | 13904 | 8171 |
Type:token | 14% | 11% | 10% | 5.9% |
Tokens not in lowercase | 919 | 1708 | 2262 | 669 |
Tokens never in lowercase | 604 | 1193 | 1565 | 300 |
never:not | 66% | 70% | 69% | 45% |
never:types | 9.3% | 10.5% | 11.3% | 3.7% |
OK, that last row's something new: the "never:not" ratio was something I calculated to work out how inaccurate I'd been previously -- it's not something that offers any meaningful results as such -- so I wanted a measure that constitutes a goal in itself: "never:types" is the ratio of the types in the "never lowercase" category as a proportion of all types. It's intended as a rough estimate of the density of proper nouns in a text (still incorporating all the inaccuracies previously discussed). It's notable how both the never:not and never:types ratios are so consistent within the three Buchan novels, and yet so different from Austen's writing. You'd probably expect that, though -- personal style and genre affect this greatly (the Hannay novels involve a lot of travel, Pride and Prejudice is restricted in terms of characters and locations).
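For the record, the measures in the table can be computed from a token list along these lines (a sketch with a hypothetical corpus_stats helper; "never lowercase" here means a wordform that appears with a capital and whose all-lowercase form never occurs in the text):

```python
def corpus_stats(tokens):
    """Compute the table's measures for one token stream."""
    types = set(tokens)
    lowercase_types = {t for t in types if t == t.lower()}
    # wordforms containing at least one capital
    not_lowercase = types - lowercase_types
    # of those, the ones whose lowercase form never occurs on its own --
    # the rough proper-noun estimate
    never_lowercase = {t for t in not_lowercase if t.lower() not in lowercase_types}
    return {
        "tokens": len(tokens),
        "types": len(types),
        "type:token": len(types) / len(tokens),
        "not lowercase": len(not_lowercase),
        "never lowercase": len(never_lowercase),
        "never:not": len(never_lowercase) / len(not_lowercase),
        "never:types": len(never_lowercase) / len(types),
    }
```

So for a toy stream like ["I", "went", "to", "London", "The", "end", "the"], "The" is ruled out as a candidate (because "the" occurs) while "I" and "London" stay in the "never lowercase" set.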
All this umming and ahhing over proper nouns is getting distracting. For now, I'll proceed with the "never lowercase" types as a rough estimate. I'm not really looking for accurate numbers anyway, just proportions, and this should be good enough for now.
I really need to move on and look at the question that was biggest in my mind when I started out: what's the difference between reading a series of books and reading multiple books by the same author?
So it's back to Project Gutenberg to find some likely candidates. Émile Zola vs Victor Hugo, perhaps....
29 May 2012
Authentic: long v short pt 5
Today was a wee bit frustrating. I spent a solid chunk of time trying to get the NLTK data module installed, and with it the file english.pickle that would have allowed me to do part-of-speech (POS) tagging. This would have made it almost trivially easy to eliminate the proper nouns and get a genuine look at the real "words" that are of interest to the learner.
Ah well, it looks like it wasn't meant to be.
So I started working towards custom code to eliminate the proper nouns manually, something which would be handy in the future anyway. The first step was to identify some candidates for further inspection, and seeing as I'm working with English, that's pretty easy: if it's not all in lower case, there's something funny about it. I wrote the code to identify all the tokens (words) that contained capitals. Yes, at this point I could have checked whether or not each one was at the start of a sentence, but that wouldn't have really helped, because proper nouns occur at the start of sentences too, so I'd still need to check.
When I generated my set of candidates, though, it was a little long. For The 39 Steps, I was looking at 919 tokens to check manually, and that's a fairly short book. As I'm doing this for fun, it seemed like checking that many would be a little bit boring, particularly in longer books. (I later checked the candidate set for the 3 books in total, and it turned out to be over 3000 words, which is more than my time's worth.)
My first quick test, then, was to have a look at the difference in figures: before properly addressing the proper nouns, I wanted to see how big a difference this crude adjustment makes. Eliminating every single item with any capitals in it drops the type:token ratio in The 39 Steps from 14.48% to 13.14% -- that's almost a 10% drop (it's 1.35 percentage points, but 9.27 percent in relative terms). That seemed just a little too high to realistically be driven by proper nouns alone. But can that be? I mean, how many words are likely to occur only at the start of sentences?
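The percentage-points-versus-percent distinction is worth pinning down, since it recurs below. A quick sketch (a hypothetical describe_drop helper; the post's 1.35 / 9.27 figures presumably come from unrounded ratios, so the rounded inputs here land fractionally lower):

```python
def describe_drop(before, after):
    """Express a fall in a ratio both as a percentage-point difference
    and as a relative (percent) change of the original value."""
    points = before - after
    relative = points / before * 100
    return points, relative

points, relative = describe_drop(14.48, 13.14)
print(f"{points:.2f} percentage points, {relative:.2f}% relative drop")
```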
So on I went, hoping that the data I could generate at this stage would start to shed some light on this figure.
The first graph I produced showed me the running type:token ratios and introduction rates for both the full token set, and the token set with non-lowercase words eliminated:
The two pairs of lines follow each other pretty closely, getting closer together as they progress. But in order to get a clear idea of what was going on, I had to do as I did last time and go up another level of abstraction to measure some useful differences. So here is the difference between the running ratios for all words and lowercase only, and the corresponding difference in introduction rates:
Now you'd be forgiven for thinking that the difference is diminishing here -- I was fooled into thinking the same thing, but then I realised I was dealing with numbers here rather than proper stats, and I redid the analysis but with a difference in percentage:
The overall running type:token ratio does indeed decrease, but it halves (20% down to 10%) then stabilises. The introduction rate, on the other hand, is all over the place -- there's no identifiable trend at all. Even subsampling my data didn't give any clear and understandable trends. And since I'm using a desktop office package for my analysis, it's a bit of a faff to do the resampling automatically -- further proof that I need to get myself familiar with the statistical analysis tools for Python (eg numpy), but my head's full of the NLTK stuff for now, so I'll leave the improved statistics for another time. Here are the same graphs, but with 2000 word samples instead of 500 word samples:
So not promising, really. Still no stable, identifiable trends.
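For what it's worth, the two measures being graphed can be computed per sample along these lines (a sketch with a hypothetical running_measures helper, not my actual analysis code):

```python
def running_measures(tokens, sample_size=2000):
    """For each successive sample of `sample_size` tokens, compute:
    - the running type:token ratio (types seen so far / tokens seen so far)
    - the introduction rate (types new in this sample / sample size)"""
    seen = set()
    results = []
    for start in range(0, len(tokens) - sample_size + 1, sample_size):
        sample = tokens[start:start + sample_size]
        new_types = set(sample) - seen
        seen |= set(sample)
        running_ratio = len(seen) / (start + sample_size)
        intro_rate = len(new_types) / sample_size
        results.append((running_ratio, intro_rate))
    return results
```

Resampling at a different granularity is then just a matter of calling it again with a different sample_size, rather than reshuffling cells in a spreadsheet.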
Books as a series
But I had all the infrastructure in place now, so I figured I might as well rerun the analysis on the 3 books as a single body and see what came out. Let's just go straight to the relative difference between the lines for all words and eliminating all words not entirely in lower case:
Oooh... now where did I leave those figures on where the individual books started...? 44625 and 152034, and there's a notable period of high difference (20-30%) from about 45000 words, and that massive spike you see on the graph -- which is actually a 63.64% difference -- occurs from 152000-152500.
Bingo: we've got decent support for Thrissel's suggestion that a lot of proper nouns are introduced early on in... at least some novels.
Not the sort of information I was originally looking for, but actually quite interesting. It's kind of turning the project in a slightly different direction than I had planned. I'll just have to go with the flow.
What I did wrong today
One of the minor irritations of the day was when I started writing up my results, and after having done the coding, data generation and analysis, I realised a fairly simple refinement I could have made. It was a real *palmface* moment: I could have simply taken my first list of candidate proper nouns and eliminated any candidates that also appeared completely in lower case. Having done that, I would have been left with a much shorter list of candidates, and it may well have been worth my time manually checking the results.
>sigh<
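The missed refinement can be sketched in a couple of lines (a hypothetical refine_candidates helper): start from all wordforms containing a capital, then discard any whose all-lowercase form also occurs somewhere in the text.

```python
def refine_candidates(tokens):
    """Shrink the manual-checking list: keep only capitalised wordforms
    whose lowercase form never occurs in the text."""
    types = set(tokens)
    candidates = {t for t in types if t != t.lower()}
    # 'The' is dropped if 'the' occurs elsewhere; 'London' is kept
    return {t for t in candidates if t.lower() not in types}
```

Anything left after this filter is a much more plausible proper-noun candidate, and a much shorter list to eyeball.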
But of course, that's as much the point of the exercise as anything: to work through the process and the problems and to start thinking about what can be done better.
It also occurs to me now that I managed to eliminate every single occurrence of the word "I" from the books! Quite a fundamental error, even if it only made a minute difference to the final ratios.
Perhaps I'm being a little too "hacky" in all this. I'll have to pick up my game a bit soon....
28 May 2012
Authentics: long v short - pt 4
Well, I had a nice weekend and visited some friends in Edinburgh for a wedding. The weather's too good to spend too much time inside, so I'll just write up a few more tests then go and enjoy the sunshine.
Multiple books in a series
Today's figures come from two of the books I've already mentioned -- John Buchan's The 39 Steps and Greenmantle, and the next book in the series: Mr Standfast.
Again, the graphs at different sample sizes show different parts of the dataset more clearly than others:
The graph at 1000 words is too unstable to clearly identify the end of the first book, and the start of the second book is only identifiable because the introduction rate is greater than the running ratio for the first time in any of my tests.
The 2500 and 5000 word samples give us a clear spike for the end of the first book, but the end of the second book is obscured slightly by noise, and becomes clearer again in the 10000 and 25000 word sample sizes, although the end of the first book is completely lost by the time we reach the 25000 word sample graph.
Having done all that, I went back and verified that the peaks matched the word counts -- The 39 Steps is 44625 words long, and Greenmantle is 107409 words long, so ends with the 152034th word. The peaks on the graphs all occur shortly after 45000 and 150000.
It was the first graph, from the 1000 word sample set, that piqued my curiosity. Having spotted the two lines intersecting at the start of the third book, I decided to check the difference between the running type:token ratio and the introduction rate, and I graphed that. At 1000 word samples, there was still too much noise:
However, given that I already knew what I was looking for, I could tell that it showed useful trends, and even just moving up to 2500 word samples made the trends pretty clear:
Going forward, I need to compare the difference between books in a series, books by the same author (but not in a series) and unrelated books, and I believe that the difference between the running type:token ratio and the introduction rate may be the best metric to use in comparing the three classes.
Problems with today's results
I can't rule out the possibility that the spikes in new language at the start of each book are heavily influenced by the volume of proper nouns, as Thrissel suggested, so I'm probably going to have to make an attempt at finding a quick way of identifying them. The best way of doing this would be to write a script that identifies all words with initial caps in the original stream, then asks me whether or not each one is a proper noun.
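A minimal sketch of that script (a hypothetical classify_candidates helper; the checking function is injectable so the interactive input() prompt can be swapped out for testing):

```python
def classify_candidates(candidates, is_proper_noun=None):
    """Walk through initial-caps candidates and sort them into proper
    nouns and ordinary words, asking the user about each one by default."""
    ask = is_proper_noun or (
        lambda w: input(f"Is '{w}' a proper noun? [y/n] ").strip().lower() == "y"
    )
    proper, common = set(), set()
    for word in sorted(candidates):
        (proper if ask(word) else common).add(word)
    return proper, common
```

The resulting proper-noun set could then be subtracted from the types before recomputing the ratios.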
By treating the three books as one continuous text in the analysis, it looks like I've inadvertently smoothed out the spike somewhat at the start of each book. In future I should make sure individual samples are taken from one book at a time so that the distinction is preserved.