Lingua Frankly: language

16 January 2014

Is language like science...?

Quite often, when I talk about the rules of language, I find I get hit with the response "Language isn't like science!" When I talk about teaching language systematically, people say "Language isn't like science." When I talk about language in schools, I'm told it's destined to fail because "Language isn't like science."

Well, I'm in the middle of trying to sort through a lot of old stuff that had been stored in the loft, and I came across a piece of paper on which I had hastily scrawled the following:

Language isn't like science.

Why?

It's about choosing the rules, not knowing the rules.

Now this is not a statement of my belief; rather it's my attempt to understand the logic behind the statement, so for anyone other than myself to get my full meaning requires a bit more explanation.

The reason people say "language isn't like science" is because of their misconception of the nature of science. To them, science has been presented as a series of rules to be memorised. They have been conditioned to think that the end goal of science is to be able to regurgitate the rules on demand, because that's all that was required of them in school.

That is not science.

Science is the art of investigating natural phenomena and finding explanations and models for them. These explanations and models are mostly a combination and application of existing scientific rules, and sometimes of identifying and creating new rules.

Or to put it another way: science is not about the knowledge of rules, it's about the application of rules.

But would the same statement not hold for language too? Language is not about the knowledge of rules (we can all agree on that) but the application of rules, surely?

Science is very often taught badly, in that there is such a focus on the rules themselves that students never get the chance to integrate those rules into a working body of scientific knowledge. This leaves the student able to recall the rule or law by name, but not able to recall it when faced with a problem that requires that rule in order to reach a solution. You cannot solve a useful scientific problem this way -- the only type of problem that can tell you explicitly which rules are required to solve it is a problem that has already been solved, and science is about creating new knowledge, not repeating the known ad nauseam.

A good course in science will instead train the student in identifying the characteristics of a problem domain and noticing patterns that relate to particular laws or rules: they will teach them how to select the appropriate rule for the given situation.

That, I contend, is the very same process we go through when we try to formulate an utterance. We have a bank of words and grammatical rules at our disposal, and we have to select the appropriate items from it to express the message that we want.

So language is a lot like science, and the objections typically raised against grammar teaching are systemic problems that also affect science teaching. It's a problem that the late, great Richard Feynman recounted in his memoir Surely You're Joking, Mr Feynman?, when he talks of his experience on sabbatical placement in Brazil. It's a problem that affects all education systems to a greater or lesser extent.

But the problem comes when reformers attempt to throw the baby out with the bathwater: "rules teaching has failed," they tell us, "so we need to do away with rules."

That, to me, is a ridiculous philosophy. How can you choose which rule to apply if you don't know what rules exist? How can you search for it if you don't know what it is?

Let's be clear, I do not have to be able to recite the present tense endings of regular -ARE verbs in Latin in order to usefully "know" the rule, but that doesn't mean I shouldn't be taught them. I initially learned Spanish, for example, by the explicit teaching of the endings, and the explicit teaching of rules like 2s = 3s+"s" and 3p=3s+"n" (NB: this is my notation, not the way I was taught the rule!), and not by memorising the list of conjugations or a table. But that was still explicit teaching. I did not learn by osmosis, I did not learn by exposure, I did not learn by magic. I was told what my range of choices was, then given sufficient opportunities to make those decisions that I eventually could make the decision subconsciously.
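
To make the two rules in my notation concrete, here is a minimal Python sketch. It assumes a regular verb in the present tense, and the function name and the example verbs (habla, come, vive) are my own choices for illustration, not anything from a course.

```python
def present_from_3s(third_singular):
    """Derive the 2s and 3p forms from the 3s form.

    Applies the two rules quoted above: 2s = 3s + "s" and 3p = 3s + "n".
    Assumes a regular present-tense verb; irregulars such as "es" won't fit.
    """
    return {
        "3s": third_singular,
        "2s": third_singular + "s",
        "3p": third_singular + "n",
    }


# Example 3s forms chosen for illustration only
for form in ("habla", "come", "vive"):
    print(present_from_3s(form))
```

The point is only that the learner needs to hold one form and two small rules, rather than a whole table.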

25 May 2012

Authentics: long vs short part 3

Excuse the slight change of title -- I figured the original long title was probably getting truncated in people's feeds, so I wanted to abbreviate it. If you've been following my blog recently, you should have already seen my previous two posts on my little project; I am trying to investigate whether my normal advice, that long fiction (novels or TV serials) is better than short fiction (short stories and feature films) for the learner, actually holds up.

Sample sizes
Anyway, as I said last time, I wanted to start comparing a fixed length of text, rather than variable-length chapters, as my benchmark. I was looking for a sampling length that would give a clear picture of the overall progression without having too much interference from little local fluctuations. My first set of results suggests that this is a fool's errand. The following set of images shows the graphs for the novel Greenmantle by John Buchan, with samples taken every 1000, 2500, 5000 and 10000 words.

While using larger samples gives a much smoother line, it also unfortunately obliterates some of the most important detail in the graph, in that we start to lose the steep drop at the start -- that's information that's really crucial to my investigation, so I'll have to put up with various humps and wiggles in the line for now. However, that's not to say that the other graphs aren't interesting in and of themselves -- the little hump at around 50000-60000 words in the 5000 word sample version suggests that something important may be happening at this point in the story, causing a batch of new vocabulary to be introduced, or perhaps the introduction of a new character with a different style of speech. Anyway, as interesting as that may be, it would be a diversion from the matter at hand.

Alternatively, I could move away from using linear sampling/projections and start charting using logarithmic or exponential data, and while now would be a good time to start refreshing my memory on that sort of statistical analysis, it also risks diverting me from the task at hand, and I'm following the Coursera.org machine learning course currently, so I should be able to get the computer to do the work itself in a few weeks anyway. Besides, I've still not got myself a high-frequency word list, and the pattern might be completely different once I've eliminated common words of English from the equation.

So for now I'll stick to working with multiple sample sizes. I'll admit to being a bit simplistic in my approach to this so far, as I ran my little Python program once for every sample size, rather than just running it once with the smallest sample size then resampling the data.

The program I'm using at the moment is pretty straightforward:

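def collect_stats(token_list, step_size):
    # Build [[words_read, types_so_far], ...], sampling every step_size words
    i = 0
    return_array = []
    while i < len(token_list):
        i += step_size
        running_types_total = number_of_types(token_list[:i])
        if i < len(token_list):
            return_array.append([i, running_types_total])
        else:
            # Final (possibly short) sample: record the true length of the text
            return_array.append([len(token_list), running_types_total])
    return return_array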

This takes an NLTK token list (it would work with any simple list of strings too, though) and the size of samples to be taken, then builds up a list of lists [[a1,b1],[a2,b2],...] where each a is the number of the last word included in the sample, and each b is the number of unique tokens from the beginning of the text to the ath word.

The number_of_types function just returns len(set(w.lower() for w in token_list)).

This means that at every stage I have a running total of unique tokens, and it's only when I want to produce a graph that I calculate the number of new ones in the given slice (= b(n) - b(n-1)), and there's therefore no reason why I can't skip several slices to decrease my sampling rate (eg b(n) - b(n-3)).
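
Here is a rough sketch of how that resampling might look, so the program only needs to run once at the smallest sample size. The helpers resample and new_types_per_slice are hypothetical names of my own, collect_stats is rewritten compactly so the block stands alone, and the toy sentence stands in for a real token list.

```python
def number_of_types(token_list):
    """Count distinct word forms, ignoring case."""
    return len(set(w.lower() for w in token_list))


def collect_stats(token_list, step_size):
    """Running totals: [[words_read, types_so_far], ...]."""
    stats = []
    for end in range(step_size, len(token_list) + step_size, step_size):
        end = min(end, len(token_list))  # the last sample may be short
        stats.append([end, number_of_types(token_list[:end])])
    return stats


def resample(stats, factor):
    """Keep every factor-th sample, plus the final one (hypothetical helper)."""
    kept = stats[factor - 1::factor]
    if stats and (not kept or kept[-1] != stats[-1]):
        kept.append(stats[-1])
    return kept


def new_types_per_slice(stats):
    """Turn running totals b(n) into per-slice counts b(n) - b(n-1)."""
    previous = 0
    result = []
    for position, total in stats:
        result.append([position, total - previous])
        previous = total
    return result


# Toy data standing in for a real token list
tokens = "the cat sat on the mat and the dog sat on the cat".split()
fine = collect_stats(tokens, 2)   # smallest sample size, computed once
coarse = resample(fine, 2)        # equivalent to sampling every 4 words
print(new_types_per_slice(coarse))
```

Because every entry is a running total from the start of the text, dropping intermediate entries loses nothing: the coarse list is the same as the one a fresh run with the larger step would give.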

Next up
I've taken a running sample of three books from the same series -- The 39 Steps, Greenmantle and Mr Standfast, and run them through as one text, so I'll look at the output of that next, but I don't think it'll be much use until I've got something to compare with -- either/both of: a selection of novels by one author that aren't a series; and a selection of novels by different authors.

29 December 2011

Counterintuitive, perhaps, but sometimes it's easier to start with the harder material...

In general, whenever we teach or learn something new, we start with the easy stuff then build on to the more difficult stuff. But this isn't always a good idea, because sometimes the easy stuff causes us to be stuck in a "good enough" situation.

When I started learning the harmonica, I learned to play with a "pucker technique", ie I covered the holes with my lips. The alternative technique of "tongue blocking" (self descriptive, really), was just "too" difficult for me as a learner. So for a long, long time, the pucker was "good enough" and tongue blocking was too difficult for not enough reward. It limited my technique for a good number of years, and now that I can do it, I wish I'd learnt it years ago.

The same block of effort vs reward happens in all spheres of learning. If you learn something easy, but of limited utility, it's far too easy to just continue along doing the same old thing, and it's far too difficult to learn something new, so you stagnate. Harmonicas, singing, swimming, skiing, mathematics, computer programming; there's always the temptation to just hack about with what you've got rather than learn a new and appropriate technique.

This problem, unsurprisingly, rears its ugly head all too often in language learning, but with language it has an altogether insidious form: the "like your native language" form. If you've got a choice of forms, one is going to be more like your native language than the other, and this is therefore easier to learn. Obviously, this form is going to be "good enough", and the immediate reward to the learner for learning the more difficult form (ie different from the native language) isn't enough to justify the effort. However, in the long term, the learner who seeks mastery is going to need that form in order to understand language encountered in the real world.

The problem gets worse, though, when you're talking about dialectal forms.

Here's an example. Continuous tenses in the Celtic languages traditionally use a noun as the head verbal element (known as the verbal noun or verb-noun). I am at creation [of] blog post, as it were. Because it's a noun, the concept of a "direct object" is quite alien, and instead genitives are used to tie the "object" to the verbal noun. In the case of object pronouns, they use possessives: I am at its creation instead of *I am at creation [of] it. Note that the object therefore switches sides from after to before the verbal noun.

Now in Welsh, the verbal noun has become identical to the verb root, and is losing its identity as a noun. This has led to a duplication of the object pronoun, once as a possessive, once as a plain pronoun -- effectively I am in its creation [of] it. This really isn't a stable state, as very few languages would tolerate this sort of redundancy, and the likely end-state is that the possessive gets lost, and the more English-like form (I am in creation [of] it) will win out. In fact, there are many speakers who already talk this way.

But for the learner, learning this newer form at the beginning is a false efficiency. There are plenty of places where the old form is still current, so unless the learner knows for certain that they'll be spending their time in an area with the newer form, they're going to need the conservative form anyway. To a learner who knows the conservative form, adapting to the newer form is trivially easy, but for someone who knows only the newer form, the conservative form is really quite difficult to grasp.

So teaching simple forms early risks restricting the learner's long-term potential. While you want to make life simple for yourself or your students, make sure you're not doing them or yourself a disservice.

24 December 2010

Dialogues from Day One.

I discussed dialogues briefly in an earlier post on expository and naturalistic language. Fasulye suggested in the comment section that dialogues didn't necessarily lead to the use of unnaturalistic language. OK, so I didn't say that they did -- the point I raised was that dialogues aren't a "magic bullet" that makes all language seem naturalistic.

However, that said, I'm not a big fan of dialogues anyway, so today I'm going to talk about how starting a course with dialogues from the very first lesson actually slows down progress for the learner.

My contention:
The need for a coherent dialogue forces the author to use language that the student isn't yet ready to understand.
The dialogue format makes the learner move between such a variety of different language that the student is forced to attempt to learn too many things at once.

I'll use as my example one of the ever-popular Teach Yourself books.

Lesson 1 of TY Welsh opens with the following dialogue (my translation):
Matthew: Good morning.
Elen: Good morning. Who are you?
Matthew: I'm Matthew.
Elen: How's things? I'm Elen, the Welsh course tutor.
Matthew: I'm a learner, a very nervous learner!
Elen: Welcome to Lampeter, Matthew. Don't be nervous, everything will be fine.

What do we start off with? It's those old favourites -- hello, what's your name etc.

But what does this teach us?

Let's have a look at the Welsh for "who are you" and "I'm Matthew": "Pwy dych chi?" and "Matthew ydw i".

These two phrases are completely alien to the English speaker. There is only one clue that the English speaker can use to try to make sense of this -- the name "Matthew". A learner might assume that "pwy" and "ydw" are linked, but they're not -- "dych" goes with "ydw", even though the two are not visibly related.

This is the verb "to be", and this problem isn't unique to Welsh -- consider the English "are", "am" and "is". So even when we look at dialogues from an entirely expository point of view, we have a problem that means we have too many unknowns for the new learner.

Consider the following (not a real example) as though it was in lesson one:
John: Are you tired?
Sally: Yes, I am tired.

You as a learner are asked to contrast the question with the answer, but we have a massive amount of variation in a very simple sentence. First of all, we have the matter of the irregular verb forms, as above. Secondly, the pronouns are radically different (as in most languages). Finally, we have a change of word order. Learners could confuse their verbs and pronouns, and miss the word order entirely.

OK, that's not a real lesson 1 example, but I've already given a worse example from the Welsh course: Pwy dych chi? In the Welsh, the word order doesn't change for the answer Matthew ydw i, but that's arguably as difficult for an English speaker as English word order is for speakers of a language that doesn't change order. We also have no repeated recognisable word form to highlight the word order in Welsh. There are an awful lot of rules in play here, each interacting to make the full meaning of the sentence. Without seeing these in isolation, the role of individual elements is obscured.

And it's even more complicated in French. Many courses will introduce Comment t'appelles-tu? and the response Je m'appelle Jean-Pierre (or whatever name). This introduces the complication of the reflexive pronoun, which is a version of the object pronoun. Well, actually, the reflexive pronoun is identical to the normal object pronoun for "me" and "you", which actually makes this more confusing. While the change of word order for the question is theoretically the same as English, the lack of auxiliary do (eg Do you know?) in French questions makes it completely different to the untrained eye. The fact that this places the object before the subject is particularly alien to the English speaker. This is massively difficult, and so the learner is only expected to memorise or learn to recognise the phrase. The assumption here is that by exposure to later examples, the learner will induce the underlying patterns, but this is something that dialogues are actually very bad at.

Dialogues by their nature attempt to model naturalistic conversations, and this leads them to include a very wide variety of language. Unfortunately, variety means very little repetition, so there is very little material to induce the rules from. It gets worse when the writer is trying particularly hard to be naturalistic, because many of the expository cues are lost. Remember this from earlier? I'm a learner, a very nervous learner! Notice that this uses elision (the omission of repeated words) for increased naturalness, but in doing so misses the opportunity to reinforce the structure "I am".

French courses rarely follow up the je m'appelle with any other reflexive constructions -- the only thing it is contrasted with is usually il/elle s'appelle (he/she/it is called). The student is left knowing the phrase for a long time without being given the input to learn why it means what it means. In fact, this risks interfering with normal (non-reflexive) object pronouns, because the learner is overexposed to the reflexive form, and unexposed to the base form for a long time.

The root cause of the problem

The language in a naturalistic dialogue is linked by context, and elision is a major feature of natural language.
In short, we actively avoid repeating language in a conversation.

This leaves us teaching language that is only bound by context, so is semantically reinforcing, but not syntactically reinforcing.

If we progress in a language by learning a new word, it opens up a few extra possibilities, but learning new grammatical structures can double our knowledge of the language.

So imagine you know "I like...", "I have..." and "cars", "trees" and "dogs" -- you can say 6 combinations. If you next learn to say "cats", that's an additional two sentences -- "I like cats" and "I have cats" -- so 8 in total.

But if instead you learn the negation "don't", that doubles the number of sentences to 12.
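
The arithmetic is easy to check by brute force. This is just a sketch that enumerates the sentences from the example above; the sentences function and its parameters are my own invention for the purpose.

```python
from itertools import product


def sentences(verbs, nouns, negations=("",)):
    """List every "I [don't] VERB NOUN" sentence the learner can build."""
    return [
        " ".join(part for part in ("I", negation, verb, noun) if part)
        for negation, verb, noun in product(negations, verbs, nouns)
    ]


verbs = ["like", "have"]
nouns = ["cars", "trees", "dogs"]

base = sentences(verbs, nouns)                        # 2 x 3 = 6
with_cats = sentences(verbs, nouns + ["cats"])        # 2 x 4 = 8
with_negation = sentences(verbs, nouns, ("", "don't"))  # 2 x 2 x 3 = 12

print(len(base), len(with_cats), len(with_negation))
```

A new noun adds one sentence per verb, whereas a new structure multiplies the whole set, which is why the second kind of learning pays off so much faster.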

Massive growth in beginner language is only possible if you focus on teaching language points that can be combined within a sentence to make bigger and more complicated sentences. The dialogue format militates against this, and after one dialogue-based lesson, a learner is not likely to be able to produce even as much as is in the dialogues themselves. Compare with the Michel Thomas courses where (even excluding the -ible/-able words) the learner has a range of expression that while limited still covers dozens of different possible sentences. By building on this, the student experiences almost exponential growth. That's cool.

25 October 2010

Why English is a poor international language.

English is now the international language of trade and commerce, but it's not fit for purpose. That's not to say any other language genuinely is either. For all the spelling quirks, inconsistent borrowings and weird pronunciations in English, the most important problem, to my mind, is the result of the natural evolution of language.

Language evolved to be spoken, for face-to-face communication. It's only modern technology that has allowed remote communications (written and by telephone) to really take off.

Languages take advantage of the face-to-face medium implicitly. We have three grammatical "persons". The first is who is speaking, the second is who is being spoken to, the third is anybody else -- absolutely anybody else.

To me, this is one of the concrete physical underpinnings of language, which is not as abstract as some would like to think.

Ramachandran and Hubbard put forward the case for language as a synaesthetic phenomenon. Even if this is overstating the case, their theory uses the proximity of the auditory parts of the brain to the parts involved in physical movements. Signers have often held that they are not "reading" or "writing" when they engage in a sign-language conversation, but speaking, and it has long been accepted that sign languages are genuine languages, not mere abstract codes. Academics other than Ramachandran and Hubbard have proposed that language is a series of gestures, but that they just happen to be gestures of the mouth. Ramachandran and Hubbard merely suggest a mechanism that would allow us to perceive these gestures in the absence of visual data.

But I'm at risk of digressing here, as this theory is something I find absolutely fascinating.

The 1st, 2nd, 3rd distinction is not just about people, but also more generally about location. Many languages have 3 words where English only has "here" and "there". If you think about it, "here" is "where I am". "There" is merely anywhere that I am not. However, in Gaelic, "an seo" is where I am, "an sin" is where you are, and "an siud" is where neither of us are -- a "third place", effectively. In older English we had "here", "there" and "yonder", and we still have remnants of this distinction in the phrase "this, that and the other".

This is where the physicality comes in. When we talk face to face, I can point to "you" and "me" unambiguously, but third parties would be a vague wave off to one side. Now, because you can see me, and I can see you, we know lots about who's speaking, not least of which is gender. Very few languages encode gender in their 1st and 2nd plural pronouns because it's not information that really is particularly useful. But in the 3rd person, it helps a great deal, because it helps us categorise and reduce the number of potential candidates. It lets us talk about 2 people without confusing them, if they happen to be of a different sex -- so he says, and she says, and he says...

But now on the internet, with text based communications and screen-names that are often not real names and give no clues about gender, what are we to do? In French, it's possible that someone would give themselves away by using a gender-specific adjective, but these are vanishingly rare in English. So when I refer to someone else's comments, I often end up arbitrarily ascribing a gender to them. And it's normally male, which often winds up women.

This alone can be sorted by using a truly ungendered language (Quechua as the new language of the internet, anyone?) but there's another problem that slips a lot of people by: in text-based conversations, "you" is also prone to misinterpretation.

Think about it. I can't see you. You can't see me. Am I really talking to you? Perhaps I'm talking to someone else, and calling him (or her!) "you". But you think I'm talking to you, because you're seeing that word "you" and there's no reason to think it's someone else. In the physical world, you would hear me saying "you" and you would see whether I was looking at you as I said it or not. On the internet there is no physical relationship, no "pointing", so the boundary between 2nd and 3rd person has been completely broken. This leads to confusion and unnecessary offense so often, not just on the text-based side of the internet, but in conference calls too.

I've been on phone conferences where someone's asked a question to "you", and no-one has answered because they all think it's someone else who is being addressed. In language classes, a teacher will start by asking "how are you?" to the whole class, and everyone will answer in turn, but on an internet tutorial, the latest joiner appears to assume that the question was addressed to him only, and starts a conversation.

A true "remote" language would have to have a radically different structure, and perhaps people would reject it. What would it be?

Maybe it would just be a matter of collapsing second and third into one. This is already how many IE languages handle politeness. Even in English, we are somewhat familiar with the idea of speaking about the 2nd person in the 3rd person, even if only in posh restaurants and in period drama. "Does sir want to see the menu?" "Is sir ready to order?"

Another option would be to maintain the 1st, 2nd, 3rd distinction, but (and this takes a bit of getting your head round) remove the "you" from the 2nd person singular and require that the person's name is used instead. "Do John want to see the menu?" "Are John ready to order?"

But what would have happened to language if humans had originally evolved in a conference call environment? This really messes with your head.

I suspect we would have had a 3 person distinction -- 1: me, 2: anyone on the call, 3: anyone not on the call -- but augmented by some manner of direct address vs reference in the 2nd, so that I can ask a question to a person in the call and refer to something someone else in the call said without ambiguity. So that's actually 4 persons. I think. My head hurts.

12 October 2008

You may have heard of the POOLS project, funded by the Leonardo II scheme during 2005-2007. They collected short videos under permissive and open licenses in a number of less-studied languages.

Well, now they're back with POOLS-T. The focus now is on tools, not materials; hence the T. Things are only just starting off, but there's already a lot of material from the first round of work as well as donated videos in a variety of languages.

Gordon Wells made some excellent and very professional videos under the title Scottish Island Voices. These are available in two versions, English and Gaelic, and have been included in the Pools project. You can hear an interview with him regarding the project courtesy of the Irish National Digital Learning Repository.

Finally, I've been playing with some of these videos, trying to figure out how best to use the materials. I've uploaded one of Gordon's sets of Gaelic films to YouTube, and I've been playing around with annotation and subtitling options. You can check them out on my YouTube channel, http://www.youtube.com/user/NiallBeag.

The Pools project currently hosts videos in the following languages:

Basque
Danish
Dutch
English
Gaelic
German
Lithuanian
Romanian
Spanish
