30 May 2012

Authentics: long and short pt 6

Well it seems like I spend as much time writing up my figures as generating them, which is a good thing as it's when I'm writing that I'm most actively considering the implications of what I've done so far.
The next set of figures took very little time to generate, and gave me something to think about.  First I copied the code that identifies all types containing uppercase characters, and wrote a revised version that checks if the same type occurs elsewhere in all lowercase.  This gives me two sets of types -- those "not" in lowercase and those "never" in lowercase.  I wanted to examine the difference to see how this affects my previous results.
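For anyone curious, the check itself is only a few lines of Python with NLTK -- this is a sketch of the idea rather than my exact code, and the filename is obviously just an example:

import nltk

def not_and_never_lowercase(tokens):
    # every distinct word form, case preserved
    types = set(tokens)
    # types containing at least one capital: "not in lowercase"
    not_lowercase = {t for t in types if t != t.lower()}
    # of those, the ones whose lowercase form never appears anywhere: "never in lowercase"
    never_lowercase = {t for t in not_lowercase if t.lower() not in types}
    return not_lowercase, never_lowercase

tokens = nltk.word_tokenize(open("39steps.txt").read())
not_lc, never_lc = not_and_never_lowercase(tokens)
print(len(not_lc), len(never_lc))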
I then generated the following table for John Buchan's first three Richard Hannay novels (The 39 Steps, Greenmantle and Mr. Standfast):
                             The 39 Steps   Greenmantle   Mr Standfast
No of tokens                 44625          107409        140425
Type:token                   14%            11%           10%
Tokens not in lowercase      919            1708          2262
Tokens never in lowercase    604            1193          1565
never:not                    66%            70%           69%

Excellent -- at least two thirds of the types I've been ignoring are still ruled in as candidates, so my previous figures and their conclusions aren't necessarily invalid.

So, with the lists shrinking, I could get a closer look at them.  I've already pointed out the problem with "I" -- it's always in uppercase, so always a candidate for elimination.  I can hard-code it as an exception, but for now I want the code to be as general and language-agnostic as possible -- and besides, it's one word in thousands.

One other language-specific problem that comes up is the fact that nationalities and languages are capitalised in English, but not in many other languages.  It's fair to say that most learners of English would be expected to know or learn words like "English", "Scottish" and "Irish", so it's not necessarily right to rule them out.  And of course they would also know "England", "Scotland" and "Ireland", so I'm no longer even sure that ruling out proper nouns leaves us with a more valid measure of difficulty for language learners.
Looking at the list also indicated a problem technology-wise: NLTK's word_tokenize leaves punctuation in with its tokens, so "end." is counted as a different type from "end", and any word at the start of a passage of direct speech is likely to be counted as a new type because of the leading quotemark.  This skews my type:token ratios slightly, but I can't be sure whether it's statistically relevant.  But it's the question of direct speech that is making me think most.  Consider the real examples (from The 39 Steps) of "'Let" and "'Then".  You can be pretty sure that "let" and "then" will have occurred before this.  But notice that the words with the leading quotemark start with capitals, as direct speech in prose tends to do.  This means they're being eliminated.  So do I really need to account for this?  Is it statistically significant enough to worry about?
If I was doing this as real, serious research, I would need to write something to strip out appropriate punctuation (or, better, find something that someone else has written to do the same thing).
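If I do come back to it, something along these lines would probably be enough as a first pass -- a rough sketch only, stripping leading and trailing punctuation (including those quotemarks) before the types are counted:

import string

def strip_punctuation(tokens):
    # characters to trim from either end of each token
    junk = string.punctuation + "‘’“”"
    cleaned = []
    for t in tokens:
        t = t.strip(junk)
        if t:  # drop tokens that were nothing but punctuation
            cleaned.append(t)
    return cleaned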

Anyway, so I decided to throw in another book and see how it came out.

In Pride and Prejudice by Jane Austen, only 49% of the "not lowercase" tokens were present in the "never lowercase" set.  Curious, I decided to expand my figures a bit...
                             The 39 Steps   Greenmantle   Mr Standfast   Pride & Prejudice
No of tokens                 44625          107409        140425         138348
No of types                  6461           11381         13904          8171
Type:token                   14%            11%           10%            5.9%
Tokens not in lowercase      919            1708          2262           669
Tokens never in lowercase    604            1193          1565           300
never:not                    66%            70%           69%            45%
never:types                  9.3%           10.5%         11.3%          3.7%

OK, that last row's something new: the "never:not" ratio was something I calculated to work out how inaccurate I'd been previously -- it doesn't offer any meaningful results in itself -- so I wanted a measure that constitutes a goal in its own right.  "never:types" is the ratio of types in the "never lowercase" category as a proportion of all types, intended as a rough estimate of the density of proper nouns in a text (still incorporating all the inaccuracies previously discussed).  It's notable how consistent both the never:not and never:types ratios are across the three Buchan novels, and yet how different they are for Austen.  You'd probably expect that, though -- personal style and genre affect this greatly (the Hannay novels involve a lot of travel, while Pride and Prejudice is restricted in terms of characters and locations).

All this umming and ahhing over proper nouns is getting distracting.  For now, I'll proceed with the "never lowercase" types as a rough estimate.  I'm not really looking for accurate numbers anyway, just proportions, and this should be good enough for now.

I really need to move on and look at the question that was biggest in my mind when I started out: what's the difference between reading a series of books and reading multiple books by the same author?

So it's back to Project Gutenberg to find some likely candidates.  Émile Zola vs Victor Hugo, perhaps....

29 May 2012

Authentics: long v short pt 5

Today was a wee bit frustrating.  I spent a solid chunk of time trying to get the NLTK data package installed, and with it the file english.pickle that would have allowed me to do part-of-speech (POS) tagging.  This would have made it almost trivially easy to eliminate the proper nouns and get a genuine look at the real "words" that are of interest to the learner.

Ah well, it looks like it wasn't meant to be.

So I started working towards custom code to eliminate the proper nouns manually, something which would be handy in the future anyway.  The first step was to identify some candidates for further inspection, and seeing as I'm working with English, that's pretty easy: if it's not all in lower case, there's something funny about it.  I wrote the code to identify all the tokens (words) that contained capitals.  Yes, at this point I could have checked whether each token was at the start of a sentence or not, but that wouldn't have really helped, because proper nouns occur at the start of sentences too, so I'd still need to check them.
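For the record, the candidate-finding code boils down to something like this (a sketch of the idea, not the exact code I ran):

def capitalised_candidates(tokens):
    # any type containing at least one capital letter is a candidate proper noun
    return sorted({t for t in set(tokens) if any(c.isupper() for c in t)})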

When I generated my set of candidates, though, it was a little long.  For The 39 Steps, I was looking at 919 tokens to check manually, and that's a fairly short book.  As I'm doing this for fun, it seemed like checking that many would be a little bit boring, particularly in longer books.  (I later checked the candidate set for the 3 books in total, and it turned out to be over 3000 words, which is more than my time's worth.)

My first quick test, then, was to have a look at the difference in figures -- before properly addressing the proper nouns, I wanted to see how big a difference this crude adjustment makes.  Eliminating every single item with any capitals in it drops the type:token ratio in The 39 Steps from 14.48% to 13.14% -- that's almost a 10% drop (it's 1.35 percentage points, but a 9.27 percent relative drop).  That seemed just a little too high to realistically be caused by proper nouns alone.  But can that be?  I mean, how many words are likely to occur only at the start of sentences?

So on I went, hoping that the data I could generate at this stage would start to shed some light on this figure.

The first graph I produced showed me the running type:token ratios and introduction rates for both the full token set, and the token set with non-lowercase words eliminated:
The two pairs of lines follow each other pretty closely, getting closer together as they progress.  But in order to start getting a clear idea of what was going on, like last time I had to go to another level of abstraction and measure some useful differences.  So here is the difference between the running ratios for all words and lowercase only, and the corresponding difference in introduction rates:
Now you'd be forgiven for thinking that the difference is diminishing here -- I was fooled into thinking the same thing, but then I realised I was dealing with raw numbers rather than proper stats, so I redid the analysis using the relative difference as a percentage:
The difference in the overall running type:token ratio does indeed decrease, but only so far: it halves (20% down to 10%) and then stabilises.  The introduction rate, on the other hand, is all over the place -- there's no identifiable trend at all.  Even subsampling my data didn't give any clear and understandable trends (and since I'm using a desktop office package for my analysis, it's a bit of a faff to do the resampling automatically -- it's just further proof that I need to get myself familiar with the statistical analysis tools for Python (eg numpy), but my head's full with the NLTK stuff for now, so I'll leave the improved statistical stuff for another time).  Here are the same graphs, but with 2000 word samples instead of 500 word samples:

So not promising, really.  Still no stable, identifiable trends.
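For what it's worth, the resampling itself is only a couple of lines once the counts are back in Python rather than the spreadsheet -- a minimal sketch, assuming a plain list of per-sample new-type counts taken every 500 words:

def resample(counts, factor):
    # merge consecutive samples, eg factor=4 turns 500-word samples into 2000-word samples
    return [sum(counts[i:i + factor]) for i in range(0, len(counts), factor)]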

Books as a series
But I had all the infrastructure in place now, so I figured I might as well rerun the analysis on the 3 books as a single body and see what came out.  Let's just go straight to the relative difference between the lines for all words and eliminating all words not entirely in lower case:
Oooh... now where did I leave those figures on where the individual books started...?  44625 and 152034, and there's a notable period of high difference (20-30%) from about 45000 words, and that massive spike you see on the graph -- which is actually a 63.64% difference -- occurs from 152000-152500.

Bingo: we've got decent support for Thrissel's suggestion that a lot of proper nouns are introduced early on in... at least some novels.

Not the sort of information I was originally looking for, but actually quite interesting.  It's kind of turning the project in a slightly different direction than I had planned.  I'll just have to go with the flow.

What I did wrong today
One of the minor irritations of the day was when I started writing up my results, and after having done the coding, data generation and analysis, I realised a fairly simple refinement I could have made.  It was a real *palmface* moment: I could have simply taken my first list of candidate proper nouns and eliminated any candidates that also appeared completely in lower case.  Having done that, I would have been left with a much shorter list of candidates, and it may well have been worth my time manually checking the results.

>sigh<

But of course, that's as much the point of the exercise as anything: to work through the process and the problems and to start thinking about what can be done better.

It also occurs to me now that I managed to eliminate every single occurrence of the word "I" from the books!  Quite a fundamental error, even if it only made a minute difference to the final ratios.

Perhaps I'm being a little too "hacky" in all this.  I'll have to pick up my game a bit soon....

28 May 2012

Authentics: long v short - pt 4

Well, I had a nice weekend and visited some friends in Edinburgh for a wedding.  The weather's too good to spend too much time inside, so I'll just write up a few more tests then go and enjoy the sunshine.

Multiple books in a series
Today's figures come from two of the books I've already mentioned -- John Buchan's The 39 Steps and Greenmantle, and the next book in the series: Mr Standfast.

Again, the graphs at different sample sizes show different parts of the dataset more clearly than others:
The graph at 1000 words is too unstable to clearly identify the end of the first book, and the start of the second book is only identifiable because the introduction rate is greater than the running ratio for the first time in any of my tests.

The 2500 and 5000 word samples give us a clear spike for the end of the first book, but the end of the second book is obscured slightly by noise, and becomes clearer again in the 10000 and 25000 word sample sizes, although the end of the first book is completely lost by the time we reach the 25000 word sample graph.

Having done all that, I went back and verified that the peaks matched the word counts -- The 39 Steps is 44625 words long, and Greenmantle is 107409 words long, so it ends with the 152034th word.  The peaks on the graphs all occur shortly after 45000 and 150000.

It was the first graph, from the 1000 word sample set, that piqued my curiosity.  Having spotted the two lines intersecting at the start of the third book, I decided to check the difference between the running type:token ratio and the introduction rate, and I graphed that.  At 1000 word samples, there was still too much noise:
However, given that I already knew what I was looking for, I could tell that it showed useful trends, and even just moving up to 2500 word samples made the trends pretty clear:
Going forward, I need to compare the difference between books in a series, books by the same author (but not in a series) and unrelated books, and I believe that the difference between the running type:token ratio and the introduction rate may be the best metric to use in comparing the three classes.
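In code terms that metric is nothing fancy -- here's a rough sketch of how I'd compute it from the running totals (the [word count, running types] pairs my sampling code spits out), assuming a fixed sample size:

def ratio_minus_introduction(stats, step_size):
    # stats: [[word_count, running_types_total], ...] sampled every step_size words
    diffs = []
    prev_types = 0
    for word_count, types_total in stats:
        running_ratio = types_total / word_count
        introduction_rate = (types_total - prev_types) / step_size
        diffs.append([word_count, running_ratio - introduction_rate])
        prev_types = types_total
    return diffs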

Problems with today's results
I can't rule out that the spikes in new language at the start of each book are heavily influenced by the volume of proper nouns, as Thrissel suggested, so I'm probably going to have to make an attempt at finding a quick way of identifying them.  The best way of doing this would be to write a script that identifies all words with initial caps in the original stream, then asks me whether or not each one is a proper noun.
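A bare-bones version of that script could be as simple as this (just a sketch -- I haven't actually written it yet):

def classify_candidates(candidates):
    # ask about each capitalised type in turn; return the ones confirmed as proper nouns
    proper_nouns = set()
    for word in sorted(candidates):
        answer = input("Is '%s' a proper noun? (y/n) " % word)
        if answer.strip().lower().startswith("y"):
            proper_nouns.add(word)
    return proper_nouns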

By treating the three books as one continuous text in the analysis, it looks like I've inadvertently smoothed out the spike somewhat at the start of each book.  In future I should make sure individual samples are taken from one book at a time so that the distinction is preserved.
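That fix is straightforward enough -- a sketch of what I have in mind, where each book is sampled on its own but the set of known types keeps running across the whole series (the function name is just illustrative):

def collect_stats_by_book(book_token_lists, step_size):
    # sample each book separately so no sample straddles a book boundary,
    # but keep one running set of types across the whole series
    seen, results, offset = set(), [], 0
    for tokens in book_token_lists:
        start = 0
        while start < len(tokens):
            end = min(start + step_size, len(tokens))
            seen.update(w.lower() for w in tokens[start:end])
            results.append([offset + end, len(seen)])
            start = end
        offset += len(tokens)
    return results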

25 May 2012

Authentics: long vs short part 3

Excuse the slight change of title -- I figured the original long title was probably getting truncated in people's feeds, so I wanted to abbreviate it.  If you've been following my blog recently, you should have already seen my previous two posts on my little project: I'm trying to investigate whether my usual advice -- that long fiction (novels or TV serials) is better for the learner than short fiction (short stories and feature films) -- actually holds up.

Sample sizes
Anyway, as I said last time, I wanted to start comparing a fixed length of text, rather than variable-length chapters, as my benchmark.  I was looking for a sampling length that would give a clear picture of the overall progression without having too much interference from little local fluctuations.  My first set of results suggests that this is a fool's errand.  The following set of images shows the graphs for the novel Greenmantle by John Buchan, with samples taken every 1000, 2500, 5000 and 10000 words.
While using larger samples gives a much smoother line, it also unfortunately obliterates some of the most important detail in the graph, in that we start to lose the steep drop at the start -- that's information that's really crucial to my investigation, so I'll have to put up with various humps and wiggles in the line for now.  However, that's not to say that the other graphs aren't interesting in and of themselves -- the little hump at around 50000-60000 words in the 5000 word sample version suggests that something important may be happening at this point in the story, causing a batch of new vocabulary to be introduced, or perhaps the introduction of a new character with a different style of speech.  Anyway, as interesting as that may be, it would be a diversion from the matter at hand.

Alternatively, I could move away from using linear sampling/projections and start charting on a logarithmic or exponential scale.  Now would be a good time to start refreshing my memory on that sort of statistical analysis, but it also risks diverting me from the task at hand -- and I'm currently following the Coursera.org machine learning course, so I should be able to get the computer to do the work itself in a few weeks anyway.  Besides, I've still not got myself a high-frequency word list, and the pattern might be completely different once I've eliminated common words of English from the equation.

So for now I'll stick to working with multiple sample sizes.  I'll admit to being a bit simplistic in my approach to this so far, as I ran my little Python program once for every sample size, rather than just running it once with the smallest sample size then resampling the data.

The program I'm using at the moment is pretty straightforward:

def collect_stats(token_list, step_size):
    i = 0
    return_array = []
    while i < len(token_list):
        i += step_size
        # running total of types from the start of the text up to word i
        running_types_total = number_of_types(token_list[:i])
        if i < len(token_list):
            return_array.append([i, running_types_total])
        else:
            # final sample: clamp the word count to the length of the text
            return_array.append([len(token_list), running_types_total])
    return return_array
This takes an NLTK token list (it would work with any simple list of strings too, though) and the size of samples to be taken, then builds up a list of lists [[a1,b1],[a2,b2],...] where each a is the number of the last word included in the sample, and each b is the number of unique tokens from the beginning of the text to the ath word.

The number_of_types function just returns len(set(w.lower() for w in token_list)).

This means that at every stage I have a running total of tokens, and it's only when I want to produce a graph that I calculate the number of new tokens in the given slice (= b(n) - b(n-1)), and there's therefore no reason why I can't skip several slices to decrease my sampling rate (eg b(n) - b(n-3)).
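To make that concrete, here's roughly what the graph-preparation step looks like (a sketch only -- at the moment I'm actually doing this bit in the spreadsheet):

def new_types_per_slice(stats, skip=1):
    # stats: the [[word_count, running_types_total], ...] list from collect_stats
    # skip: how many slices to jump, eg skip=3 gives b(n) - b(n-3)
    return [[stats[n][0], stats[n][1] - stats[n - skip][1]]
            for n in range(skip, len(stats), skip)]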

Next up
I've taken a running sample of three books from the same series -- The 39 Steps, Greenmantle and Mr Standfast, and run them through as one text, so I'll look at the output of that next, but I don't think it'll be much use until I've got something to compare with -- either/both of: a selection of novels by one author that aren't a series; and a selection of novels by different authors.

24 May 2012

Authentic materials for learners: long form or short form? (pt 2)

Well, as I was saying in part I, I've always claimed novels are easier for the learner than short stories, and I was wanting to back up my claims with some figures.  So for my initial investigation I fired up my copy of the free TextSTAT package and away I went.

I was talking about the type:token ratio last time, and that seemed as good a place as any to start.  I managed to skip the logical first step, which would have been to compare a novel and a short story, but I'll have to come back to that later.

What I started with was an Italian novel, but I got figures that were too high to be useful.  One of the problems with languages such as Italian is that they write some of their clitics in the same written word as the main word (eg "to know (someone)" -> conoscere; "to know me" -> conoscermi), increasing the type:token ratio significantly.  You've also got the problem that it has verb conjugations and it drops subject pronouns in most situations.  Overall, Italian (and Spanish and Catalan, among others) would be a bad choice for a demonstration language.  Today, I'm using English as it's a very isolating language -- the only common inflections are past-tense -ed, third-person-present -s and plural -s.  This makes it easy to get a reasonably accurate measure of the lexical variety without any clever parsing.  I will most likely use French at some point too, because while it is not as straightforward as English in that sense (it's got a lot of verb conjugation going on), it doesn't have the same clitics problem as Italian, and the French don't drop their pronouns.
Today's findings: 1 - running ratios
I decided to look at how the type:token ratio changes as a text proceeds.  I wanted to measure this chapter by chapter, counting the types and tokens in chapter 1, then loading the second chapter into the concordance and checking the type:token ratio for chapters 1 & 2 combined, then 1, 2 & 3 etc.  I realised, however, that it would be more efficient to load all chapters into memory at the same time and work down from the other end: all chapters, then close the last chapter and take the figures again, then close the second last chapter and take the figures again.
In the end, I got a nice little graph (using LibreOffice) that showed a marked tendency to decreasing type:token ratio as the books progressed:
The x-axis shows the chapter number, the y-axis shows the type:token ratio (remember, this is the type:token ratio for the entire book up to and including the numbered chapter).  Notice how the type:token ratio halves by around the 6th or 7th chapter.

So by one measure, the longer the novel is, the easier it would appear to be.
Today's findings: 2 - introduction rates
I figured I could go a bit deeper into this without generating any new data.  What I wanted to look at now was how much new material was introduced in each chapter -- ie. a ratio of new types to tokens.  It's easy enough to do -- I could obtain the number of new types in any given chapter by subtracting the running total at the previous chapter from the running total at the current chapter.
The graph I got was even more interesting than the last:
While the running ratio halves after 6 or 7 chapters, the introduction rate halves after only 2-4!  It certainly looks like each chapter will on average be easier than the last.
One curious feature is the large uptick at the end of the children's novel Laddie (green).  This illustrates one quirk that the learner should always bear in mind: kids' books are often actually more complicated linguistically than adults' books, as the author on some level seeks to educate or improve the person reading.  The author of this book seems to have kept the language consistently simple through most of the book but, realising he was coming to the end, crammed in as much complexity as possible.

Another curious feature is that the figures claim no new vocabulary is introduced in the fourth chapter of The 39 Steps (yellow).  While this is theoretically possible, it's more likely that it's ...ahem... experimenter error, which a quick look at the actual figures verifies: chapters 3 and 4 are listed in my output as being exactly the same length, which is more than a little unlikely.  It looks like I loaded the same chapter twice...
Further analysis
Notice that in both graphs, the figures are the same at chapter one.  This is to be expected, as every type encountered in the first chapter is encountered for the first time in the book (by definition).

So what happens if we stick the running ratio of type:token against the introduction rate of new types?

This:
So while the overall type:token ratio continues to fall notably from the 10th to the 20th chapter, suggesting decreasing difficulty, the introduction rate gets fairly erratic by around the 10th chapter (despite still tending downwards), so perhaps there is a limit after which it is not safe to assume that each chapter is easier than the last.

Perhaps the measure of efficiency is related to the difference between the running ratio and the introduction rate, and once that gap starts to narrow, there is no advantage?

Problems with today's findings
This was a first exploratory experiment, so I didn't conduct it with a whole lot of rigour.  Here are the main factors affecting today's results:
  1. I didn't eliminate common words -- it is impossible to see from the figures I have how many of the types introduced at any stages are ones we would expect learners to know already and how many will be genuinely new to them.
  2. When examining Pride and Prejudice and The 39 Steps, I hadn't told the concordancer to ignore case, so anything appearing at the start of a sentence and in the middle would be counted as two types -- eg that and That.  (It was the first time I'd used TextSTAT and I hadn't realised it defaulted to case-sensitive -- I won't make that mistake again.)
  3. The length of chapters varies significantly from book to book, and even from chapter to chapter within books, so the lines are not to scale with each other, and each individual line is not on a continuous scale with itself.  The graphs, though presented as lines, are arguably not true line graphs, as the data points come from samples of arbitrary and uneven size.
Accounting for these problems in the future
  1. There are plenty of frequency lists on the net, so I'll be able to eliminate common words without any real difficulty.
  2. The case sensitivity issue, now that I'm aware of it, will not be a problem.
  3. When I ran the initial data, I was using TextSTAT as my installation of Python and NLTK was playing up (I had too many different versions of Python installed, and some of the shared libraries were conflicting).  I've now got Python to load NLTK without problems, so I can do almost any query I want.  Future queries will be sampled regularly after a specific number of words.
Experiments to carry out
At some point I'm going to want to go back and compare short stories with novels, but for now I'm going to head a little further down the path I'm on.

My first task is to work out a decent sampling interval: every 1000 words? 5000? 10,000? 50,000?  I'll run a few trials and see what my gut reaction is -- that should be the next post.  (It might even prove that the chapter is the logical division anyway -- after all, it divides subjects, which would indicate different semantic domains...)
I also want to look at what happens when we look at sequels after each other.  Those of you familiar with John Buchan will notice that I've included such a pair as individual novels here -- The 39 Steps and Greenmantle.  I might include initial findings from this next time, as they'll determine my next step.
After this I'll either move on to looking at more pairs of original book + sequel (to look for a generalisable pattern), looking at longer serieses of books (to see if they get continually easier) or comparing book-and-sequel to two different books from the same author (to see if any perceived benefits from reading a book and its sequel are just coincidence and really only because of the author).

Caveat emptor
Remember, though, that this little study is never going to be scientifically rigorous, as I don't really currently have the time to deal with the volume of data required to make it truly representative.  However, it's nice to think how big a job this would have been before computers made this sort of research accessible to the hobbyist.  Many thanks to the guys who wrote the various tools I'm using -- your work is genuinely appreciated.

23 May 2012

Authentic materials for learners: long form or short form? (part I)

Anyone who gets sufficiently far on in learning a language is going to want to start reading, watching or listening to materials intended for natives.  This is what a lot of language teachers refer to as "authentic materials".  (Now, an alarming amount of so-called authentic materials in the classroom are actually very heavily doctored, but that's not the sort of authentic materials I'm going to talk about today.)

I have often claimed that it is better for the learner to use longer materials than short materials (eg when discussing films vs TV serieses in a previous post).  This wasn't an idea I came up with myself, but advice I'd been given when I was at high school, although I can't remember who first suggested it to me.

Anyway, I was told at the time that the first one or two hundred pages of a novel contain most of the language (in terms of grammar, vocabulary and turn-of-phrase) that will occur in the entire book.  It therefore follows that the first two hundred pages of any work are the most difficult, and therefore the longer the book is, the easier the ending will be, because you won't be confused by the language.  This also means that the book is acting as active revision, and that by the end of the book, you will have learnt most of the major vocabulary in it.

A 50 page short story would intuitively sound easier to read than a novel, but this isn't really the case, because you're dealing with something that is going to be littered with new words on every single page.

And what about a piece of flash fiction?  Realistically, we're not going to expect much repetition at all.  Compare with the short extracts of authentic works printed in many classroom language textbooks -- none of the "content words" that are specific to the story are likely to be repeated at all, so they will be looked up by the reader, then promptly forgotten about.

Although I was told this about reading books (as opposed to short stories), I believe this holds for any form of literature, fiction or non-fiction, regardless of medium.

A half-hour documentary will be self-reinforcing in a way that a 4 minute news report on the same topic won't be (on TV or radio).  An 8 hour long TV series will similarly reinforce its language more than an 80 minute feature film, or particularly a 10 minute short film.

Anyway, I've been repeating this advice for years, and I've always said that my experience backs it up. Well, in the little gap between finishing my Gaelic course and starting my next job, I was wanting to do a little work with corpus analysis software and it occurred to me that this would be a great little exercise to get me back into the swing of things, so I downloaded several resources: TextSTAT, a concordancer package written in Python at the Free University of Berlin; AntConc, a Linux/Mac/Windows concordancer by Laurence Anthony at Waseda University in Japan; and the Natural Language Toolkit (NLTK) for Python, which will allow me to write more flexible, custom queries on my data.

Type:token ratio
One of the most basic statistical measures of diversity in a text is the so-called "type:token ratio".  The number of "tokens" in a text is the number of individual words, the number of "types" is the number of different word forms.

For example, the phrase "the cat chased the dog" has 5 tokens, but only 4 types, because "the" is only counted once when determining the number of types.

Or again, "I told the woman to tell the man to tell you" has 11 tokens (11 words in the sentence), but as "the", "to" and "tell" occur twice each, there are only 8 types in the sentence.

The type:token ratio is exactly what you'd expect if you're at all familiar with statistics: the number of types divided by the number of tokens.  In the first example, we have 4:5 = 4/5 = 0.8 (or 80%) and in the second we have 8:11 = 0.727272... (roughly 73%).
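If you want to play along at home, the whole calculation is a couple of lines of Python (a quick sketch; here I'm just splitting on spaces rather than doing proper tokenisation):

def type_token_ratio(text):
    tokens = text.lower().split()
    return len(set(tokens)) / len(tokens)

print(type_token_ratio("the cat chased the dog"))                        # 0.8
print(type_token_ratio("I told the woman to tell the man to tell you"))  # 0.7272...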

Notice how the type:token ratio on the longer sentence is lower than that on the shorter sentence -- in this case it's a matter of my choice of words, but as a general rule, type:token ratio decreases with the length of text examined, which only goes to justify the advice of favouring long-form over short-form materials for the learner.

However, that's still to be proven in practice.

So over the next few weeks, I'll be experimenting with a bunch of public domain texts from Project Gutenberg.  I'll be trying to investigate the basic premise of whether long-form fiction is intrinsically easier than short-form, then investigating whether this extends to reading several books by the same author as opposed to books by different authors, and how much of a difference it makes whether these books are part of a series or individual stories.

The size of this study is going to be very small, as the main goal for me is simply to gain a better understanding of the technology and to reason through the process of designing logically sound research in a corpus, so the conclusions won't be scientific proof of anything, but it will hopefully be interesting (to me at least).

If you're aware of any research that covers the areas I'm looking at, please feel free to drop a reference in the comments, and if you have anything to add or suggest, I'm all ears.

17 May 2012

The dangers of overinterpretation and oversimplification

In one of the forums recently, I came across another person asking the eternal question: is there a "talent" for learning languages?  In other words, are some people just better at learning languages than others?

One of the ideas that came up in the thread was the notion of the "blank slate" theory of psychology, which was essentially taken for granted for a long time, but is increasingly being shown to be overly simplistic through the identification of specialisation and variation in human brains.

The most famous work on this is Steven Pinker's book, which I have never read.  I have, however, watched his TED talk:


Now, setting aside the fact that most TED talks are (as far as I am aware) made by people who have paid a lot of money to stand up on that stage, and are therefore often little more than adverts for books, it is an interesting little talk.
One problem I have with it, though, is its failure to comment on the multiple dimensions of "blank slate", because the notion of a blank slate can mean two different things.
First of all, there is the classical notion, as put forth by David Hume, of the brain as a content-agnostic calculation machine, with the human "mind" being merely a collection of learned responses to stimuli.

This theory is thoroughly discredited by modern neuroscience as it is very clear that certain regions of the brain are very specialised, and as scanning technology improves, we're finding more and more specialisation at a finer and finer level.

The second notion is of individual differences -- he talks about the "blank slate" idea as being fashionable egalitarianism: we're all born "the same".  But this isn't the blank slate per se, because there is nothing to say that one blank slate can't be bigger than another, or easier to read than another, or quicker to write on than another.
The "blank slate" he's really talking about is something of a chimera -- he's accepting brain region specialisation as a given, but treating the individual regions as though these were slates.  It's something of a rather confused strawman, but it's deceptively appealing.
So yes, there is brain specialisation and yes, there are individual differences.  This shouldn't be controversial, so really the controversy must be in his extrapolations.  This makes for entertaining pop science, but as a scientific thesis intended to convince others, it's sorely lacking.

The biggest problem I have with Pinker's talk is its failure to state very clearly that its findings are irrelevant and inapplicable in any practical context.  Pinker's conclusions relate to the study of the brain, and it is important that future research accounts properly for the issues he discusses.  However, while the differences in individual brains exist and many abilities are strongly affected by heredity, we have no way of identifying and measuring these inherited differences beforehand.  Hermann Einstein's work as an electrification engineer may give some inkling of his son Albert's future career as a theoretical physicist, but what does it tell you about his daughter Maja's study of Romance languages and literature?

But that wouldn't sell a pop science book.  People want to read these books and then feel like they're part of this "new" knowledge.  They want to go out and apply the theory straight away.  They conclude that they're good at what they're good at simply because "the slate isn't blank".  But Pinker has never once claimed to be able to identify with certainty which abilities arise due to culture and peer group, which he says are important factors in the video itself.  If he can't, how can we? 

For now we must simply get on with teaching the best way possible, rather than trying to second guess the hows and whys.

Edit: How the hell did I manage to describe the father of relativity as a "chemist"?  I think I might be going senile already....