Last week, edX made a splashy spectacle of an announcement about automated essay grading, leaving educators fuming. Let’s rethink their claims.
“Give Professors a break,” the New York Times suggested in this joint press release from edX, Harvard, and MIT. The breathless story weaves a tale of robo-professors taking over the grading process, leaving professors free to kick back their feet and take a nap, and subsequently inviting universities, ever-focused on the bottom-line, to fire all the professors. If I had set out to write an article intentionally provoking fear, uncertainty, and doubt in the minds of teachers and writers, I don’t think I could have done any better than this piece.
Anyone who’s seen their claims published in science journalism knows that the popular claims bear only the foggiest resemblance to academic results. It’s unclear to me whether the misunderstanding is due to edX intentionally overselling their product for publicity, or if something got lost in translation while writing the story. Whatever the cause, the story was cocksure and forceful about auto-scoring’s role in shaping the future of education.
I was a participant in last year's ASAP competition, which served as a benchmark for the industry; the primary result of this, aside from convincing me to found LightSIDE Labs, is that I get email; a lot of email. I've been told that automated essay grading is both the salvation of education and the downfall of modern society. Naturally, I have strong opinions about that, based both on my experience with developing the technology and participating in the contest, as well as in the conversations I've had since then.
Before we resign ourselves to burning the AI researchers at the stake, let's step back for a minute and think about what the technology actually does. Below, I’ve tried to correct the most common fallacies I’ve seen coming both from articles like the edX piece, as well as the incendiary commentary that it provokes.
Nothing will ever puzzle me like the way journalists require machine learning to behave like a human. When we talk about machine learning “reading” essays, we’re already on the losing side of an argument. If science journalists continue to conjure images of robots in coffee shops poring over a stack of papers, it will seem laughable, and rightly so.
To read an essay well, we’ve learned for our entire lives, you need to appreciate all of the subtleties of language. A good teacher reading through an essay will hear the author’s voice, to look for a cadence or rhythm in writing, to appreciate the poetry in good responses to even the most banal of essay prompts.
LightSIDE doesn’t read essays – it describes them. A machine learning system does pour over every text it receives, but it is doing what machines do best – compile lists and tabulate them. Robotically and mechanically, it is pulling out every feature of a text that it can find, every word, every syntactic structure, and every phrase.
If I were to ask whether a computer can grade an essay, many readers will compulsively respond that of course it can’t. If I asked whether that same computer could compile a list of every word, phrase, and element of syntax that shows up in a text, I think many people would nod along happily, and few would be signing petitions denouncing the practice as immoral and impossible.
Take a more blatantly obvious task. If I give you two pictures, one of a house and one of a duck, and asked you to find the duck, would you be able to tell the two apart?
Let’s be even more realistic. I give you two stacks of photographs. One is a stack of 1,000 pictures of the same duck, and one is a stack of 1,000 pictures of the house. However, they’re not all good pictures. Some are zoomed out and fuzzy; others are way too small, and you only get a picture of a feather or a door handle. Occasionally, you’ll just get a picture of grass, which might be either a front lawn or the ground the duck is standing on. Do you think that you could tell me, after poring over each stack of photographs, which one was a pile of ducks? Would you believe the process could be put through an assembly line and automated?
Automated grading isn’t doing any more than this. Each of the photographs in those stacks is a feature. After poring over hundreds or thousands of features, we’re asking machine learning to put an essay in a pile. To a computer, whether this is a pile of ducks and a pile of houses, or a pile of A essays and a pile of C essays, makes no difference. The computer is going to comb through hundreds of features, some of them helpful and some of them useless, and it’s going to put a label on a text. If it quacks like a duck, it will rightly be labeled a duck.
This is the assumption everyone makes about automated grading. Computers can't feel and express; they can only robotically process data. This inevitably must lead to stamping out any hint of humanity from human graders, right?
Well, no. Luckily, this isn't a claim that the edX team is making. However, by not addressing it head-on, they left themselves (and, by proxy, me, and everyone else who cares about the topic) open to this criticism, and haven't done much to assuage people's concerns. I'll do them a favor and address it on their behalf.
Go back to our ducks and houses. As obvious as this task might be to a human, we need to remember once again, that machines aren’t humans. Presented with this task with no further explanation, not only would a computer do poorly at it; it wouldn’t be able to do it at all. What is a duck? What is a house?
Machine learning starts at nothing – it needs to be built from the ground up, and the only way to learn is by being shown examples. Let's say we start with a single example duck and its associated pile of photographs. There will be some pictures of webbed feet, an eye, perhaps a photograph of some grass. Next, a single example house; its photographs will have crown molding, a staircase; but it'll also have some pictures of grass, and some photographs might be so zoomed in that you can't tell whether you're looking at a feather or just some wallpaper.
Now, let's find many more ducks, and give them the same glamour treatment. The same for one hundred houses. The machine learning algorithm can now start making generalizations. Somewhere in every duck's pile, it sees a webbed foot , but it never saw a webbed foot in any of the pictures of houses. On the other hand, many of the ducks are standing in grass, and there's a lot of grass in most house's front lawns. It learns from these examples - label a set of photographs as a duck if there's a webbed foot, but don't bother learning a rule about grass, because grass is a bad clue for this problem.
This problem gets to be easy rather quickly. Let's make it harder and now say that we're trying to label something as either a house or an apartment. Again, every time we get an example, the machine learning model is given a large stack of photographs, but this time, it has to learn more subtle nuances. All of a sudden, grass is a pretty good indicator. Maybe 90% of the houses have a front lawn photographed at one point or another, but since most of the apartments are in urban locations or large complexes, only one out of every five has a lawn. While it's not a perfect indicator, that feature suddenly gets new weight in this more specific problem.
What does this have to do with creativity? Let's say that we've trained our house vs. apartment machine learning system. However, sometimes there are weird cases. My apartment in Pittsburgh is the first floor of a duplex house. How is the machine learning algorithm supposed to know about that one specific new case?
Well, it doesn't have to have matched up this exact example before. Every feature that it sees, whether it's crown molding or picket fence, will have a lot of evidence backing it up from those training examples. Machine learning isn't a magic wand, where a one-word incantation magically produces a result. Instead, all of the evidence will be weighed and a decision will be made. Sometimes, it'll get the label wrong, and sometimes even when it's the "right" decision, there'll be room for disagreement. But unlike most humans, with a machine learning system we can point to exactly the features being used, and recognize why it made that decision. That's more than can be said about a lot of subjective labeling done by humans.
All of the same things that apply to ducks, houses, and apartments apply to essays that deserve an A, a B, or a C. If a machine grading system is being asked to label essays with those categories, then machine learning will start out with no notion of what that means. However, after many hundreds or thousands of essays are exhaustively examined for features, it’ll know what features are common in the writing that teachers graded in the A pile, in the B pile, and in the C pile.
When a special case arrives, an essay that doesn't fit neatly into the A pile or the B pile, we'd have no problem admitting that a teacher has to make a judgment call by weighing multiple sources of evidence from the text itself. Machine learning learns to mimic this behavior from teachers. For every feature of a text - conceptually no different from poring over a stack of photographs of ducks - the model checks whether it has observed similar features from human graders before, and if so, what grade the teacher gave. All of this evidence will be weighed and a final grade will be given. What matters, though, might not be the final grade - instead, what matters is the text itself, and the characteristics that made it look like it deserved an A, or a C, or an F. What matters is that the evidence used is tied back to human behaviors, based on all the evidence that the model has been given.
Every time I talk to a curious fan of automated scoring, I'm asked, "What are the features of good writing? What evidence ought to be used?" This question flows naturally, but the easy answers are thoughtless ones. The question is built on a bad premise. Yes, there are going to be some features that are true in almost all good writing, with connective vocabulary words and transition function words at the start of paragraphs. These might be like webbed feet in photos of ducks - we know they'll always be a good sign. Almost always, though, the weight of any one feature depends on the question being asked.
When I work with educators, I recommend not just that they collect several hundred essays. I ask that they collect several hundred essays, graded thoroughly by trained and reliable humans, for every single essay question they intend to assign. This unique set allows the machine learning algorithm to learn not just what makes "good writing" but what human graders were using to label answers as an A essay or a C essay in that specific, very targeted domain.
This means that we don't need to learn a list of the most impressive-sounding words and call it good writing; instead, we simply need to let the machine learning algorithm observe what humans did when grading those hundreds of answers to a single prompt.
Take, as an example, the word "homologous." Is an essay better if it uses this word instead of the word "same"? In the general case, no; I dare anyone to collect a random sampling of 1,000 essays and show me a statistical pattern that human graders were more likely to grade a random essay a higher grade if they were to make that swap. It's simply not how human teachers behave, it won't show up in statistics, and machine learning won't learn that behavior.
On the other hand, let's say an essay is asking a specific, targeted question about the wing structure of birds, and the essay is being used in a college freshman-level course on biology. In this domain, if we were to collect 1,000 essays that have been graded by professors, a pattern is likely to emerge. The word "homologous" will likely occur more often in A papers than C papers, based on the professors' own grades. Students who use the word "homologous" in place of the word "same" have not singularly demonstrated, with their mastery of vocabulary, that they understand the field; however, it's one piece of evidence in a larger picture, and it should be weighted accordingly. So, too, will features of syntax and phrasing, all of which will be used as features by a machine learning algorithm. These features will only be given weight in machine learning's decision-making to the extent that they matched the behavior of human graders. By this specialized process of learning from very targeted datasets, machine learning can emulate human grading behavior.
However, this leads into the biggest problem with the edX story.
Machine learning is hard. Getting it right takes a lot more help at the start than you think. I don’t contact individual teachers about using machine learning in their course, and when a teacher contacts me, I start out my reply my telling them they’re about to be disappointed.
The only time that it benefits you to grade hundreds of examples by hand to train an automated scoring system is when you’re going to have to grade many hundreds more. Machine learning makes no sense in a creative writing context. It makes no sense in a seminar-style course with a handful of students working directly with teachers. However, machine learning has the opportunity to make massive in-roads for large-scale learning; for lecture hall courses where the same assignment is going out to 500 students at a time; for digital media producers who will be giving the same homework to students across the country and internationally, and so on.
It’s dangerous and irresponsible for edX to be claiming that 100 hand-graded examples is all that’s needed for high-performance machine learning. It's wrong to claim that a single teacher in a classroom might be able to automate their curriculum with no outside help. That’s not only untrue; it will also lead to poor performance, and a bad first impression is going to turn off a lot of people to the entire field.
Look at what I’ve just described. Machine learning gives us a computer program that can be given an essay and, with fairly high confidence, make a solid guess at labeling the essay on a predefined scale. That label is based on its observation of hundreds of training examples that were hand-graded by humans, and you can point to specific, concrete features that it used for its decision, like seeing webbed feet in a picture and calling it a duck.
Let’s also say that you can get that level of educated estimation instantly – less than a second – and the cost is the same to an institution whether the system grades your essay once or continues to give a student feedback through ten drafts. How many drafts can a teacher read to help in revision and editing? I assure you, fewer than a tireless and always-available machine learning system.
We shouldn’t be thinking about this technology as replacing teachers. Instead, we should be thinking of all the places where students can use this information before it gets to the point of a final grade. How many teachers only assign essays on tests? How many students get no chance to write in earlier homework, because of how much time it would take to grade; how many are therefore confronted with something they don’t know how to do and haven’t practiced when it comes time to take an exam that matters?
Machine learning is evidence-based assessment. It's not just producing a label of A, B, or F on an essay; it's making a refined statistical estimation of every single feature that it pulls out of those texts. If this technology is to be used, then it shouldn't be treated as a monolithic source of all knowledge; it should be forced to defend its decisions by making its assessment process transparent and informative to students. This technology isn’t replacing teachers; it’s enabling them to get students help, practice, and experience with writing that the education field has never seen before, and without machine learning technology, will never see.
"Can machine learning grade essays?" is a bad question. We know, statistically, that the algorithms we've trained work just as well as teachers for churning out a score on a 5-point scale. We know that occasionally it'll make mistakes; however, more often than not, what the algorithms learn to do are reproduce the already questionable behavior of humans. If we're relying on machine learning solely to automate the process of grading, to make it faster and cheaper and enable access, then sure. We can do that.
But think about this. Machine learning can assess students' work instantly. The output of the system isn't just a grade; it's a comprehensive, statistical judgment of every single word, phrase, and sentence in a text. This isn't an opaque judgment from an overworked TA; this is the result of specific analysis at a fine-grained level of detail that teachers with a red pen on a piece of paper would never be able to give. What if, instead of thinking about how this technology makes education cheaper, we think about how it can make education better? What if we lived in a world where students could get scaffolded, detailed feedback to every sentence that they write, as they're writing it, and it doesn't require any additional time from a teacher or a TA?
That's the world that automated assessment is unlocking. edX made some aggressive claims about expanding accessibility because edX is an aggressive organization focused on expanding accessibility. To think that's the only thing that this technology is capable of is a mistake. To write the technology off for the audacity of those claims is a mistake.
In my next few blog posts, I'll be walking through more of how machine learning works, what it can be used for, and what it might look like in a real application. If you think there are specific things that ought to be elaborated on, say so! I'll happily adjust what I write about to match the curiosities of the people reading.