%%%%%%%%%%%%%%%
% GENERAL POINTS
1. Task page:
http://alt.qcri.org/semeval2014/task5/
2. Github repo:
https://github.com/iucl/l2-writing-assistant
3. SemEval Schedule
x Task proposals due September 15, 2013
x Tasks chosen/merged October 15, 2013
x Trial data ready October 30, 2013
Training data ready December 15, 2013
Test data ready March 10, 2014
Evaluation start March 15, 2014
Evaluation end March 30, 2014
Paper submission due April 30, 2014 [TBC]
Paper reviews due May 30, 2014 [TBC]
Camera ready due June 30, 2014 [TBC]
SemEval workshop August 23-24, 2014
%%%%%%%%%%%%%%%
% LEARNER CORPORA
%%%%%
On miller (nlp.indiana.edu):
1) /Volumes/Data/en/NUCLE/
* NUS Corpus of Learner English
* error-tagged
* 1,400 essays written by university students at the National University of Singapore on a wide range of topics
2) /Volumes/Data/en/icle/
* International Corpus of Learner English
*
3) /Volumes/Data/en/EFCAMDAT/
* EF-CAMbridge open language DATabase (EFCAMDAT):
* http://corpus.mml.cam.ac.uk/efcamdat/
* "The resource currently contains over 30 million words in written assignments across 16 teaching levels"
%%%%%
Learner corpora around the world:
http://www.uclouvain.be/en-cecl-lcworld.html
- The GICLE corpus (German component of ICLE)
%%%%%%%%%%%%%%%
% THOUGHTS/QUESTIONS
1. When learners code-switch, are they more likely to switch whole
phrases, or do they make a good number of word-for-word switches?
- Hunch: the L2 is more likely to be word-for-word, so translating
back to the L1 may be easier, in some sense.
2. When do we need to register?
http://alt.qcri.org/semeval2014/index.php?id=registration-1
3. Which of these NUCLE errors are most important to analyze? Would
it make a difference as to what type of code-switching/error we
have? (e.g., collocational vs. idiomatic vs. single-word?)
4. Similar questions for EFCAMDAT ...
5. How can we treat ourselves like a bilingual dictionary? That is
what learners are lacking, and that is our role (so to speak). But
there are complications to being a simple dictionary:
a) there are many possible word/phrase senses
b) learners may not know exactly what word/phrase they need in the
L2, i.e., they may have misused some word or other (e.g., just
because I say "run down" doesn't mean that's the phrase that's
needed in the L2)
c) there may be other errors in the context
Q: would it help to view the dictionary as corrupt and perhaps to
derive it, in part, from learner mistakes? e.g., we could fake
transfer errors and then use those to train some type of system?
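A rough sketch of the "fake transfer errors" idea: take a clean L2
sentence, swap one phrase for its L1 equivalent, and keep the original
L2 phrase as the target a system would have to recover. Everything here
is hypothetical: the toy English->German entries, the function name
inject_l1_fragment, and the whitespace tokenization are assumptions for
illustration, not part of the task data or of any existing resource.

```python
import random

# Hypothetical toy L2 (English) -> L1 (German) phrase dictionary.
# These entries are made up for illustration only.
TOY_DICT = {
    "run down": "heruntergekommen",
    "house": "Haus",
    "very": "sehr",
}


def inject_l1_fragment(sentence, dictionary, rng):
    """Replace one dictionary phrase in an L2 sentence with its L1 entry.

    Returns (corrupted_sentence, l1_fragment, l2_target), or None when no
    dictionary phrase occurs in the sentence.
    """
    tokens = sentence.split()  # assumption: whitespace tokenization is enough
    lowered = [t.lower() for t in tokens]

    # Collect every (start, length, phrase) where a dictionary phrase matches.
    candidates = []
    for phrase in dictionary:
        words = phrase.split()
        n = len(words)
        for i in range(len(tokens) - n + 1):
            if lowered[i:i + n] == words:
                candidates.append((i, n, phrase))

    if not candidates:
        return None

    # Pick one match at random and splice in the L1 fragment.
    start, length, phrase = rng.choice(candidates)
    l1_fragment = dictionary[phrase]
    corrupted = tokens[:start] + [l1_fragment] + tokens[start + length:]
    return " ".join(corrupted), l1_fragment, phrase


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    print(inject_l1_fragment("The house is very run down", TOY_DICT, rng))
```

Each returned triple could serve as one synthetic training pair (input
with L1 fragment, gold L2 phrase). This does not address points (b) and
(c) above: the injected fragment is always a "correct" dictionary entry,
and the surrounding context stays error-free.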
%%%%%
NUCLE error types
TAG ERROR TYPE
--------------------------
Vt Verb tense
Vm Verb modal
V0 Missing verb
Vform Verb form
SVA Subject-verb agreement
ArtOrDet Article or Determiner
Nn Noun number
Npos Noun possessive
Pform Pronoun form
Pref Pronoun reference
Wcip Wrong collocation/idiom/preposition [7312]
Wa Acronyms
Wform Word form [2241]
Wtone Tone
Srun Run-ons, comma splices
Smod Dangling modifier
Spar Parallelism
Sfrag Fragment
Ssub Subordinate clause
WOinc Incorrect sentence form
WOadv Adverb/adjective position
Trans Link words/phrases
Mec Punctuation, capitalization, spelling, typos
Rloc Local redundancy
Cit Citation
Others Other errors
Um Unclear meaning (cannot be corrected)
%%%%%
EFCAMDAT error types
x >> y change from x to y
AG agreement
AR article
CO combine sentences
C capitalization
D delete
EX expression of idiom
HL highlight
I(x) insert x
MW missing word
NS new sentence
NWS no such word
PH phraseology
PL plural
PO possessive
PR preposition
PS part of speech
PU punctuation
SI singular
SP spelling
VT verb tense
WC word choice
WO word order
%%%%%%%%%%%%%%%