Addenda
Conceptual Summary of Jockers’ Chapter II
- reading files into R variables
- removing what is not the actual text (by splitting on markers, using tags/metadata, or other delimiters)
- normalizing text (lowercasing, removing non-letter characters, normalizing spelling variations, removing noise characters such as short vowels in Arabic)
- collecting, checking, and visualizing simple descriptive statistics (unique words, word frequencies, etc.)
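The steps above can be sketched in R. This is a minimal sketch: the sample string stands in for a file read with scan() or readLines(), and the Latin-only regex would need adjusting for Arabic script.

```r
# Toy input standing in for a file read with scan() or readLines()
text.v <- "Call me Ishmael. Some years ago--never mind how long precisely--"

# Normalize: lowercase, then split on anything that is not a letter
# (for Arabic one would instead strip short vowels and other diacritics)
word.v <- unlist(strsplit(tolower(text.v), "[^a-z]+"))
word.v <- word.v[word.v != ""]          # drop empty strings left by the split

# Simple descriptive statistics
freqs.t      <- table(word.v)           # word frequencies
total.words  <- length(word.v)          # tokens
unique.words <- length(freqs.t)         # types
```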
Challenges for Arabic Corpus
- What is the most difficult text in terms of the richness of its vocabulary?
- What is the easiest text, by the same measure?
- What are the caveats in evaluating the richness of vocabulary?
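One simple (and imperfect) way to quantify vocabulary richness is the type-token ratio; a minimal sketch with hypothetical word vectors:

```r
# Type-token ratio: unique word types divided by total word tokens
ttr <- function(word.v) length(unique(word.v)) / length(word.v)

# Hypothetical cleaned word vectors for two short texts
text1.v <- c("the", "whale", "the", "sea", "the", "ship")
text2.v <- c("desert", "caravan", "night", "star", "sand", "wind")

ttr(text1.v)  # 4 types / 6 tokens
ttr(text2.v)  # 6 types / 6 tokens
```

A key caveat: TTR falls as texts grow longer, so raw ratios are only comparable between texts of similar length.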
Conceptual Summary of Jockers’ Chapter III
- converting raw word counts into relative frequencies
- plotting first graphs of the results
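The move from raw counts to relative frequencies, plus a first graph, might look like this (the word vector is a toy stand-in for a real tokenized text):

```r
# Toy word vector standing in for a tokenized text
word.v <- c("the", "sea", "the", "whale", "the", "ship", "sea", "the")

freqs.t     <- table(word.v)
rel.freqs.t <- 100 * freqs.t / sum(freqs.t)  # percent of all tokens

# A first graph: sorted barplot of relative frequencies
barplot(sort(rel.freqs.t, decreasing = TRUE),
        ylab = "Percent of all tokens",
        main = "Relative word frequencies (toy example)")
```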
Challenges for Arabic Corpus
- What are the top ten words in specific Arabic texts?
- Do Arabic texts differ much in this regard?
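A sketch of how such a comparison might be set up, using toy word vectors as stand-ins for real tokenized Arabic texts (the particular words are arbitrary common function words):

```r
# Top-n words of a tokenized text by raw frequency
top.words <- function(word.v, n = 10) {
  names(head(sort(table(word.v), decreasing = TRUE), n))
}

# Toy stand-ins for two tokenized Arabic texts
text1.v <- c("في", "من", "في", "قال", "من", "في", "الله", "قال", "في")
text2.v <- c("من", "إلى", "من", "كتاب", "من", "علم", "إلى", "من")

top.words(text1.v)
top.words(text2.v)
# How much do the two lists overlap?
intersect(top.words(text1.v), top.words(text2.v))
```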
Conceptual Summary of Jockers’ Chapter IV
- saving often-used results in variables for easier reuse
Code to try (assumes moby.word.v, the word vector built in Jockers' earlier chapters, is already in memory):

check <- "word"                                # replace `word` with any word and rerun
hits.v <- which(moby.word.v == check)          # positions where `check` occurs
hits.count <- length(hits.v)                   # how many times it occurs
check.count.v <- rep(NA, length(moby.word.v))  # one slot per word in the novel
check.count.v[hits.v] <- 1                     # mark each occurrence with a 1
plot(check.count.v,
     main = paste0("Dispersion Plot of '", check, "' (", hits.count, " times) in Moby Dick"),
     xlab = "Novel Time", ylab = check, type = "h", ylim = c(0, 1), yaxt = "n")