Conceptual Summary of Jockers’ Chapter II

  • reading files into R variables
  • removing what is not the actual text (splitting, using tagging/metadata, or some other markers)
  • normalizing text (lowercasing, removing non-letter characters, normalizing spelling variations, removing noise characters such as short-vowel marks in Arabic)
  • collecting, checking and visualizing simple descriptive statistics (unique words, word frequencies, etc.)
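
The steps above can be sketched in a few lines of R. This is a minimal illustration, not Jockers' exact code: the toy text, the "#START#"/"#END#" markers, and all variable names are assumptions made for the example. A real text would be read with scan() from a file instead.

```r
# Toy stand-in for a file; a real text could be read with
# scan("path/to/file.txt", what = "character", sep = "\n")  (hypothetical path)
text.v <- c("HEADER: sample metadata",
            "#START#",
            "Call me Ishmael. Some years ago, never mind how long precisely,",
            "I thought I would sail about a little and see the watery part of the world.",
            "#END#",
            "FOOTER: licence text")

# Remove what is not the actual text, using the (hypothetical) markers
start.v <- which(text.v == "#START#")
end.v   <- which(text.v == "#END#")
body.v  <- text.v[(start.v + 1):(end.v - 1)]

# Normalize: join the lines, lowercase, split on non-word characters
body.v <- tolower(paste(body.v, collapse = " "))
word.v <- unlist(strsplit(body.v, "\\W"))
word.v <- word.v[word.v != ""]   # drop the empty strings left by the split

# For Arabic, short vowels might be stripped before splitting, for example with
# gsub("\\p{Mn}", "", body.v, perl = TRUE); this is an assumption and it also
# removes other combining marks, so check the result on your own texts.

# Simple descriptive statistics
total.words  <- length(word.v)                       # tokens
unique.words <- length(unique(word.v))               # types
freqs.t      <- sort(table(word.v), decreasing = TRUE)
head(freqs.t, 5)
```

The frequency table freqs.t is the starting point for the relative frequencies and graphs of the later chapters.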

Challenges for Arabic Corpus

  • Which text is the most difficult in terms of the richness of its vocabulary?
  • Which text is the easiest, by the same measure?
  • What are the caveats in evaluating the richness of vocabulary?
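
One common way to approach these questions is the type-token ratio (unique words divided by total words). The sketch below uses made-up word vectors and a hypothetical helper named ttr to show one caveat: the ratio drops as a text gets longer, so texts of different lengths are not directly comparable.

```r
# Type-token ratio: unique words (types) divided by all words (tokens)
# ttr is a hypothetical helper name, not a function from Jockers or base R
ttr <- function(word.v) length(unique(word.v)) / length(word.v)

short.v <- c("the", "whale", "swam", "away")
long.v  <- rep(short.v, times = 25)   # same vocabulary, 100 tokens

ttr.short <- ttr(short.v)   # every word is unique
ttr.long  <- ttr(long.v)    # same four types spread over 100 tokens

# One possible workaround: compare equal-sized samples of each text
ttr.sample <- ttr(long.v[1:4])

print(c(short = ttr.short, long = ttr.long, sample = ttr.sample))
```

For Arabic there is a further caveat to keep in mind: attached prefixes and suffixes turn one dictionary word into many distinct surface forms, so a count of unique strings may overstate vocabulary richness unless the text is normalized consistently.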

Conceptual Summary of Jockers’ Chapter III

  • computing relative word frequencies (each word's share of the whole text)
  • plotting the first graphs
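
As a rough sketch of both steps, the code below turns raw counts into percentages of the whole text and plots the top words. The ten-word vector is invented for the example and the variable names are assumptions; with a real text, word.v would hold all of its words and the top ten would be taken instead of the top three.

```r
# Toy word vector standing in for a full text
word.v <- c("the", "whale", "the", "sea", "the", "whale", "ship", "sea", "the", "a")

# Raw counts, sorted from most to least frequent
freqs.t <- sort(table(word.v), decreasing = TRUE)

# Relative frequencies: each word's percentage of all words
rel.freqs.t <- 100 * (freqs.t / sum(freqs.t))

# Top words (use 1:10 on a real text)
top.t <- rel.freqs.t[1:3]

# First graph: plot the values as a plain numeric vector and
# label the x axis with the words themselves
plot(as.vector(top.t), type = "b",
     xlab = "Top Words", ylab = "Percentage of Full Text", xaxt = "n")
axis(1, at = seq_along(top.t), labels = names(top.t))
```

Converting the table with as.vector() before plotting is a deliberate choice here: it keeps plot() from drawing its own axis for table objects, so the custom axis() call is the only one.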

Challenges for Arabic Corpus

  • What are the top ten words in specific Arabic texts?
  • Do Arabic texts differ much in this regard?

Conceptual Summary of Jockers’ Chapter IV

  • saving frequently used results in variables for easier reuse

Code to try:

check <- "word" # replace "word" with any word and run the code
# assumes moby.word.v (the vector of all words in the novel) already exists
n.time.v <- seq_along(moby.word.v) # "novel time": one position per word
word.v <- which(moby.word.v == check) # positions where check occurs
length.word <- length(word.v) # number of occurrences
check.count.v <- rep(NA, length(n.time.v))
check.count.v[word.v] <- 1 # mark the occurrences with a 1
plot(check.count.v, main=paste0("Dispersion Plot of '",check,"' (",length.word," times) in Moby Dick"),
     xlab="Novel Time", ylab=check, type="h", ylim=c(0,1), yaxt='n')