Johannes Haupt remerge.io

Streaming newline-delimited JSON data in R (Yelp challenge)

There are several packages for loading JSON data into R. I usually use jsonlite and its function fromJSON(), but when I tried to load the data provided for the Yelp Dataset Challenge, it failed with the error:

Error in parse_con(txt, bigint_as_char) : parse error: trailing garbage
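A call like the following reproduces it (same file path as used below):

library(jsonlite)
reviews <- fromJSON("./data/reviews_ed.txt")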

Looking at the .txt file, we see that the data is stored as newline-delimited JSON (ndjson), a format designed for record-by-record reading and streaming of data. There is a function hidden in jsonlite to read this format, and it's called stream_in().
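In ndjson, every line is a complete, self-contained JSON record, so a reader can process the file one line at a time. For illustration, a file with two (made-up) reviews would look like this:

{"review_id": "abc123", "stars": 5, "text": "Great pizza!"}
{"review_id": "def456", "stars": 2, "text": "Slow service."}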

Since it processes the input data line-by-line, we’ll have to open a connection to the file that we want to load.

# Open a connection to the file containing the reviews
connection_reviews <- file("./data/reviews_ed.txt")

We can then use stream_in() to process the file. Loading a subset of the data that I created, containing 46,000 reviews, takes a few seconds.

library(jsonlite)
# Use stream_in instead of fromJSON
reviews <- stream_in(connection_reviews)
## opening file input connection.
##  Found 500 records...
##  Found 1000 records...
##  Found 1500 records...
##  Found 2000 records...
##  Found 2500 records...
##  Found 3000 records...
##  Found 3500 records...
##  Found 4000 records...
##  Found 4500 records...
##  Found 5000 records...
##  etc.
##  Imported 46019 records. Simplifying...
## closing file input connection.
str(reviews, 1)
## 'data.frame':    46019 obs. of  10 variables:
##  $ review_id  : chr  "H7eJZ9azd1eH5minOhc-uw" "fTLZIeehPoPe_vv8NVS51g" "k5aSkCKZ7jhZcD3a5cy_Ag" "qjn_nLwosOpcjxFsQ2Tmsw" ...
##  $ user_id    : chr  "VRVCKQhYDCkzaEDce8GEtQ" "SxV1Jq7UANuSYpn42JXvOA" "soDF6mePh1SuNZI3rN7HPQ" "LURC3E0DoXYgN9aYTF3XOg" ...
##  $ business_id: chr  "-3pfhzz9CB7F2DpbF1Ko7Q" "-3pfhzz9CB7F2DpbF1Ko7Q" "-3pfhzz9CB7F2DpbF1Ko7Q" "2PqCZxon6AZHJrQ5iam4LA" ...
##  $ stars      : int  5 1 3 2 1 2 4 4 4 3 ...
##  $ date       : chr  "2008-07-06" "2010-04-15" "2015-03-10" "2012-02-04" ...
##  $ text       : chr  "This Bar Restaurant is a wide open airy space within the Apex |International Hotel.We booked a table for Fathers Day Lunch. I h"| __truncated__ "Attached to the hotel my parents were staying in, Metro was restaurant of choice for a quick - but classy - meal before they we"| __truncated__ "Had dinner here in december last year, it was close to my friends apt and we just went there because we didn't want to go far.\"| __truncated__ "It's my local chippy, and I stopped by there after arriving on the last train back from St Andrews, the other night. It did the"| __truncated__ ...
##  $ useful     : int  0 0 2 1 1 0 0 0 0 0 ...
##  $ funny      : int  0 0 2 0 0 0 0 0 1 0 ...
##  $ cool       : int  0 0 2 0 0 0 0 1 1 0 ...
##  $ type       : chr  "review" "review" "review" "review" ...
# Great, we have the $stars rating and review $text
head(reviews$text, 2)
## [1] "This Bar Restaurant is a wide open airy space within the Apex |International Hotel.We booked a table for Fathers Day Lunch. I have to say the food was excellent, the service was very good & value for money on the Sunay Lunch menu was excellent. The soup was piping hot, the Roast beef melted in the mouth! Didn't have room for puds but they looked fantastic"                                                                                                                                                                                                                                                                                                                                                                                                                                                       
## [2] "Attached to the hotel my parents were staying in, Metro was restaurant of choice for a quick - but classy - meal before they were to return home a few hours later.\n\nMetro is, I admit - a beautiful restaurant, modern, clean and bright. But my compliments will end there.\n\nThe meals were expensive, somewhat tasteless, tiny and my soup starter left an odd coating in my mouth for the rest of the meal. The staff were rude and unhelpful, bearly conceling thier bordom and hate for thier jobs. Service was so slow that I counted my dad checking his watch more than once with fear for missing the plane in two hours time. Not the best.\n\nNow, I am willing to admit, we may have just been there at the wrong time or something, but still, it was an awful experience which I will not want to repeat."

Text mining

Now that the data is loaded, let's see how we would process it. Working with natural language usually requires extensive cleaning and preparation of the raw data. The preparation steps include

  • defining word splits (tokenization) and terms (n-grams)
  • standardization of the words, e.g. through typo dictionaries and stemming/lemmatization
  • filtering of stop words and punctuation, maybe also based on a pre-defined dictionary

These steps and the word lists/dictionaries depend on the language and on the goal of the analysis. For example, it is probably a bad idea to remove punctuation, including smileys, from tweets.

We will be using package tm, (at the moment) the standard text mining package in R, for these preprocessing tasks, although I'll check out tidytext when I find the time. The package provides several convenient functions and a structure for working efficiently on large corpora: a collection of text elements can be defined as a Corpus object, and function tm_map() (compare base function Map()) can then be used to apply any function to each document within the corpus.

# We will use package tm (text mining)
library(tm)
## Loading required package: NLP
# Package tm provides infrastructure to read in text from 
# different sources, e.g. a folder of files, an XML file, 
# or a vector, as in this case
review_texts <- reviews$text[1:100]
reviews_source <- VectorSource(review_texts)
# After we have specified our vector as a source,
# we can read the reviews into a corpus, which contains
# the text and meta data for each document
corpus <- SimpleCorpus(reviews_source)

# Ultimately, we aim to break the texts up into small pieces called tokens,
# words or terms, and count how often these appear in each document.
# Since very similar words come in different forms, e.g. pizza/Pizza/pizzas/pizza!,
# we standardize the text elements before counting.

# Transform all letters to lower-case
# Function tm_map applies the function argument to all documents in the corpus
corpus <- tm_map(corpus, content_transformer(tolower))
# Remove all punctuation characters
replaceCharacter <- content_transformer(function(x, pattern, replacement)
    gsub(pattern = pattern,replacement = replacement, x))
corpus <- tm_map(corpus, replaceCharacter, "'", "")
corpus <- tm_map(corpus, replaceCharacter, "[[:punct:]]", " ")
# Here we make a decision to separate pizza-place into pizza place
# but wouldn't into wouldnt
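
# A quick check of the two passes on a toy string:
gsub("'", "", "pizza-place? wouldn't!")
## [1] "pizza-place? wouldnt!"
gsub("[[:punct:]]", " ", "pizza-place? wouldnt!")
## [1] "pizza place  wouldnt "
# (stripWhitespace below collapses the double spaces)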

# Reduce all whitespace to one and delete line breaks, etc.
corpus <- tm_map(corpus, stripWhitespace)
# Remove words without semantic content, like 'and' or 'it'
# You see that stopwords are language specific 
head(stopwords("english"), 10)
##  [1] "i"         "me"        "my"        "myself"    "we"       
##  [6] "our"       "ours"      "ourselves" "you"       "your"
corpus <- tm_map(corpus, removeWords, stopwords("english"))
# Reduce all words to their word stem 
# Lemmatization would require non-R dictionaries, e.g. TreeTagger 
corpus <- tm_map(corpus, stemDocument, "english")
# Here we choose to ignore that there are French and German reviews in the corpus
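
# stemDocument() uses the Porter stemmer from package SnowballC;
# calling the stemmer directly shows what it does to single words
# (the stems need not be proper English words):
library(SnowballC)
wordStem(c("cooked", "cooking", "cooks", "restaurants"), language = "english")
## [1] "cook"    "cook"    "cook"    "restaur"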

# We can now calculate the document-term matrix
# During the process, we ignore terms that occur in fewer than five reviews
# because we expect them to be relevant for only a small number of observations
dtm <- DocumentTermMatrix(corpus, control = list(bounds = list(global = c(5,Inf))))
inspect(dtm[1:10, 1:20])
## <<DocumentTermMatrix (documents: 10, terms: 20)>>
## Non-/sparse entries: 40/160
## Sparsity           : 80%
## Maximal term length: 7
## Weighting          : term frequency (tf)
## Sample             :
##     Terms
## Docs bar didnt fantast food good look lunch say servic soup
##   1    1     1       1    1    1    1     2   1      1    1
##   10   0     0       0    0    1    0     0   0      0    0
##   2    0     0       0    0    0    0     0   0      1    1
##   3    0     1       0    1    1    0     0   0      1    0
##   4    0     0       0    0    1    1     0   0      0    0
##   5    1     2       0    0    0    1     0   0      0    0
##   6    1     1       0    0    1    0     0   1      0    0
##   7    0     0       1    0    0    0     0   0      1    0
##   8    0     0       0    0    0    0     0   0      0    0
##   9    0     0       0    0    0    0     0   1      0    0
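
From here on, the document-term matrix can be used like any other feature matrix. As a sketch of possible next steps, tm's findFreqTerms() lists the terms above a given total count, and as.matrix() turns the sparse counts into a dense matrix that can be combined with the star ratings for modelling (the threshold of 20 is an arbitrary choice; output omitted):

# Terms that occur at least 20 times across the 100 reviews
findFreqTerms(dtm, lowfreq = 20)
# Combine the term counts with the target variable for modelling
model_data <- data.frame(stars = reviews$stars[1:100], as.matrix(dtm))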