A function that takes a string as input and returns the "Did you mean..." or "Showing results for..." suggestion from google.com. Good for misspelled names or locations.
library(RCurl)
##if on windows might need: options(RCurlOptions = list(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl")))
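didYouMean <- function(input) {
  # Build the search URL, with spaces in the query replaced by "+"
  input <- gsub(" ", "+", input)
  doc <- getURL(paste("https://www.google.com/search?q=", input, "/", sep = ""))
  # Find every occurrence of the two phrases Google uses for corrections
  dym <- gregexpr("Did you mean", doc, fixed = TRUE)[[1]]
  srf <- gregexpr("Showing results for", doc, fixed = TRUE)[[1]]
  # Pull the suggested query out of the 1000 characters starting at `pos`
  extract <- function(pos) {
    doc2 <- substring(doc, pos, pos + 1000)
    # The suggestion sits between "q=" and "/&" in the link;
    # fixed = TRUE matches both literally rather than as regular expressions
    s1 <- regexpr("q=", doc2, fixed = TRUE)
    s2 <- regexpr("/&", doc2, fixed = TRUE)
    gsub("+", " ", substring(doc2, s1 + 2, s2 - 1), fixed = TRUE)
  }
  if (length(dym) > 1) {
    # "Did you mean" occurs more than once in the page
    extract(dym[1])
  } else if (srf[1] != -1) {
    extract(srf[1])
  } else {
    # No correction offered: return the input as typed
    gsub("+", " ", input, fixed = TRUE)
  }
}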
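To see what the parsing step is doing without hitting Google, here is a minimal offline sketch. The helper name extractSuggestion and the HTML fragment are made up for illustration; the fragment only assumes that the suggestion link has the shape q=some+words/& that the function above looks for, and Google's real markup may well differ.

```r
# Hypothetical helper (not part of the function above): pull the suggested
# query out of a chunk of HTML, assuming the link looks like ...q=words/&...
extractSuggestion <- function(html) {
  start <- regexpr("q=", html, fixed = TRUE)   # literal match, no regex
  end   <- regexpr("/&", html, fixed = TRUE)
  # Give up if either marker is missing or they appear in the wrong order
  if (start == -1 || end == -1 || end <= start) return(NA_character_)
  query <- substring(html, start + 2, end - 1) # skip the 2 characters of "q="
  gsub("+", " ", query, fixed = TRUE)          # turn "+" back into spaces
}

# Invented fragment, for illustration only
fragment <- 'Did you mean: <a href="/search?q=george+washington/&amp;spell=1">george washington</a>'
extractSuggestion(fragment)
```

Returning NA when the markers are missing is a choice made for this sketch; it makes a changed page layout easier to spot than an empty or garbled string would.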
So didYouMean("gorecge washington") returns "george washington"
Works well with misspelled companies, nouns, or phrases. For example, say you're doing text analysis on Twitter and a customer raves about Carlsberg beer. The only problem is that he was enjoying the product while tweeting (something that happens only rarely, I'm sure) and wrote "clarsburg gprou". Not to worry!
> didYouMean("clarsburg gprou")
[1] "carlsberg group"
Or suppose you have a three-phase plan for profits. This can help you get there!
> didYouMean("clletc nuderpants")
[1] "collect underpants"
This is really cool. Thanks for posting.
How performant is this solution? You mentioned Twitter analysis; how long would it take to spellcheck a thousand tweets?
To do Twitter analysis, you'd have to break up each tweet into separate words or phrases. If you google an entire tweet, it generally won't give a "Did you mean..." answer.
However, even once you've broken up the tweets, it's not too fast. Give it a try, and if you have improvements or suggestions for the code, let me know!
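Here is a rough sketch of the "break up the tweet" idea, in case anyone wants to time it. The helpers tweetPhrases and correctPhrases, the two-word phrase length, and the pause argument are all assumptions made for illustration, not something from the post. The corrector defaults to identity so the sketch runs offline; pass didYouMean to actually query Google.

```r
# Hypothetical helper: split a tweet into overlapping n-word phrases
tweetPhrases <- function(tweet, n = 2) {
  words <- strsplit(trimws(tweet), "[[:space:]]+")[[1]]
  # Short tweets are returned as a single phrase
  if (length(words) <= n) return(paste(words, collapse = " "))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts, function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Hypothetical helper: run a correction function over every phrase.
# `corrector` defaults to identity (no network access); `pause` is an
# assumed delay in seconds between calls.
correctPhrases <- function(tweet, corrector = identity, n = 2, pause = 0) {
  phrases <- tweetPhrases(tweet, n)
  vapply(phrases, function(p) {
    Sys.sleep(pause)
    corrector(p)
  }, character(1), USE.NAMES = FALSE)
}

tweet <- "loving this clarsburg gprou beer"
tweetPhrases(tweet)
system.time(correctPhrases(tweet))
```

With didYouMean as the corrector, every phrase costs one HTTP request, so the total time grows with the number of phrases rather than the number of tweets.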
On Windows 8 I had to run:
options(RCurlOptions = list(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl")))
Works like a charm after that.
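If the same script has to run on more than one machine, one option is to set this only on Windows. This is a small sketch that assumes the certificate bundle ships with RCurl at the path used above:

```r
# Apply the certificate setting only when running on Windows
if (.Platform$OS.type == "windows") {
  options(RCurlOptions = list(
    cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl")
  ))
}
```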