Thursday, May 22, 2014

didYouMean() Function: Using Google to correct errors in Strings

A function that takes a string as input and returns Google's "Did you mean..." or "Showing results for..." suggestion for it. Good for misspelled names or locations.


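library(RCurl)
## On Windows you might need: options(RCurlOptions = list(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl")))
didYouMean <- function(input) {
  # Google expects "+" between search terms
  input <- gsub(" ", "+", input)
  # The trailing "/" comes back in the suggestion link, so "/&" marks where the corrected query ends
  doc <- getURL(paste("https://www.google.com/search?q=", input, "/", sep = ""))

  # Positions of the two suggestion banners in the page (-1 when absent)
  dym <- gregexpr("Did you mean", doc, fixed = TRUE)[[1]]
  srf <- gregexpr("Showing results for", doc, fixed = TRUE)[[1]]

  # Pull the corrected query out of the first link that follows position `start`.
  # fixed = TRUE matches "q=" and "/&" literally rather than as regular expressions.
  extract <- function(start) {
    doc2 <- substring(doc, start, start + 1000)
    s1 <- gregexpr("q=", doc2, fixed = TRUE)[[1]][1]
    s2 <- gregexpr("/&", doc2, fixed = TRUE)[[1]][1]
    gsub("+", " ", substring(doc2, s1 + 2, s2 - 1), fixed = TRUE)
  }

  if (length(dym) > 1) {
    # "Did you mean" found more than once
    extract(dym[1])
  } else if (srf[1] != -1) {
    extract(srf[1])
  } else {
    # No suggestion: hand back the input with its spaces restored
    gsub("+", " ", input, fixed = TRUE)
  }
}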

So didYouMean("gorecge washington") returns "george washington".
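
If you want to see the extraction step without hitting Google, here is a minimal offline sketch. The HTML fragment is made up for illustration (an assumption about the rough shape of the suggestion link, not a captured page), and the sketch uses fixed-string matching to find the "q=" and "/&" markers around the corrected query.

```r
# Hypothetical fragment: an assumed shape for the suggestion link, not real Google output
fragment <- 'Showing results for <a href="/search?q=george+washington/&amp;spell=1">george washington</a>'

# Locate the literal markers that surround the corrected query
start <- regexpr("q=", fragment, fixed = TRUE)[1]
end   <- regexpr("/&", fragment, fixed = TRUE)[1]

# Skip past "q=" (2 characters), stop just before "/&", then turn "+" back into spaces
suggestion <- gsub("+", " ", substring(fragment, start + 2, end - 1), fixed = TRUE)
print(suggestion)
# [1] "george washington"
```

If Google changes how it writes that link, the markers move and the extraction has to move with them.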


Works well with misspelled companies, nouns, or phrases. For example, say you're doing text analysis on Twitter and a customer raves about Carlsberg beer. Only problem is he's enjoying their product while tweeting (something that happens only rarely, I'm sure) and wrote "clarsburg gprou". Not to worry!

> didYouMean("clarsburg gprou")
[1] "carlsberg group"

Or suppose you have a three-phase plan for profits. This can help you get there!

> didYouMean("clletc nuderpants")
[1] "collect underpants"
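
Each call makes one web request, so for a whole batch of terms (words pulled out of tweets, say) it may help to look up each distinct term only once. Here is a rough sketch of that idea. The names correctTerms, corrector and pause are made up for this example; pass didYouMean as the corrector to use it with the function above.

```r
# Hypothetical helper: correct a vector of terms, querying each distinct term once.
# `corrector` is any function mapping one string to one string (e.g. didYouMean);
# `pause` is an optional delay in seconds before each lookup.
correctTerms <- function(terms, corrector, pause = 0) {
  unique.terms <- unique(terms)
  fixed <- vapply(unique.terms, function(term) {
    Sys.sleep(pause)
    corrector(term)
  }, character(1), USE.NAMES = FALSE)
  # Map the corrected values back onto the original order, duplicates included
  fixed[match(terms, unique.terms)]
}

# Offline demo with a stand-in corrector so nothing is sent to Google
demo <- correctTerms(c("beer", "group", "beer"), toupper)
print(demo)
# [1] "BEER"  "GROUP" "BEER"
```

With the real function that would be something like correctTerms(terms, didYouMean, pause = 1), where the one-second pause is just a guess at a polite rate.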

5 comments:

  1. This is really cool. Thanks for posting.

  2. How performant is this solution? You mentioned twitter analysis, how long would it take to spellcheck a thousand tweets?

  3. To do Twitter analysis, you'd have to break each tweet up into separate words or phrases. If you google an entire tweet, it generally won't give a "Did you mean..." answer.

    Even once you've broken up the tweets, though, it isn't very fast, since every lookup is a separate request to Google. Give it a try, and if you have improvements or suggestions for the code, let me know!

  4. On Windows 8 I had to run:
    options(RCurlOptions = list(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl")))

    Works like a charm after that.
