Thursday, May 12, 2016

When the Predictions are more accurate than the Response

The methodology I propose is to use the original classifications to build a model and then use that model to re-classify the paragraphs. The simulation below suggests this method increases classification performance.
##Assume there are two topics. Each paragraph has data - in this case a random normal draw - and the higher the value, the higher the probability it belongs to a particular topic. This could be thought of as a word frequency

##create documents - each document consists of 100 paragraphs 
documents=lapply(1:100, function(x) rnorm(100))

##create topics for each paragraph - 100 documents with 100 paragraphs each
paragraph_classes=lapply(documents, function(x) rbinom(100,size=1,prob=1/(1+exp(-x))))

## a document consists of many paragraphs; the document is assigned its most common paragraph topic
document_classes=(sapply(paragraph_classes, function(x) sum(x>0) )>50)

unlisted_documents=unlist(documents)
unlisted_class=rep(document_classes, each=100)

##the original correct classification rate is around 54%
table(unlist(paragraph_classes),unlisted_class)
##    unlisted_class
##     FALSE TRUE
##   0  2643 2353
##   1  2257 2747
sum(diag(table(unlist(paragraph_classes),unlisted_class)))/length(unlisted_class)
## [1] 0.539
##build a model - each paragraph's response variable is the class originally assigned to its document
model=glm(unlisted_class~unlisted_documents, family="binomial")


##the predicted classification using the model on the data - the classification rate is roughly 14 percentage points higher than the original classification rate
predicted_values=scale(predict(model))
table(predicted_values>0,unlist(paragraph_classes)>.5)
##        
##         FALSE TRUE
##   FALSE  3386 1647
##   TRUE   1610 3357
sum(diag(table(predicted_values>0,unlist(paragraph_classes)>.5)))/length(unlisted_class)
## [1] 0.6743

Monday, January 11, 2016

Are Turbocharged Engines More Fuel Efficient?

Introduction:

My previous post discussed methods to uncover the effect of a particular explanatory variable on a response variable in machine learning models. These work by changing the variable of interest and measuring how much the output changes while keeping all other variables constant. However, the assumption that we can hold the other variables constant is almost surely incorrect. In most applications, changing one variable changes the distributions of the other covariates as well.

In this post I will show the effect of a change in values by simulating a dataset, keeping only the variables of interest fixed while allowing all other variables to change. I will do this by simulating data points using the method described here. I'll compare this method with existing techniques.

The topic I'm going to explore is whether turbocharged engines actually reduce fuel consumption. Over the past few years turbocharging has increasingly been used to reduce engine size while increasing power, improving fuel economy, and reducing emissions. It has been hailed as the solution to increasing power while decreasing fuel consumption. A common theme among proponents is that one can achieve V-6 power by turbocharging an I-4 engine while retaining I-4 fuel consumption. However, real-world driving has suggested that a turbocharged option doesn't necessarily reduce consumption; it merely gives you the option to reduce consumption if you don't use all the power.

Data:
The car data was scraped from the Car and Driver website and includes 416 car reviews with "observed mpg" along with 15 other variables, including weight, horsepower, torque, zero-to-60 time, etc.

I'll compare the observed fuel economy of a turbo I-4 and a naturally aspirated V-6 with the same 0-to-60 time (6.5 seconds). The two regimes are:

Turbo Regime: {Engine Breathing=Turbo, Engine_Type=I-4, Zero.60=6.5, Date=Most recent}
Naturally Aspirated Regime: {Engine Breathing=Naturally Aspirated, Engine_Type=V-6, Zero.60=6.5, Date=Most recent}

I picked these examples because downsizing from a V-6 to an I-4 is a very common choice, and a 0-60 time of 6.5 seconds is fairly standard for a mid-range family sedan.

My previous post discussed the methodology of holding everything else constant. In this case I predict each observation twice, once under each regime (turbo I-4 and naturally aspirated V-6), and compare the predictions:


f(mpg.observed | other covariates, Turbo Regime) - f(mpg.observed | other covariates, Naturally Aspirated Regime)

Where f() is approximated by a machine learning method, in this case a random forest. Below is a histogram of the results.
Using this method, the mean change in observed mpg is -0.028, and there is wide variation in the changes (1st quartile -0.077, 3rd quartile 0.025). These results suggest that turbocharging, by itself, does not have much of an effect on fuel consumption.
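
As a concrete illustration, below is a minimal, runnable sketch of this predict-twice-and-difference idea. It uses the built-in mtcars data as a stand-in for the scraped Car and Driver data, comparing 4- versus 6-cylinder engines at a fixed quarter-mile time rather than the exact turbo and naturally aspirated regimes above.

library(randomForest)

set.seed(1)
rf=randomForest(mpg~., data=mtcars)

##score every car twice - once under each "regime" - holding all other columns at their observed values
regime_a=transform(mtcars, cyl=4, qsec=17)
regime_b=transform(mtcars, cyl=6, qsec=17)

##per-observation change in predicted mpg between the two regimes
mpg_change=predict(rf, regime_a)-predict(rf, regime_b)
summary(mpg_change)
hist(mpg_change)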

However, it's clear that changing the engine from an I-4 turbo to a naturally aspirated V-6 changes other variables as well. For instance, I-4 engines are lighter than V-6 engines (two fewer cylinders, after all), so it's possible that moving between the Turbo and Naturally Aspirated regimes changes fuel consumption through other channels.

Using the method from this post, we can simulate the distributions:

P(mpg.observed, other covariates | Turbo Regime) and P(mpg.observed, other covariates | Naturally Aspirated Regime). We can then see the differences between the distributions with these simulations.

Below is a pairs plot of several variables under each of the different regimes.




The mean observed mpg under the I-4 Turbo regime is 25.57 versus 24.29 under the V-6 regime, indicating that I-4 turbos are, on average, 1.28 mpg more fuel efficient than similarly performing V-6s. It appears this is due, at least partially, to the weight decrease that comes with the turbo engine. That is, turbo engines are associated with lighter cars (by about 150 pounds on average), and that reduces fuel consumption.

Conclusion:
This post compared an existing method of predicting changes with one that simulates the joint distribution under different conditions. The resulting distribution allows the other covariates to change by an expected amount given a change in the variables of interest. The existing method found essentially no change in average fuel consumption because the gain in fuel efficiency isn't directly caused by turbocharging the engine but comes through other channels, like the decrease in weight.


Code and Data

Sunday, January 3, 2016

Sampling Arbitrary data

Introduction:
Simulating data usually requires specifying a variance-covariance matrix, which restricts the relationships between variables to be linear. However, a linear assumption can miss important non-linear relationships. This post uses quantile regression forests to simulate an arbitrary dataset.

MCMC Sampling:
MCMC sampling can simulate from any multivariate density as long as one has the full conditionals. The full conditionals in this case are models of a particular variable given all of the other variables. To make this concrete, suppose we have a dataset with n variables (x1, x2, ..., xn). The estimated full conditionals are:

f(x1 | x2, x3, ..., xn)
f(x2 | x1, x3, ..., xn)
.
.
f(xn | x1, x2, ..., x(n-1))

Where f() is a machine learning algorithm of choice that can provide a distribution of values; in this case I use quantregForest() for continuous variables and randomForest() for categorical variables.

Once the models are built, the algorithm is as follows:
- Choose a random observation as the starting point
- For each of N iterations:
     - For each variable:
           - Sample a proposal value from the predicted conditional distribution and compare its likelihood with the current value. Accept the proposal with probability min(p(proposal)/p(current), 1)
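
To make the loop above concrete, here is a small, self-contained sketch of the sampler on the numeric columns of iris. It is only an illustration of the idea, not the linked code: the conditional density p() is approximated crudely from predicted quantiles (a Laplace-style density centred at the predicted median), and only continuous variables are handled.

library(quantregForest)

dat=iris[,1:4]
vars=names(dat)

##fit one full conditional per variable: f(x_j | all other variables)
models=lapply(vars, function(v) quantregForest(x=dat[,setdiff(vars,v),drop=FALSE], y=dat[[v]]))
names(models)=vars

##crude stand-in for the conditional density p(), built from predicted quantiles
approx_dens=function(model, x_other, value){
  qs=as.numeric(predict(model, x_other, what=c(.1,.5,.9)))
  spread=(qs[3]-qs[1])/2+1e-6
  exp(-abs(value-qs[2])/spread)/spread
}

n_iter=200
current=dat[sample(nrow(dat),1),]  ##random observation as starting point
sims=current[rep(1,n_iter),]

for(i in 1:n_iter){
  for(v in vars){
    x_other=current[,setdiff(vars,v),drop=FALSE]
    ##sample a proposal from the predicted conditional distribution (a random quantile)
    proposal=as.numeric(predict(models[[v]], x_other, what=runif(1)))
    ##accept the proposal with probability min(p(proposal)/p(current),1)
    if(runif(1)<min(approx_dens(models[[v]],x_other,proposal)/approx_dens(models[[v]],x_other,current[[v]]),1)) current[[v]]=proposal
  }
  sims[i,]=current
}
head(sims)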

Code can be found here.

Results:
A random forest model was built to test whether the simulated data could be distinguished from the original data. The resulting KS test on out-of-sample predictions was insignificant, so we cannot reject the null that the predicted probabilities for the original and simulated data come from the same distribution - that is, the model can't distinguish between them.
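
Continuing the toy sampler sketch above, one way to run this kind of check is to label the original and simulated rows, fit a random forest to tell them apart, and compare the out-of-sample predicted probabilities for the two groups with a KS test. This mirrors the idea described here rather than reproducing the linked code exactly.

library(randomForest)

combined=rbind(cbind(dat, source=0), cbind(sims, source=1))
combined$source=factor(combined$source)

##hold out half the rows so the check is out of sample
test=sample(nrow(combined), nrow(combined)/2)
rf_check=randomForest(source~., data=combined[-test,])
probs=predict(rf_check, combined[test,], type="prob")[,"1"]

##if the classifier cannot separate them, the predicted probabilities
##for the two groups should come from the same distribution
ks.test(probs[combined$source[test]=="0"], probs[combined$source[test]=="1"])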

While that is a good result and shows that a model can't distinguish between simulated and original data, visually we can see a difference between the two datasets, shown below.

Original Iris Data

 Simulated Iris Data Set

The simulated dataset shows sharper boundaries between points that are not found in the original dataset. For example, the scatterplot of Sepal.Length against Sepal.Width shows a sharp boundary at approximately Sepal.Length = 5 in the simulated data, while it is more gradual in the original data.


Conclusion:
This post discussed a method to simulate an arbitrary dataset. While I cannot build a model that distinguishes the two datasets, visually they are distinguishable.

***Updated with code from Dmytro Perepolkin - Thanks!

Sunday, December 6, 2015

Beyond Beta: Relationships between partialPlot(), ICEbox(), and predcomps()

Machine Learning models generally outperform standard regression models in terms of predictive performance. However, these models tend to be poor at explaining how they arrive at a particular result. This post will discuss three methods used to peek inside "black box" models and the connections between them.

Ultimately the methods discussed here adhere to a similar principle: change the input and see how the output changes.

Suppose we have a model y=f(u,v) and we're interested in the effect of u on the output, keeping v constant. These methods change u while keeping the other variables, v, constant. For a particular observation x_i, predict at every unique value of u while keeping the rest of x_i fixed:

for q in (unique values of u):
    fhat(q) = predict(q, v_i)

Now we have, for one observation, the predicted value of y over the set of unique values of u, keeping everything else constant.
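
Here is a minimal, self-contained sketch of that loop. The data and model here are stand-ins (a simulated two-variable logistic regression), not the example discussed below, but the mechanics are the same: replicate one observation over every unique value of the variable of interest and predict.

set.seed(1)
df=data.frame(Price=runif(200,50,150), Size=rnorm(200))
df$y=rbinom(200,1,plogis(-5+.04*df$Price+.5*df$Size))
fit=glm(y~Price+Size, family="binomial", data=df)

i=1  ##a single observation
u_values=sort(unique(df$Price))

##replicate observation i once per unique Price, holding Size at its observed value
grid=df[rep(i,length(u_values)),]
grid$Price=u_values
grid$fhat=predict(fit, newdata=grid, type="response")

plot(grid$Price, grid$fhat, type="l", xlab="Price", ylab="predicted probability for observation i")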

Below is a plot of these predicted values, changing u while keeping the other variables, v, constant. (The code was taken from David Chudzicki's Blog Post.) It is a logistic regression with two variables. (This example was chosen to demonstrate the connections between the functions described here; one generally wouldn't use these techniques with a logistic regression. They are more appropriate for complex ensemble or hierarchical models.) The plot shows the prediction if we keep one variable constant while the other changes. In this case the variable Price changes while everything else is held constant.


If we repeat this procedure for every single observation, we get what the ICEbox package does. Below are two plots: 1) ICE plots using the predcomps code calculations and ggplot, and 2) ICE plots using the ice() function:
 

The above plots show that the two approaches produce essentially the same curves.

partialPlot() in randomForest takes the mean prediction at each unique value; partialPlot() is basically an aggregation of ICEbox. In this case the bold, colored line in the ICEbox chart is equivalent to the partialPlot() result. There are many cases where interaction effects are masked by taking the average effect, so visualizing ICE curves is more informative than partialPlot().
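
Continuing the toy example above: averaging the ICE curves at each unique value of Price gives the partial dependence curve, i.e. the kind of aggregate that partialPlot() plots.

##one ICE curve per observation - each column is the prediction path for one row of df
ice=sapply(1:nrow(df), function(i){
  grid=df[rep(i,length(u_values)),]
  grid$Price=u_values
  predict(fit, newdata=grid, type="response")
})

##averaging the ICE curves at each unique Price gives the partial dependence curve
partial_dependence=rowMeans(ice)
plot(u_values, partial_dependence, type="l", xlab="Price", ylab="average prediction")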

The goal of predcomps() isn't necessarily to visualize how variables are related; it's to find something similar to a "beta" coefficient for complex models. That is, it takes the average of (change in prediction) / (change in the variable of interest, u). Following the example from the first plot, this means it's the average slope of all predicted, unique data points relative to the original data point. The resulting graph is shown below.

Above we see the original data point has a price of 104. predcomps calculates the weighted average slope with respect to this point (the weight is based on the Mahalanobis distance between the points), shown as the lines labeled factor 0. To obtain the overall predcomps "beta" it does this for all observations (or as many as selected by the user).
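
A rough sketch of this calculation for the toy example above is below. The weighting here (an inverse function of the Mahalanobis distance computed on the other covariates) illustrates the idea; it is not the exact weighting scheme the predcomps package uses.

##weighted average slope of the prediction with respect to Price, anchored at observation i
apc_for_obs=function(i){
  others=setdiff(1:nrow(df), i)
  d_price=df$Price[others]-df$Price[i]
  ##change in prediction when Price moves to another observation's value but Size stays at observation i's value
  d_pred=predict(fit, newdata=transform(df[others,], Size=df$Size[i]), type="response")-
         predict(fit, newdata=df[i,], type="response")
  ##closer observations (in the other covariates) get more weight
  w=1/(mahalanobis(df[others,"Size",drop=FALSE], center=df$Size[i], cov=var(df["Size"]))+1)
  sum(w*d_pred/d_price)/sum(w)
}

##averaging over all observations gives a single "beta"-like summary for Price
mean(sapply(1:nrow(df), apc_for_obs))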

This post started by predicting over the range of the variable of interest u for a single observation, creating a single-observation plot. Doing this for many observations gives us ICE plots, aggregating those together gives us a partial plot, and taking the (weighted) slope with respect to the original data point gives us the predcomps beta.

Code


Resources:
predcomps website
Average Predictive Comparisons Paper
Peeking Inside the Black Box: Visualizing Statistical Learning with Plots of Individual Conditional Expectation