This post discusses a method for classifying the individual paragraphs of a document when the document itself is assigned only one topic.
In my usage, a document is an article that is assigned a Section (Business, Economics / Finance, Science, Europe, Middle East, etc.). Of course, each article can discuss multiple topics. For example, an article on Ukraine today might discuss the economics of its exchange rate along with the European Union and the Russian military. I’m assuming each paragraph discusses one particular topic, and I’m interested in which paragraphs discuss economics.
The methodology I propose is to build a model from the original document-level classification and then use that model to classify the individual paragraphs. The simulation below suggests this method increases classification performance.
## Assume there are two classes in each document. Each paragraph has data (here a random normal draw), and the higher the value, the higher the probability that the paragraph is associated with a particular topic. This could be thought of as a word frequency.
##create documents - each document consists of 100 paragraphs
documents=lapply(1:100, function(x) rnorm(100))
##create topics for each paragraph - 100 documents with 100 paragraphs each
paragraph_classes=lapply(documents, function(x) rbinom(100,size=1,prob=1/(1+exp(-x))))
## a document consists of several paragraphs, but the document is assigned the most common topic
document_classes=(sapply(paragraph_classes, function(x) sum(x>0) )>50)
unlisted_documents=unlist(documents)
unlisted_class=rep(document_classes, each=100)
##the original correct classification rate is around 54%
table(unlist(paragraph_classes),unlisted_class)
##    unlisted_class
##     FALSE TRUE
##   0  2643 2353
##   1  2257 2747
sum(diag(table(unlist(paragraph_classes),unlisted_class)))/length(unlisted_class)
## [1] 0.539
##build a model. each paragraph's response variable is the class originally assigned to its document
model=glm(unlisted_class~unlisted_documents, family="binomial")
#the predicted classification using the model on the data - the correct classification rate is about 13.5 percentage points higher than the original classification (0.6743 vs 0.539)
predicted_values=scale(predict(model))
table(predicted_values>0,unlist(paragraph_classes)>.5)
##        
##         FALSE TRUE
##   FALSE  3386 1647
##   TRUE   1610 3357
sum(diag(table(predicted_values>0,unlist(paragraph_classes)>.5)))/length(unlisted_class)
## [1] 0.6743
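The figures above come from a single unseeded run, so they will vary each time the code is executed. One way to check that the improvement is not a fluke is to wrap the simulation in a function and repeat it. The following is only a sketch under the same assumptions as the code above: `simulate_once` is a hypothetical helper name, the seed is arbitrary, and `plogis(x)` is R's built-in equivalent of `1/(1+exp(-x))`.

```r
# Hypothetical helper: run the simulation once and return both accuracy figures.
simulate_once <- function(n_docs = 100, n_paragraphs = 100) {
  x <- rnorm(n_docs * n_paragraphs)
  doc_id <- rep(seq_len(n_docs), each = n_paragraphs)
  # True paragraph topic: more likely to be 1 when the feature is high
  paragraph_class <- rbinom(length(x), size = 1, prob = plogis(x))
  # Document label: TRUE when more than half of its paragraphs are topic 1
  document_class <- tapply(paragraph_class, doc_id, mean) > 0.5
  # Every paragraph inherits the label of its document
  inherited <- as.vector(document_class[doc_id])
  fit <- glm(inherited ~ x, family = "binomial")
  # Paragraphs scoring above the mean linear predictor are assigned topic 1
  score <- predict(fit)
  predicted <- score > mean(score)
  c(baseline = mean(inherited == (paragraph_class == 1)),
    model = mean(predicted == (paragraph_class == 1)))
}

set.seed(2014)  # arbitrary seed, chosen only for reproducibility
results <- replicate(20, simulate_once())
rowMeans(results)  # average accuracy of each approach over 20 runs
```

Comparing the two entries of `rowMeans(results)` gives a steadier view of the baseline and model accuracy than a single table does.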