This post discusses a method for classifying the individual paragraphs of a document when only the document as a whole has been assigned a topic.
In my usage, a document is an article that is assigned a Section (Business, Economics / Finance, Science, Europe, Middle East, etc.). Of course, each article can discuss multiple topics. For example, a discussion of Ukraine today might cover the economics of its exchange rate along with the European Union and the Russian military. I'm assuming each paragraph discusses one particular topic, and I'm interested in which paragraphs discuss economics.
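The method I propose is to label every paragraph with its document's original classification, build a model on those labels, and then use the model to classify each paragraph. The simulation below suggests this improves paragraph-level classification compared with simply inheriting the document's label.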
## Assume there are two classes in each document. Each paragraph has data (in this
## case a random normal), and the higher the value, the higher the probability that the
## paragraph is associated with a particular topic. This could be thought of as word frequency.

## create documents - each document consists of 100 paragraphs
documents = lapply(1:100, function(x) rnorm(100))

## create topics for each paragraph - 100 documents with 100 paragraphs each
paragraph_classes = lapply(documents, function(x) rbinom(100, size=1, prob=1/(1+exp(-x))))

## a document consists of several paragraphs, but the document is assigned the most common topic
document_classes = (sapply(paragraph_classes, function(x) sum(x>0)) > 50)

unlisted_documents = unlist(documents)
unlisted_class = rep(document_classes, each=100)

## the original correct classification rate is around 54%
table(unlist(paragraph_classes), unlisted_class)
## build a model. each paragraph's response variable is the original class assigned to its document
model = glm(unlisted_class ~ unlisted_documents, family="binomial")

## the predicted classification using the model on the data - can see classification results
## are over 14% greater than the original classifications.
## scale() centres the linear predictor, so values above 0 are above the average prediction
predicted_values = scale(predict(model))
table(predicted_values > 0, unlist(paragraph_classes) > .5)
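The tables above show counts, so the comparison has to be read off by hand. As a rough sketch, the same simulation can be rerun and summarised as two accuracy figures. The seed, the helper `accuracy()`, and the variable names here are my own assumptions for illustration, not part of the original code, and the exact figures will vary with the seed.

```r
## Assumption: a fixed seed, chosen only so the sketch is reproducible
set.seed(42)

n_docs <- 100   # number of documents
n_pars <- 100   # paragraphs per document

## same simulation as above: one normal draw per paragraph,
## and a paragraph topic drawn with a logistic probability of that value
documents <- lapply(1:n_docs, function(i) rnorm(n_pars))
paragraph_classes <- lapply(documents, function(x)
  rbinom(n_pars, size = 1, prob = 1 / (1 + exp(-x))))

## each document is assigned its most common paragraph topic
document_classes <- sapply(paragraph_classes, sum) > n_pars / 2

x <- unlist(documents)
true_class <- unlist(paragraph_classes) == 1
inherited_class <- rep(document_classes, each = n_pars)

## fit on the inherited document labels, then classify each paragraph
fit <- glm(inherited_class ~ x, family = "binomial")
predicted_class <- as.vector(scale(predict(fit))) > 0

## hypothetical helper: share of paragraphs whose label matches the true topic
accuracy <- function(predicted, actual) mean(predicted == actual)

baseline_accuracy <- accuracy(inherited_class, true_class)
model_accuracy <- accuracy(predicted_class, true_class)

cat("inherited document label:", baseline_accuracy, "\n")
cat("model prediction:        ", model_accuracy, "\n")
```

Note that the model only ever sees the noisy document-level labels; the true paragraph topics are used solely to score the result.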