In his talk titled “Inferring Roles of People in Social Networks Extracted from New York Times Articles,” part of the minisymposium on inferring networks from non-network data at the SIAM Annual Meeting, Joel Acevedo-Aviles described how to predict the role of a person in a network using two techniques: co-occurrence graphs and neighborhood graphs. He demonstrated that these techniques can predict a person’s role with a good degree of accuracy.
After explaining how data was collected, Acevedo-Aviles went on to demonstrate how features were extracted and models were created. The two techniques were then tested for accuracy and error rate.
Predicting roles based on text and network topology. Image credit: Joel Acevedo-Aviles, AN16 presentation.
The group started with data from the New York Times (NYT) Annotated Corpus, which includes all articles, along with associated metadata, published by the NYT between 1987 and 2007. They then classified the roles of all individuals extracted from the data as artist/entertainer, sportsperson, politician, or other based on wiki data.
Network analysis was performed on two types of graphs: co-occurrence and neighborhood. A co-occurrence network connects two people mentioned in the same document; a graph is then generated by merging these connected entities across documents. The group used a rule-based system with string matching to record the number of documents in which two entities are mentioned together.
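The document-level counting step can be sketched in a few lines of Python. This is a minimal illustration, not the group’s actual pipeline: the input is assumed to be documents already reduced to lists of resolved entity names, with entity resolution (the string-matching step) done upstream.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_graph(docs):
    """Count, for each pair of entities, the number of documents in
    which both are mentioned. `docs` is a list of entity-name lists,
    one per document (a simplifying assumption for illustration)."""
    edge_counts = Counter()
    for entities in docs:
        # deduplicate so each document contributes at most one count per pair
        for a, b in combinations(sorted(set(entities)), 2):
            edge_counts[(a, b)] += 1
    return edge_counts

docs = [
    ["Alice", "Bob", "Carol"],   # hypothetical resolved entities
    ["Alice", "Bob"],
    ["Bob", "Dave"],
]
counts = cooccurrence_graph(docs)
print(counts[("Alice", "Bob")])  # 2: they co-occur in two documents
```

The edge weights (document counts) then become the merged co-occurrence graph across the whole corpus.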
The second type of graph used in this analysis is a neighborhood graph, which connects people whose mentions fall within a fixed character offset of each other in a document. In this study, entities were considered connected if they appeared within 260 characters of each other.
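A minimal sketch of the neighborhood criterion, assuming mentions have already been located as (entity, character-offset) pairs; the offset extraction itself is not shown:

```python
def neighborhood_edges(mentions, window=260):
    """Connect two entities whose mentions fall within `window`
    characters of each other in a single document. `mentions` is a
    list of (entity, char_offset) pairs (assumed resolved upstream)."""
    edges = set()
    ordered = sorted(mentions, key=lambda m: m[1])
    for i, (a, pos_a) in enumerate(ordered):
        for b, pos_b in ordered[i + 1:]:
            if pos_b - pos_a > window:
                break  # mentions are sorted, so later ones are farther away
            if a != b:
                edges.add(tuple(sorted((a, b))))
    return edges

# Hypothetical mention offsets within one article:
mentions = [("Alice", 10), ("Bob", 200), ("Carol", 600)]
print(neighborhood_edges(mentions))  # {('Alice', 'Bob')}: Carol is too far away
```

Because each mention is only compared against mentions inside its window rather than against every entity in the document, this graph is typically sparser than the co-occurrence graph.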
After network analysis using the above two methods, Acevedo-Aviles’s group used a standard supervised learning approach to extract features and create models. They classified all entities in the network according to their determined roles as politician, sportsperson, entertainer/artist, or other.
A rule-based system with string matching is used for entity resolution. Image credit: Joel Acevedo-Aviles, AN16 presentation.
Two types of features were then extracted from the data: graph features and text features. Graph features included standard metrics such as PageRank, betweenness centrality, and degree centrality. Text features consisted of metadata from the NYT, which was used to group individuals into the categories or roles mentioned above.
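Two of the graph features named above can be computed directly from an adjacency structure. The sketch below is a pure-Python illustration (in practice a graph library would be used); the power-iteration PageRank is a simplified version of the standard algorithm.

```python
def degree_centrality(adj):
    """Fraction of the other nodes each node is connected to.
    `adj` maps each node to a set of neighbors."""
    n = len(adj)
    return {v: len(nbrs) / (n - 1) for v, nbrs in adj.items()}

def pagerank(adj, damping=0.85, iters=50):
    """Simple power-iteration PageRank over an adjacency dict."""
    n = len(adj)
    rank = {v: 1.0 / n for v in adj}
    for _ in range(iters):
        new = {}
        for v in adj:
            # sum contributions from every node that links to v
            incoming = sum(rank[u] / len(adj[u]) for u in adj if v in adj[u])
            new[v] = (1 - damping) / n + damping * incoming
        rank = new
    return rank

# Hypothetical tiny network: A is connected to both B and C.
adj = {"A": {"B", "C"}, "B": {"A"}, "C": {"A"}}
print(degree_centrality(adj)["A"])  # 1.0: A touches every other node
```

Each node’s feature values (centralities, PageRank) are then concatenated with its text features to form the input vector for the classifiers.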
The group then ran three classifiers on the data: decision trees, support vector machines, and naive Bayes. They used co-occurrence graphs in one experiment and neighborhood graphs in a second.
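To make the classification step concrete, here is a minimal multinomial naive Bayes classifier with add-one smoothing, one of the three classifier families evaluated. The training snippets and role labels are invented for illustration; the real system trained on NYT metadata features.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Minimal multinomial naive Bayes with add-one smoothing."""

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)  # label -> word frequencies
        self.label_counts = Counter(labels)      # label -> class prior counts
        self.vocab = set()
        for text, label in zip(texts, labels):
            words = text.lower().split()
            self.word_counts[label].update(words)
            self.vocab.update(words)
        return self

    def predict(self, text):
        best, best_lp = None, float("-inf")
        total = sum(self.label_counts.values())
        for label in self.label_counts:
            # log prior plus smoothed log likelihood of each word
            lp = math.log(self.label_counts[label] / total)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for w in text.lower().split():
                lp += math.log((self.word_counts[label][w] + 1) / denom)
            if lp > best_lp:
                best, best_lp = label, lp
        return best

# Hypothetical training data standing in for NYT metadata text:
nb = NaiveBayes().fit(
    ["won the election senate", "scored a goal in the match", "starred in a film"],
    ["politician", "sportsperson", "artist"],
)
print(nb.predict("won the senate vote"))  # politician
```

The error rate reported in the talk is simply the fraction of held-out entities whose predicted role disagrees with the annotated role.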
They found that combining graph and text features and using support vector machines achieved the lowest error rate, making it the best system for this sort of role prediction; it performed role classification with a 38.8 percent error rate.
Support vector machines outperformed all other classifiers evaluated, and combining text features with graph features yielded the best performance. Switching from co-occurrence graphs to neighborhood graphs did not significantly affect accuracy, so neighborhood graphs can be used to significantly decrease computation time.
Future work will focus on expanding and cleaning the annotation: for instance, the techniques should be able to distinguish two different individuals who share a name, one significantly more famous than the other. Validating the approach on NYT data from other years would be another significant step, as would characterizing the networks so that the features that work well here can be applied to other networks.
Karthika Swamy Cohen is the managing editor of SIAM News.