By Karthika Swamy Cohen
In his talk titled “Inferring Roles of People in Social Networks Extracted from New York Times Articles,” part of the minisymposium on inferring networks from non-network data at the SIAM Annual Meeting, Joel Acevedo-Aviles described how to predict the roles of people in a network using two graph-construction techniques: co-occurrence and neighborhood. He demonstrated that both techniques can predict a person’s role with a reasonable degree of accuracy.
After explaining how data was collected, Acevedo-Aviles went on to demonstrate how features were extracted and models were created. The two techniques were then tested for accuracy and error rate.
Network analysis was performed on two types of graphs: co-occurrence and neighborhood. A co-occurrence network connects two people who are mentioned in the same document; the connections found in individual documents are then merged into a single graph across all documents. The group used a rule-based system with string matching to record the number of documents in which two entities are mentioned together.
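The co-occurrence construction described above can be sketched in a few lines of Python. This is a minimal illustration, not the group's implementation: it assumes the people in each document have already been identified (the rule-based string matching step is not shown), and the function name `cooccurrence_graph` and the sample names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_graph(doc_entities):
    """Count, for each pair of people, the documents that mention both.

    doc_entities: one collection of person names per document
    (assumed to come from an earlier entity-matching step).
    Returns a Counter mapping (name_a, name_b) -> document count.
    """
    edges = Counter()
    for names in doc_entities:
        # Deduplicate within a document so each pair counts once per document.
        for a, b in combinations(sorted(set(names)), 2):
            edges[(a, b)] += 1
    return edges


# Hypothetical toy corpus: three documents and the people each one mentions.
docs = [
    ["Alice Smith", "Bob Jones", "Carol Lee"],
    ["Alice Smith", "Bob Jones"],
    ["Carol Lee"],
]
graph = cooccurrence_graph(docs)
```

Here the edge weight is the number of shared documents, which matches the count the article says the group recorded; sorting each pair keeps the graph undirected.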
The other type of graph used for this analysis is a neighborhood graph, which connects people whose mentions appear within a fixed character offset of each other in a document. In this study, for example, entities were considered connected if they appeared within 260 characters of each other in a document.
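A neighborhood graph can be sketched the same way. The version below is only an illustration under stated assumptions: each mention is given as a hypothetical `(name, character offset)` pair, distance is measured between the starting offsets of two mentions, and a pair is counted at most once per document. The talk summary gives the 260-character window but not these details.

```python
from collections import Counter
from itertools import combinations


def neighborhood_graph(doc_mentions, window=260):
    """Link people whose mentions fall within `window` characters of each other.

    doc_mentions: one list per document of (name, char_offset) tuples
    (offsets are assumed to mark where each mention starts).
    Returns a Counter mapping (name_a, name_b) -> number of documents
    in which the two people appear close together.
    """
    edges = Counter()
    for mentions in doc_mentions:
        linked = set()
        for (a, pos_a), (b, pos_b) in combinations(mentions, 2):
            if a != b and abs(pos_a - pos_b) <= window:
                linked.add(tuple(sorted((a, b))))
        # Count each linked pair once per document.
        edges.update(linked)
    return edges


# Hypothetical mentions with character offsets for two documents.
mentions_by_doc = [
    [("Alice Smith", 10), ("Bob Jones", 200), ("Carol Lee", 900)],
    [("Alice Smith", 0), ("Carol Lee", 100), ("Alice Smith", 150)],
]
near = neighborhood_graph(mentions_by_doc)
```

In the first toy document only Alice Smith and Bob Jones are linked, because Carol Lee is mentioned more than 260 characters away from both, even though all three would be connected in a co-occurrence graph.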
After network analysis using the above two methods, Acevedo-Aviles’s group used a standard supervised learning approach to extract features and create models. They classified all entities in the network according to their determined roles as politician, sportsperson, entertainer/artist, or other.
The group then ran three classifiers on the data: decision trees, support vector machines, and Naive Bayes. They used co-occurrence graphs in one experiment and neighborhood graphs in the second.
They found that combining graph and text features and using support vector machines achieved the lowest error rate, making it the best system for this sort of role prediction. The system performed role classification with a 38.8 percent error rate.
Support vector machines outperformed all other classifiers evaluated, and combining text features with graph features resulted in the best performance. Using neighborhood graphs instead of co-occurrence graphs did not have a significant impact on performance; hence, neighborhood graphs can be used to significantly decrease computation time.
Future work will be directed toward expanding and cleaning the annotation. For instance, the techniques should be able to distinguish between two different individuals with the same name, one of whom is significantly more famous than the other. Validating the approach on NYT data from other years would be another significant step. Networks should also be characterized so that the properties that make role prediction successful can be applied to other networks.