Etymo: A Graphic Search Engine for Data Scientists

By Lina Sorg

Though academic search engines regularly return many pages of results, some are limited in the scope of their offerings. Additionally, citation count often directly affects ranking. As a result, relevant papers sometimes remain hidden from researchers who would benefit from their insight. During a minisymposium at the 2017 SIAM Annual Meeting, held last week in Pittsburgh, Pa., Weijian Zhang of the University of Manchester presented a novel, scalable graphic search engine aimed at resolving limitations of current academic search engines. The focus of the search engine, called Etymo, makes it particularly useful for data scientists.

“When you search something on Google, you will usually find the popular papers—the mainstream papers—but it’s hard to find papers from a different perspective,” Zhang said. “It’s very hard to find the papers that are interesting but have low citations.” He focused his efforts on citation counts, due to their sizable influence on search results. “Citations are not only affected by the paper’s scientific content, but also by other factors, such as the journal in which a paper appears, the author’s reputation, and social effects,” Zhang continued. “So what if we build a new network that’s much fairer, where everyone has equal say?”

Zhang used that question as inspiration for Etymo, which is currently live and available for use. Etymo functions like a standard web mapping service (e.g., Google Maps), allowing users to zoom in on searches for related papers or zoom out for a broader look at the existing research landscape. Because Etymo centers mainly on data science papers, Zhang searched for “deep learning” and compared the results with those from Google Scholar, which uses PageRank. Etymo revealed interesting structures within the search: smaller subtopics (such as “reinforcement learning”) that might prove valuable. “The engine automatically identifies small clusters within a broad general search,” Zhang said.

He acknowledged that constructing a network from data is not a new concept. For example, researchers have abstracted brain networks from R-fMRI data and the GO network from Gene Ontology annotations. But advances in machine learning have further developed the network-building process. Zhang located millions of research papers, represented as nodes in 3D space. While reading a paper is the best way to assess its importance, no one could possibly read and sort through all published research articles by hand. “How about we ask a machine to read all the papers and tell us how the papers are related?” Zhang asked. “Because of the development of machine learning in general, we can build a network from the unstructured data, which was previously impossible.” He clarified that he uses machine learning to build the network, not to rank the papers.

Zhang’s process turns each research paper into a high-dimensional vector representation, and uses cosine similarity to quantify the likeness of two papers. Etymo employs a term frequency (the number of times a term appears in a given document) and an inverse document frequency to determine which words carry “weight” and thus act as identifiers. “If a word appears frequently in a document, it’s important,” Zhang said. “But if it’s appearing in many, many documents, it’s not a unique identifier.” He then used temporal information retrieval to build directed graphs. The results are arranged so that similar papers sit close to one another and all papers are layered for efficient zooming in and out.
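The weighting and similarity steps described above can be illustrated with a minimal sketch. It is not Etymo's implementation: the whitespace tokenizer, the smoothed inverse document frequency, the function names, and the three toy abstracts are all assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)  # term frequency within this document
        # Assumed smoothing: idf = log((1 + N) / (1 + df)) + 1, always positive.
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1.0)
            for term, count in tf.items()
        })
    return vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical abstracts, shortened to a few words each.
abstracts = [
    "deep learning for image recognition",
    "deep reinforcement learning for game playing",
    "spectral methods for sparse linear systems",
]
vecs = tfidf_vectors(abstracts)
sim_related = cosine_similarity(vecs[0], vecs[1])
sim_unrelated = cosine_similarity(vecs[0], vecs[2])
```

In this toy corpus the word “for” appears in every abstract and so receives the lowest weight, while the two machine learning abstracts score as more similar to each other than either does to the numerical analysis one.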

PageRank techniques displayed for a simple network. A site's PageRank value depends on both the importance and number of connecting links. Image courtesy of Wikimedia Commons.

Zhang then transitioned into a conversation about internet search. PageRank, perhaps the best-known algorithm for ranking the importance of web pages, was introduced in 1998. He also spoke of Hyperlink-Induced Topic Search (HITS), which functions similarly to PageRank but computes two rankings rather than one: hubs (broader compilation sites) and authorities (direct scholarly sites). “Good authorities are pointed by good hubs, and good hubs point to good authorities,” Zhang said. In essence, an effective hub will lead directly to authoritative pages, and effective authoritative pages will be linked to by many hubs.
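The mutual reinforcement between hubs and authorities can be sketched as a simple iteration. The following is a generic, textbook-style version of HITS rather than Etymo's code; the function names, the fixed iteration count, and the small example graph are assumptions.

```python
import math


def normalize(scores):
    """Scale scores to unit Euclidean length so they stay bounded."""
    norm = math.sqrt(sum(value * value for value in scores.values()))
    if norm == 0.0:
        return dict(scores)
    return {node: value / norm for node, value in scores.items()}


def hits(edges, iterations=50):
    """Compute hub and authority scores for a directed graph.

    edges: iterable of (source, target) pairs, meaning source links to target.
    """
    edges = list(edges)
    nodes = sorted({node for edge in edges for node in edge})
    hubs = {node: 1.0 for node in nodes}
    auths = {node: 1.0 for node in nodes}
    for _ in range(iterations):
        # Authority update: sum the hub scores of every node linking in.
        new_auths = {node: 0.0 for node in nodes}
        for src, dst in edges:
            new_auths[dst] += hubs[src]
        # Hub update: sum the new authority scores of every node linked to.
        new_hubs = {node: 0.0 for node in nodes}
        for src, dst in edges:
            new_hubs[src] += new_auths[dst]
        auths = normalize(new_auths)
        hubs = normalize(new_hubs)
    return hubs, auths


# Hypothetical graph: three survey pages linking to two papers.
links = [
    ("survey_a", "paper_x"), ("survey_a", "paper_y"),
    ("survey_b", "paper_x"), ("survey_b", "paper_y"),
    ("survey_c", "paper_x"),
]
hub_scores, auth_scores = hits(links)
```

Here `paper_x` ends up with the highest authority score because all three surveys point to it, and `survey_a` and `survey_b` outrank `survey_c` as hubs because they point to both papers.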

“Etymo uses standard information retrieval measures and HITS scores to rank search results,” Zhang said. While Google simply provides a very long list of research, Etymo includes all related papers with high HITS scores. This is important, as most people only closely examine the first page of search results.

Though PageRank is most beneficial for what Zhang calls “unspecified queries” with messy data, Etymo takes a different approach. “We want to build a comprehensive picture around a general search query,” he said. “We think this is very interesting for the homogenous data.” Zhang and his team monitor Etymo’s data and adjust the network when necessary. “For example, if a high position paper is ignored by users and gets no clicks, we weaken its connected edges in the subgraph,” he said. Additionally, machine learning-based dimensionality reduction techniques are essential for the visualization of high-dimensional data. Zhang therefore compared various reduction methods and found that t-distributed stochastic neighbor embedding yielded the best results.
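The click-based adjustment Zhang mentioned might look something like the sketch below. The talk did not describe how the weakening is actually done, so everything here is hypothetical: the function name, the data layout, the impression threshold, and the decay factor are illustrative assumptions only.

```python
def weaken_ignored_edges(edge_weights, impressions, clicks,
                         min_impressions=100, decay=0.9):
    """Return a copy of edge_weights with edges of ignored papers scaled down.

    edge_weights: dict mapping (paper_a, paper_b) -> similarity weight
    impressions:  dict mapping paper -> times shown in search results
    clicks:       dict mapping paper -> times clicked
    min_impressions and decay are illustrative values, not Etymo's settings.
    """
    # A paper counts as ignored if it was shown often but never clicked.
    ignored = {
        paper for paper, shown in impressions.items()
        if shown >= min_impressions and clicks.get(paper, 0) == 0
    }
    return {
        (a, b): weight * decay if (a in ignored or b in ignored) else weight
        for (a, b), weight in edge_weights.items()
    }


# Hypothetical usage: p1 is shown 500 times and never clicked.
weights = {("p1", "p2"): 1.0, ("p2", "p3"): 0.5}
shown = {"p1": 500, "p2": 20, "p3": 300}
clicked = {"p3": 12}
updated = weaken_ignored_edges(weights, shown, clicked)
```

Only the edge touching `p1` is reduced. `p2` has too few impressions to be judged, and `p3` has received clicks, so the edge between them keeps its weight.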

Ultimately, Etymo serves to build a more complete picture around a broad search and strengthen data scientists’ understanding of specific, current research and the more general research field. Zhang hopes to improve the quality of researchers’ academic searches, as temporal networks can yield more accurate results. Advances in machine learning have made this possible. “Machine learning and network science can offer new opportunities in this network science field,” Zhang said.


Lina Sorg is the associate editor of SIAM News.