A New Way To Reason
Guest Post by SERENA LOTRECK
MSU SciComm Blog Contest winner
How can we know what we don’t know?
My tenth grade history teacher loved this refrain. “You can’t know what you don’t know!” I never paid it more attention than to think it was a funny saying; I never thought about whether or not it was true. If my brain and I were left to our own devices, it might be, but this witty saying was invented long before the existence of Knowledge Graph Reasoning, which has allowed us to fill in many of those unknowable areas.
Knowledge Graph Reasoning is the process of predicting missing connections in a knowledge graph; this is a way of figuring out exactly what it is that we don’t know. But what is a knowledge graph? A knowledge graph is simply a set of things or ideas, called entities, and the connections between them, called relations. The term Knowledge Graph was popularized by Google in 2012, and it’s what powers the Google search engine today. However, the idea of representing information as a graph can be traced back to classical Artificial Intelligence research in the 1980s. The entities and relations in a graph are extracted from human-written text, which can be anything: in the case of the Google Knowledge Graph, the text comes mostly from structured web pages. However, we can also think about extracting information from other sources, such as scientific papers, which is where this tool starts to become extremely exciting from a researcher’s perspective.
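To make the idea of entities and relations concrete, here is a minimal sketch of one common way to store a knowledge graph: as a set of (head entity, relation, tail entity) triples. The entity names, relation names, and helper functions are all made up for illustration and do not come from any real knowledge graph.

```python
# A toy knowledge graph stored as (head entity, relation, tail entity) triples.
# Every name below is a hypothetical example, not real extracted data.
triples = {
    ("oak", "is_a", "tree"),
    ("tree", "has_part", "branch"),
    ("branch", "grows_toward", "light"),
}


def relations_from(entity, graph):
    """Return every (relation, tail) pair whose head is `entity`."""
    return sorted((rel, tail) for head, rel, tail in graph if head == entity)


def entities(graph):
    """Collect every entity that appears as a head or a tail."""
    return {e for head, _, tail in graph for e in (head, tail)}


print(relations_from("tree", triples))
print(sorted(entities(triples)))
```

Storing the graph as plain triples keeps the sketch small; real systems typically use a dedicated graph database or triple store, but the underlying picture of entities joined by named relations is the same.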
When scientists want to start a new research project, we spend a lot of time looking for articles that have already been published on our topic of interest. The scientific literature, as this body of articles is called, is widely viewed as our collective knowledge: a collection of all the things we know or suspect, based on rigorous experimentation and evidence, to be true. So the first step in any research project is to learn everything we can about our chosen topic. Then we want to figure out what it is that we (as a collective) don’t know. This second step is called hypothesis generation: we generate ideas about things that might be true but that we haven’t studied yet.
For example, if I were interested in researching how trees’ branch patterns are determined, I would first search existing publications. Have people studied tree branch patterns before? What have they found? Is branching caused by genetics? By the environment? After reading many papers, I would probably start to notice some patterns about things that aren’t addressed. Maybe no one has said anything about how shade from other trees impacts how many branches a tree has, or in what direction they grow. Maybe there are repeated questions posed in different papers, about something we want to know but can’t figure out. So next, I would search for what I think we don’t know, with keywords like “influence of shade on tree branching structure”. If I can’t find any established information about how shade influences branching structure, I’d probably turn to my office mate and say, “I found it! The gap in the literature!” This is how we often refer to the things we don’t know: as gaps in the literature. This gap would then be the starting point for my research project.
At this point you’re probably thinking that it seems like we actually do have a pretty good system for figuring out what we don’t know: read the literature, see what we do know, and whatever isn’t there is what we don’t know. However, there’s a catch. To date, more than 50 million scientific articles have been published. Think about that: there is at least one scientific article for every single person who lives in the country of Colombia. Even if we narrow down by specific sub-disciplines, there are still too many papers to read. Searching PubMed, a database of scientific articles in the medical and life sciences, for the keywords “flowering time” (a research area concerned with what determines when in the growing season plants start to flower) yields 16,372 articles. This is more than I will read over the course of my entire lifetime in all disciplines combined. I can personally attest to the frustration this overwhelming amount of literature causes: during my undergraduate honors thesis, after finishing up my project, I sat down to write. After many frantic hours holed up in a corner of the library, I stumbled upon a paper showing that the gap in the literature I thought I had found had actually already been filled. I was analyzing plants for pesticide absorption, but this paper showed that my target pesticide actually broke down into another compound inside the plants, and that I had been searching for the wrong chemical signature in my samples. This one paper would have changed how I did my entire experiment, and I didn’t find it until it was too late to go back and look again.
So how do we make sure we’ve gotten all the important information before starting a project? How do we find gaps and generate hypotheses that are novel, when there is such an astronomical amount of literature to read? This is where computers come in. We often think of computers as being much smarter than ourselves, but this is mainly because they’re much faster at doing the tasks we give them. Humans are much better at tasks like reading natural (human) language, because we can easily detect context to interpret the meaning of text. However, in recent years we’ve been getting decently good at programming computers to parse the basics of our languages. This field is called Natural Language Processing, or NLP. Even though computers might not be able to understand the meaning of everything they read, they can “read” documents (skim over all the words) much faster than we can. The idea behind Knowledge Graph Reasoning is the following: First, we use NLP tools to “read” the literature, all of it. Then we use an algorithm to construct a knowledge graph of the information we find. A knowledge graph represents this information as a network, where ideas and things are connected to each other by relationships. Finally, we check this graph using machine learning tools to look for missing links, which is the Reasoning step. These missing links represent predicted information: information that is probably true, based on what we know already. If that sounds familiar, remember what our process for starting a research project looked like! We try to read as much literature as we can, find missing information, and infer what that information might be, so we can test it with experiments.
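The three steps above (read, build the graph, look for missing links) can be sketched in a few lines of Python. This is only a toy under loud assumptions: the sentences and gene names are invented, the "reading" step is a single hand-written pattern instead of real NLP tools, and the Reasoning step is a simple common-neighbors heuristic swapped in for the machine learning models used in practice. All function names are hypothetical.

```python
import re
from itertools import combinations

# Invented example sentences; the gene names are placeholders, not real findings.
SENTENCES = [
    "geneA activates geneC.",
    "geneB activates geneC.",
    "geneA inhibits geneD.",
    "geneB inhibits geneD.",
    "geneA activates geneE.",
    "This sentence does not match the pattern.",
]

# Step 1: "read" the text with a deliberately naive pattern.
PATTERN = re.compile(r"^(\w+) (activates|inhibits) (\w+)\.$")


def extract_triples(sentences):
    """Turn sentences like 'X activates Y.' into (head, relation, tail) triples."""
    found = set()
    for sentence in sentences:
        match = PATTERN.match(sentence.strip())
        if match:
            found.add(match.groups())
    return found


# Step 2: build the graph as a map from each entity to its direct neighbors.
def build_neighbors(triples):
    neighbors = {}
    for head, _, tail in triples:
        neighbors.setdefault(head, set()).add(tail)
        neighbors.setdefault(tail, set()).add(head)
    return neighbors


# Step 3: score unlinked pairs by how many neighbors they share.
def candidate_links(triples):
    """Return (entity, entity, shared_neighbor_count), highest count first."""
    neighbors = build_neighbors(triples)
    scored = []
    for a, b in combinations(sorted(neighbors), 2):
        if b in neighbors[a]:
            continue  # already connected, so not a missing link
        shared = len(neighbors[a] & neighbors[b])
        if shared:
            scored.append((a, b, shared))
    return sorted(scored, key=lambda item: (-item[2], item[0], item[1]))


graph = extract_triples(SENTENCES)
for a, b, shared in candidate_links(graph):
    print(f"{a} and {b} share {shared} neighbor(s)")
```

The heuristic only says that two entities might be related, not how; predicting the type of the missing relation is the harder part that trained models are used for.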
Consider the following example, taken from a book about knowledge graph construction. Black lines represent relationships present in the text, and dotted red lines represent relationships that are not stated in the text but may be true.
Image: Mayank Kejriwal, Domain Specific Knowledge Graph Construction
In this example, the relationship “Hillary Clinton is the first lady of the Clinton Administration” is explicitly stated somewhere in the documents we asked our program to read. The phrase “Michelle Obama is the first lady of the Obama Administration”, by contrast, appears nowhere in the text. Even so, the algorithm we use to complete the graph, by performing Reasoning, is able to deduce that, because she is the spouse of the president, she is also the first lady.
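The deduction in this example can be written as a small piece of Python. This is a sketch under assumptions: the facts and relation names below are my own stand-ins and are not copied from the figure, and the rule is written by hand, whereas a Reasoning algorithm would typically learn such patterns from the graph itself.

```python
# Facts assumed to have been extracted from text. The relation names
# (president_of, spouse_of, first_lady_of) are hypothetical labels.
facts = {
    ("Bill Clinton", "president_of", "Clinton Administration"),
    ("Hillary Clinton", "spouse_of", "Bill Clinton"),
    ("Hillary Clinton", "first_lady_of", "Clinton Administration"),
    ("Barack Obama", "president_of", "Obama Administration"),
    ("Michelle Obama", "spouse_of", "Barack Obama"),
}


def infer_first_ladies(graph):
    """Apply the hand-written rule:
    X spouse_of Y and Y president_of A  =>  X first_lady_of A.
    Only links that are not already in the graph are returned."""
    inferred = set()
    for person, rel1, partner in graph:
        if rel1 != "spouse_of":
            continue
        for leader, rel2, administration in graph:
            if rel2 == "president_of" and leader == partner:
                candidate = (person, "first_lady_of", administration)
                if candidate not in graph:
                    inferred.add(candidate)
    return inferred


print(infer_first_ladies(facts))
```

Only the Michelle Obama link comes back as a prediction, because the Hillary Clinton link is already stated in the graph.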
This may seem obvious: any human, after looking at the graph for a few seconds, would make this connection. However, that’s what’s so exciting about Knowledge Graph Reasoning: the kinds of graphs we can generate are so large that we would never be able to spot these kinds of missing connections ourselves. Even more exciting is the idea that many of these connections may not be intuitive, even if we could look at a small segment of a larger graph, and that they would go unnoticed if humans were left to read papers the way we always have. Knowledge Graph Reasoning represents an exciting step forward in the basic processes of scientific research, helping researchers find hypotheses to drive experiments more efficiently. Knowledge Graph Reasoning has enormous potential to further human knowledge in all areas of science: to aid human health by helping researchers develop novel ways to treat disease; to develop a more sustainable world, by driving research in renewable energy technologies; and to feed our growing population in an ever-changing world, by driving developments in precision agriculture and breeding.
H. Paulheim, “Knowledge graph refinement: A survey of approaches and evaluation methods,” Semantic Web, vol. 8, no. 3, pp. 489–508, Jan. 2017, doi: 10.3233/SW-160218.
A. E. Jinha, “Article 50 million: an estimate of the number of scholarly articles in existence,” Learn. Publ., vol. 23, no. 3, pp. 258–263, 2010, doi: 10.1087/20100308.
M. Kejriwal, Domain-Specific Knowledge Graph Construction. Cham: Springer International Publishing, 2019.
SERENA LOTRECK is a graduate student in the Plant Biology PhD program at Michigan State University.