Semi-automated exploration and extraction of data in scientific tables

Wednesday, September 26, 2018 - 5:00pm to 6:30pm
Columbia University
New York, NY 10027
United States

Columbia Data Science Institute Industry Innovation Seminars

Ron Daniel, Jessica Cox, Corey Harper

Most of the experimental results reported in scientific articles, and recorded in databases or in supplements to the article, are provided in tables. Unfortunately, the amazing recent progress in natural language understanding is of little help if we want to automatically understand those tables. Tables are, after all, not your grandmother’s natural language. Despite this, we believe significant progress can be made towards the goal of combining tables of related information into larger sets that can be analyzed, visualized, understood, and used as the basis for decisions. Elsevier Labs is prototyping tools to help guide people in the exploration of tables from many articles and in the extraction and merging of the data they contain. This talk will show examples of what has been accomplished by manually merging such data. With those as examples of the desired outcomes, we will describe our experiments to duplicate such examples, the workflow in which they operate, and our most recent results.


Ron Daniel is the Director of Elsevier Labs, an R&D group that concentrates on smart content and on the future of scholarly communications. Educated as an electrical engineer, Ron has done extensive work on metadata standards such as the Dublin Core, RDF, and PRISM. Before joining Elsevier in 2010, he worked at a startup that was acquired for its automatic classification technology, and consulted on taxonomy and information management issues for nine years. Ron received his Ph.D. in Electrical Engineering from Oklahoma State University, and was a postdoctoral researcher at Cambridge University and at Los Alamos National Laboratory. Ron is bemused by the way technology reincarnates itself, specifically in the way that parallel implementations of neural networks for machine vision are currently in vogue, just as they were 30 years ago when he was working on them in grad school.

Jessica Cox received her Ph.D. in Biomedical Science with an emphasis in environmental health and nutrition from The Ohio State University. She completed a postdoctoral fellowship at Columbia University, where she researched associations between arsenic metabolism and nutritional biomarkers in vulnerable populations within the United States and Bangladesh. Currently, her research interests lie in research integrity and reproducibility in science. Additionally, Jessica continues to study how researchers can use data analytics, statistics, and programming to transform findings into interesting stories to share with peers and stakeholders.

Corey Harper spent nearly 15 years building digital libraries, administering library systems, and managing library metadata. He has held metadata librarian positions at both New York University and the University of Oregon, where his research focused on linked data, digital repositories, and library discovery. His current research interests include natural language processing, machine learning, predictive analytics, and data visualization with applications toward issues around research communications. In addition, he is involved in both the Digital Public Library of America (DPLA) and code4lib communities. Corey has an MBA from NYU's Stern School of Business and an MSLS from the University of North Carolina.
