Research Projects


Twitter is full of posts tagged #BigData and #DataScience. Which ones do people pay the most attention to? In a project for Synergic Partners (now integrated within LUCA, Telefonica Data Unit), this team used network science and text-mining techniques to identify Twitter influencers in data science. They built a projected network by combining “retweet” and “mention” layers into a single layer and discovered communities using the K-Clique, Modularity, Random Walk and Mixed Membership Blockmodel community detection algorithms. They identified community influencers using centrality metrics and characterized users and communities using Latent Dirichlet Allocation (LDA). With a limited dataset of fewer than 200,000 tweets, they found that the modularity and random walk techniques produced the most coherent communities based on user demographics and influencers. An interactive visualization showed each community’s network and user demographics. Students: Casey Huang, Claire Liu, Jordan Rosenblum and Steven Royce. The random walk and modularity algorithms independently picked out a community of 5,115 Twitter users, largely concentrated in France. These techniques, along with others the team analyzed, can help identify influencers within communities and those who bridge communities and are thus effective targets for a marketing campaign.
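As a rough illustration of the layer-combining and modularity ideas described in this project, the sketch below merges two toy edge lists into one weighted network and scores a candidate community split. It is not the team's implementation: the user names, the choice to treat edges as undirected with unit weight per interaction, and the function names `combine_layers` and `modularity` are all assumptions made for the example.

```python
from collections import defaultdict


def combine_layers(*layers):
    """Merge (source, target) edge lists into one undirected weighted graph.

    Assumption: each retweet or mention adds weight 1 to the pair's edge.
    """
    weights = defaultdict(float)
    for layer in layers:
        for source, target in layer:
            if source == target:
                continue  # ignore self-mentions
            weights[tuple(sorted((source, target)))] += 1.0
    return dict(weights)


def modularity(weights, community_of):
    """Newman modularity of a partition on an undirected weighted graph."""
    total = sum(weights.values())
    internal = defaultdict(float)   # edge weight inside each community
    strength = defaultdict(float)   # summed node strength per community
    for (u, v), w in weights.items():
        strength[community_of[u]] += w
        strength[community_of[v]] += w
        if community_of[u] == community_of[v]:
            internal[community_of[u]] += w
    return sum(
        internal[c] / total - (strength[c] / (2 * total)) ** 2
        for c in strength
    )


# Hypothetical toy data: two tight groups joined by a single mention.
retweets = [("a", "b"), ("b", "c"), ("d", "e"), ("e", "f")]
mentions = [("c", "a"), ("f", "d"), ("c", "d")]
network = combine_layers(retweets, mentions)

split = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
print(round(modularity(network, split), 4))
```

A modularity-based community detection algorithm searches for the partition that maximizes this score; here the score is only evaluated for one hand-picked split.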
Understanding legal precedents is critical to prosecutors and defense lawyers in plotting their strategy at trial and predicting the trial's outcome. In a project for Synergic Partners (now integrated within LUCA, Telefonica Data Unit), a consulting firm in Spain, this team used case law from the United Nations Office on Drugs and Crime (UNODC) to develop an interactive system for analyzing the data. They demonstrated that the dashboard they developed could be connected to an external database to pull in additional information to complement the UNODC’s data. The team used Python, Kibana, Elasticsearch and Shiny to explore, validate and visualize their data and results. Students: Lin He, Mandeep Singh, Bella Wang and Barbara Welsh. The interactive dashboard provides an overview of the 1,940 criminal cases in the UNODC’s database and includes the country in which charges were filed and the highest court in which the accused faced trial. The filtered results summarize cases involving criminal intent.

Goldman Sachs

Patent applications in the United States are classified by technology in a large database, but search results often leave out related technologies. In a project for an investment banking firm in New York City, this team used a topic modeling technique called Latent Dirichlet Allocation (LDA) to analyze the text of all utility patents filed in 2014 and infer their underlying themes. They found that their technique complemented the U.S. Patent and Trademark Office’s classification system to provide a fuller picture of overlapping technologies. The team used Python, Kibana, Elasticsearch and Shiny to explore, validate and visualize their data and results. Students: Gabrielle Agrocostea, Francisco Arceo, Abdus Khan, Justin Law and Tony Paek. The team’s algorithm classified patents into multiple categories, complementing the USPTO's classification system. It picked up some novel categories, including "computer systems," represented by Topic 1, which the team discovered had underlying ties to patents related to medicine/cancer (Topic 4) and hardware patents such as frames, rails and brackets (Topic 10).
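To make the idea of classifying one document into several topics concrete, here is a minimal sketch of LDA fitted by collapsed Gibbs sampling on a handful of made-up token lists. The team's actual tooling and settings are not described here, so the sampler, the hyperparameters `alpha` and `beta`, the toy "abstracts" and the function name `lda_gibbs` are all assumptions for illustration.

```python
import random


def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling.

    Returns (per-document topic mixtures, topic-word counts, vocabulary).
    """
    rng = random.Random(seed)
    vocab = sorted({word for doc in docs for word in doc})
    word_id = {word: i for i, word in enumerate(vocab)}
    vocab_size = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [[0] * vocab_size for _ in range(num_topics)]
    topic_total = [0] * num_topics
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        topics = []
        for word in doc:
            k = rng.randrange(num_topics)
            topics.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[word]] += 1
            topic_total[k] += 1
        assignments.append(topics)

    # Resample each token's topic given all other assignments.
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                w = word_id[word]
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    mixtures = [
        [(doc_topic[d][t] + alpha) / (len(doc) + num_topics * alpha)
         for t in range(num_topics)]
        for d, doc in enumerate(docs)
    ]
    return mixtures, topic_word, vocab


# Hypothetical toy "abstracts"; the last one deliberately mixes themes.
docs = [
    "server network processor memory software".split(),
    "tumor cell therapy antibody cancer".split(),
    "bracket rail frame mounting hinge".split(),
    "software processor cancer therapy imaging".split(),
]
mixtures, topic_word, vocab = lda_gibbs(docs, num_topics=3)
for doc, mixture in zip(docs, mixtures):
    print(" ".join(doc), [round(p, 2) for p in mixture])
```

Each row of `mixtures` is a probability distribution over topics, which is what lets a single patent belong to more than one category. On a corpus this small the learned topics are not meaningful; the sketch only shows the mechanics.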


The project has considered syntactic analysis of natural languages, with a focus on semi-supervised approaches that require very limited amounts of training data. One focus of the project has been on highly efficient methods for the learning of lexical representations from unlabeled data; these representations can then be used in various natural language processing problems. We have derived a new algorithm for word clustering that is significantly more efficient than previous approaches, and has strong theoretical guarantees. In other work, we have investigated methods for part of speech tagging - the problem of assigning the part of speech to each word in the sentence - using minimal amounts of training data. Our results show that a few hundred words of labeled data are sufficient for high accuracy. A final piece of work has focused on efficient dependency parsing of multiple languages, example applications being machine translation and information extraction.


This project discovers patterns in audit logs that capture user access to sensitive data. These patterns help identify whether specific accesses are anomalous and hence subject to further audit.
This project will provide energy-efficient primitives that will make security cheap and effective on a wide variety of computing platforms, such as ubiquitous mobile phones or high-value installations in military and financial sectors.
CleanOS is a new mobile operating system designed to manage sensitive data rigorously and maintain a clean environment at any point in time.

Health Analytics

Social media sites such as Twitter and Facebook, as well as more specialized sites such as Yelp, host massive amounts of content by users about their real-life experiences and opinions. This effort, in collaboration with the New York City Department of Health and Mental Hygiene (NYC DOHMH), focuses on the detection of disease outbreaks in New York City restaurants. The goal of the project is to identify and analyze the unprecedented volumes of user-contributed opinions and comments about restaurants on social media sites, to extract reliable indicators of otherwise-unreported disease outbreaks associated with the restaurants. The NYC DOHMH analyzes these indicators, as they are produced, to decide when additional action is merited. This project is developing non-traditional information extraction technology --over redundant, noisy, and often ungrammatical text-- for a public health task of high importance to society at large.
Cancer is an individual disease, unique in how it develops and behaves in every patient. Systematic characterization of cancer genomes has revealed a staggering complexity and heterogeneity of aberrations among individuals. More recently, it has been appreciated that intra-tumor heterogeneity is of critical importance, with each tumor harboring sub-populations that vary in clinically important phenotypes such as drug sensitivity. We use genomic technologies to track tumor response to drugs and develop computational machine learning algorithms to piece together an understanding of this data deluge, working toward personalized cancer care. Our methods focus on questions such as how to (1) identify the genetic determinants of cancer and drug resistance; (2) model how these aberrations lead tumor networks to go awry, arming the cancer with the ability to grow abnormally, metastasize and evade drugs; (3) understand what part of the tumor network to target by identifying tumor vulnerabilities and potential synergy of drug combinations; and (4) characterize tumor heterogeneity, including drug-resistant and tumor-initiating subpopulations. Treatment that is based not only on understanding which components go wrong, but also on how they go wrong in each individual patient, will improve cancer therapeutics.
Clinicians in the Neuro-ICU may be confronted daily by over 200 time-related variables for each patient; yet we know from cognitive science that people are only able to understand the relatedness of 2 variables without help. We are investigating how to help clinicians make sense of real-time streams of physiological data as well as of their relationships and trends. The objective of this project is to demonstrate that interactive data visualizations designed to transform and consolidate complex multimodal physiological data into integrated interactive displays will reduce clinician cognitive load and will result in reductions in medical error and improvements in patient care, safety, and efficiency. This project is a collaboration between Dr. J. Michael Schmidt in Neurology, Division of Critical Care and Draper Laboratory. It is funded by the DoD Telemedicine & Advanced Technology Research Center (TATRC) and the Dana Foundation.
Physicians treating patients in the clinic, on the floor, or in the emergency room are faced with an overwhelming amount of complex information about their patients, with little time to review it. HARVEST is an interactive patient record summarization system, which aims to support physicians in their information workflow. It extracts content from the patient notes, where key clinical information resides, aggregates and presents information through time. HARVEST is currently deployed at NewYork-Presbyterian hospital. It relies on a distributed platform for processing data as they get pushed into the electronic health record. We are now investigating summarization models of patient records that identify their co-morbidities and their status through time, by modeling all observations in the record, from the notes to laboratory test measurements and other structured information like billing codes. This project is a collaboration between Dr. Noémie Elhadad in Biomedical Informatics, Dr. Chris Wiggins in Applied Physics and Applied Mathematics, and NewYork-Presbyterian hospital.

Data, Media & Society

This project aims to use NLP to analyze large amounts of textual and speech data (and in particular interactive data) to find relations among people, and between people and propositions (such as sentiment or belief), and to identify when such relations change in an unexpected manner.
The enormous growth in the number of official documents - many of them withheld from scholars and journalists even decades later - has raised serious concerns about whether traditional research methods are adequate for ensuring government accountability. But the millions of documents that have been released, often in digital form, also create opportunities to use Natural Language Processing (NLP) and statistical/machine learning to explore the historical record in very new ways.
"The Listening Machine - Sound Source Organization for Multimedia Understanding" is an NSF-funded project at LabROSA concerned with separating and recognizing acoustic sources in complex, real-world mixtures.

Smart Cities

Augmented Reality for Urban Visualization
The objective of this project is to effectively combine the qualities of different sensor types of a dynamic monitoring network to capitalize on the intrinsic redundancies of the measured data to identify the structural model parameters. Currently there is increasing activity in the area of structural health monitoring using newly emerging, dynamic sensor technologies. There is, however, no clear framework to best combine these heterogeneous measurement quantities for health monitoring purposes. In this project, this dual parameter and state estimation problem with different types of sensor measurements is formulated as a nonlinear estimation problem. The challenges that will be addressed in dealing with this nonlinear dual state and parameter estimation problem are: 1) the implementation of the approach to large structural problems with many unmeasured states and parameters to be identified and 2) determining the required sensor configurations and resolution to ensure "observability" such that the measured quantities are, indeed, useful and usable for this nonlinear estimation problem. The theoretical developments and the proposed identification approach will be experimentally validated with the laboratory model of a building structure and also with a leveraged data set from a major long-span bridge collected by the principal investigator. This study is expected to provide a validated approach to maximize the return on the use of heterogeneous sensor networks and an important practical tool to the structural engineering community for better health monitoring, management and maintenance of critical civil infrastructure systems with improved life safety. The PI has an industry/agency outreach plan, and will rapidly introduce the dual state-parameter estimation concepts in a graduate course under development. The project will also provide advanced training to graduate and undergraduate students through their direct involvement in this project.
Discharge of wastewater, sewage and runoff from coastal cities remains the dominant source of coastal zone pollution. The impervious nature of modern cities is only exacerbating this problem by increasing runoff from city surfaces, triggering combined sewer overflow events in cities with single-pipe wastewater conveyance systems and intensifying urban flooding. Many coastal cities, including US cities like Seattle, New York and San Francisco, are turning to urban green infrastructure (GI) to mitigate the city's role in coastal zone pollution. Urban GI, such as green roofs, green streets, advanced street-tree pits, rainwater gardens and bio-swales, introduces vegetation and perviousness back into city landscapes, thereby reducing the volume and pollutant loading of urban runoff. Urban GI, however, also has co-benefits that are equally important to coastal city sustainability. For example, increasing vegetation and perviousness within city boundaries can help cool urban environments, trap harmful air-borne particulates, increase biodiversity and promote public health and well-being. Despite the significance of these co-benefits, most current urban GI programs still focus on achieving volume reduction of storm water through passive detention and retention of rainfall or runoff. Holistic approaches to GI design that consider multiple sustainability goals are rare, and real-time monitoring and active control systems that help ensure individual or networked GI meet performance goals over desired time-scales are lacking. Furthermore, how city inhabitants view, interact with, and value GI is little studied or accounted for in current urban GI programs.
This project will develop and test a new framework for the next generation of urban GI that exploits the multi-functionality of GI for coastal city sustainability, builds a platform for real-time monitoring and control of urban GI networks, and takes account of the role of humans in GI stewardship and long-term functionality. The project will use the Bronx River Sewershed in New York City, where a $20 million investment in GI is planned over the next 5 years, as its living test bed. GI has its roots in several disciplines, and the project brings together expertise from these disciplines, including civil and environmental engineering, environmental science, and plant science/horticulture. In addition, the project integrates expertise from other disciplines needed to elevate GI performance to the next level, including urban planning and design, climate science, data science, environmental microbiology, environmental law and policy, inter-agency coordination, community outreach and citizen science. The specific outcomes of the project will include: (i) new, scientific data on the holistic, environmental performance of different GI interventions in an urban, coastal environment; (ii) new models for the system level performance of networks of GI interventions; (iii) methodologies for projecting GI performance under a changing climate; (iv) a platform for remote monitoring and control of GI; (v) proposals for law and policy changes to enable US coastal cities to introduce GI at scales necessary to meet sustainability goals; and (vi) new understanding of human-GI interactions and their role in the long-term performance and maintenance of urban GI. Engagement with schools in the Bronx River Sewershed and engagement of citizens in the GI performance monitoring are both important components of the project work. The interdisciplinary project team integrates academic expertise with expertise in industry, government and non-profit organizations.
Eco-feedback systems for Building Energy Efficiency
This study focuses on the use of strong motion data recorded during earthquakes and aftershocks to provide a preliminary assessment of the structural integrity and possible damage in bridges. A system identification technique is used to determine dynamical characteristics and high-fidelity first-order linear models of a bridge from low level earthquake excitations. A finite element model is developed and updated using a genetic algorithm optimization scheme to match the frequencies identified and to simulate data from a damaging earthquake for the bridge. Here, two criteria are used to determine the state of the structure. The first criterion uses the error between the data recorded or simulated by the calibrated nonlinear finite element model and the data predicted by the linear model. The second criterion compares relative displacements of the structure with displacement thresholds identified using a pushover analysis. The use of this technique can provide an almost immediate, yet reliable, assessment of the structural health of an instrumented bridge after a seismic event. Copyright © 2011 John Wiley & Sons, Ltd.
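The two criteria described in this study can be sketched as a pair of simple checks. The study does not state the exact error measure or threshold values, so the normalized root-mean-square error, the limit values and the function names below are assumptions chosen only to show the shape of the logic.

```python
import math


def normalized_rms_error(recorded, predicted):
    """RMS error between two response histories, relative to the recorded signal."""
    if not recorded or len(recorded) != len(predicted):
        raise ValueError("signals must be non-empty and of equal length")
    reference = sum(r * r for r in recorded)
    if reference == 0:
        raise ValueError("recorded signal has zero energy")
    error = sum((r - p) ** 2 for r, p in zip(recorded, predicted))
    return math.sqrt(error / reference)


def assess(recorded, predicted, relative_displacements,
           error_limit, displacement_limit):
    """Apply both criteria and report which ones are exceeded.

    error_limit and displacement_limit are hypothetical inputs; in the study
    the displacement thresholds come from a pushover analysis.
    """
    model_error = normalized_rms_error(recorded, predicted)
    peak = max(abs(d) for d in relative_displacements)
    return {
        "model_error": model_error,
        "peak_displacement": peak,
        "error_flag": model_error > error_limit,          # criterion 1
        "displacement_flag": peak > displacement_limit,   # criterion 2
    }


# Hypothetical numbers, for illustration only.
report = assess(
    recorded=[0.0, 1.0, -2.0, 2.0],
    predicted=[0.0, 0.9, -1.5, 1.2],
    relative_displacements=[0.01, -0.05, 0.02],
    error_limit=0.2,
    displacement_limit=0.03,
)
print(report)
```

A large gap between the linear model's prediction and the recorded response suggests the structure no longer behaves linearly, while the displacement check relates the response to physically derived limits.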
The Role of Distributed Infrastructure in Future Cities
In this project we are developing Energy-Harvesting Active Networked Tags (EnHANTs). EnHANTs are small, flexible, and energetically self-reliant devices that can be attached to objects that are traditionally not networked (e.g., books, furniture, walls, doors, toys, keys, clothing, and produce), thereby providing the infrastructure for various novel tracking applications. Examples of these applications include locating misplaced items, continuous monitoring of objects (items in a store, boxes in transit), and determining locations of disaster survivors. Recent advances in ultra-low-power wireless communications, ultra-wideband (UWB) circuit design, and organic electronic harvesting techniques will enable the realization of EnHANTs in the near future. In order for EnHANTs to rely on harvested energy, they have to spend significantly less energy than Bluetooth, Zigbee, and IEEE 802.15.4a devices. Moreover, the harvesting components and the ultra-low-power physical layer have special characteristics whose implications on the higher layers have yet to be studied (e.g., when using ultra-low-power circuits, the energy required to receive a bit is significantly higher than the energy required to transmit a bit). The objective of the project is to design hardware, algorithms, and software to enable the realization of EnHANTs. This interdisciplinary project includes 5 PIs in the departments of Electrical Engineering and Computer Science at Columbia University with expertise in energy-harvesting devices and techniques, ultra-low power integrated circuits, and energy efficient communications and networking protocols.