NIST Big Data Program
NIST Big Data Public Working Group (NBD-PWG)

Use Cases and Requirements -- General + Reference + Gaps
Use Cases V1.0 Submission
(click M0180 to download the full package and M0203 for the full high-level use case descriptions)


Data Sources
General Requirement
1. needs to support reliable real-time, asynchronous, streaming, and batch processing to collect data from centralized, distributed, and cloud data sources, sensors, or instruments
(26: M0078, M0090, M0103, M0127, M0129, M0140, M0141, M0147, M0148, M0157, M0160, M0162, M0165, M0166, M0167, M0172, M0173, M0174, M0176, M0177, M0183, M0184, M0186, M0188, M0191, M0215)
M0078 Gaps:
Processing data requires significant computing power, which poses challenges especially to clinical laboratories as they are starting to perform large-scale sequencing. Long-term storage of clinical sequencing data could be expensive. Analysis methods are quickly evolving. Many parts of the genome are challenging to analyze, and systematic errors are difficult to characterize.

M0090 Gaps:
Semantics (interpretation of multiple reanalysis products); data movement; database(s) with optimal structuring for 4-dimensional data mining.

M0103 Gaps:
Provide more rapid assessment of the identity, location, and condition of shipments; provide detailed analytics and locate problems in the system in real time.

M0127 Gaps:
The data processing pipeline requires human inspection and intervention. Downstream data pipelines for custom users are limited. Cloud architectures for distributing entire data product collections to downstream consumers should be investigated and adopted.

M0129 Gaps:
A big question is how to use cloud computing to enable better use of climate science's earthbound compute and data resources. Cloud computing provides a new tier in the data services stack: a cloud-based layer where agile customization occurs and enterprise-level products are transformed to meet the specialized requirements of applications and consumers. It helps close the gap between the world of traditional high-performance computing, which, at least for now, resides in a finely tuned climate modeling environment at the enterprise level, and our new customers, whose expectations and manner of work are increasingly influenced by the smart mobility megatrend.

M0140 Gaps:
For individualized cohorts, we will effectively be building a data mart for each patient, since the critical properties and indices will be specific to each patient. Given the number of patients, this becomes an impractical approach. Fundamentally, the paradigm changes from relational row-column lookup to semantic graph traversal.
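To make this paradigm shift concrete, here is a minimal, hypothetical Python sketch of per-patient facts stored as a semantic graph and queried by traversal rather than by row-column lookup; all node and edge names are invented for illustration.

```python
# Hypothetical patient knowledge graph: subject -> list of (predicate, object).
from collections import deque

graph = {
    "patient:42": [("has_condition", "cond:sepsis"), ("treated_with", "drug:x")],
    "cond:sepsis": [("indicated_by", "lab:lactate_high")],
    "drug:x": [("contraindicated_with", "cond:renal_failure")],
}

def traverse(start, max_depth=3):
    """Breadth-first traversal collecting every fact reachable from `start`."""
    seen, facts = {start}, []
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth >= max_depth:
            continue
        for predicate, obj in graph.get(node, []):
            facts.append((node, predicate, obj))
            if obj not in seen:
                seen.add(obj)
                queue.append((obj, depth + 1))
    return facts

# Unlike a fixed relational schema, new predicates need no schema migration.
print(traverse("patient:42"))
```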

M0141 Gaps:
Variety, multi-type data: SQL and NoSQL, distributed multi-source data. Visualisation, distributed sensor networks. Data storage and archiving; data exchange and integration; data linkage from the initial observation data to processed and reported/visualised data.

  • Historical unique data
  • Curated (authorized) reference data (e.g., species name lists), algorithms, software code, workflows
  • Processed (secondary) data serving as input for other researchers
  • Provenance (and persistent identification (PID)) control of data, algorithms, and workflows

M0147 Gaps:
Preserve data over long time scales.

M0148 Gaps:
Perform pre-processing and long-term management of large and varied data. Search huge amounts of data. Ensure high relevancy and recall. Data sources may be distributed across different clouds in the future.

M0157 Gaps:
1. Real-time handling of extremely high volumes of data. 2. Data staging to mirror archives. 3. Integrated data access and discovery. 4. Data processing and analysis.

M0160 Gaps:
Dealing with real-time analysis of large volumes of data. Providing a scalable infrastructure to allocate resources, storage space, etc. on demand as data volume increases over time.

M0162 Gaps:
1. Establishing materials data repositories beyond the existing ones that focus on fundamental data. 2. Developing internationally accepted data-recording standards that can be used by a very diverse materials community, including developers of materials test standards (such as ASTM and ISO), testing companies, materials producers, and R&D labs. 3. Tools and procedures to help organizations wishing to deposit proprietary materials data in repositories to mask proprietary information while maintaining the usability of the data. 4. Multi-variable materials data visualization tools, in which the number of variables can be quite high.
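As one illustration of the multi-variable visualization need in item 4, the sketch below draws a parallel-coordinates plot with pandas and matplotlib; the columns and values are hypothetical materials properties.

```python
# Minimal sketch of multi-variable visualization using parallel coordinates.
# Column names and data are hypothetical; requires pandas and matplotlib.
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates

df = pd.DataFrame({
    "alloy": ["A", "A", "B", "B"],
    "yield_strength": [250, 260, 310, 305],
    "hardness": [95, 97, 120, 118],
    "density": [2.7, 2.7, 7.8, 7.8],
})

# Each line is one sample; each vertical axis is one variable.
parallel_coordinates(df, "alloy")
plt.show()
```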

M0165 Gaps:
Search of the “deep web” (information behind query front ends). Ranking of responses sensitive to intrinsic value (as in PageRank) as well as advertising value. Links to user profiles and social network data.
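Since this gap cites PageRank-style ranking, a minimal power-iteration sketch may be useful; the link graph and damping factor below are purely illustrative.

```python
# Minimal PageRank sketch via power iteration over a tiny link graph.
import numpy as np

# Hypothetical web graph: links[i] = pages that page i links to.
links = {0: [1, 2], 1: [2], 2: [0], 3: [2]}
n, d = 4, 0.85  # number of pages, damping factor

# Column-stochastic transition matrix.
M = np.zeros((n, n))
for src, outs in links.items():
    for dst in outs:
        M[dst, src] = 1.0 / len(outs)

rank = np.full(n, 1.0 / n)
for _ in range(100):
    rank = (1 - d) / n + d * M @ rank  # random jump + link-following

print(rank / rank.sum())
```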

M0166 Gaps:
The translation of scientific results into new knowledge, solutions, policies, and decisions is foundational to the science mission associated with LHC data analysis and HEP in general. However, while advances in experimental and computational technologies have led to an exponential growth in the volume, velocity, and variety of data available for scientific discovery, advances in technologies to convert this data into actionable knowledge have fallen far short of what the HEP community needs to deliver timely and immediately impactful outcomes. Acceleration of the scientific knowledge discovery process is essential if DOE scientists are to continue making major contributions in HEP. Today’s worldwide analysis engine, serving several thousand scientists, will have to be commensurately extended in the cleverness of its algorithms, the automation of its processes, and the reach (discovery) of its computing to enable scientific understanding of the detailed nature of the Higgs boson. For example, the approximately forty different analysis methods used to investigate the detailed characteristics of the Higgs boson (many using machine learning techniques) must be combined in a mathematically rigorous fashion to arrive at an agreed-upon publishable result.

Specific challenges:
Federated semantic discovery: interfaces, protocols, and environments that support access to, use of, and interoperation across federated sets of resources governed and managed by a mix of different policies and controls, and that interoperate across streaming and “at rest” data sources. These include: models, algorithms, libraries, and reference implementations for a distributed non-hierarchical discovery service; semantics, methods, and interfaces for life-cycle management (subscription, capture, provenance, assessment, validation, rejection) of heterogeneous sets of distributed tools, services, and resources; a global environment that is robust in the face of failures and outages; and flexible high-performance data stores (going beyond schema-driven ones) that scale and are friendly to interactive analytics.

Resource description and understanding:
Distributed methods and implementations that allow resources (people, software, computing, including data) to publish varying state and function for use by diverse clients. Mechanisms to handle arbitrary entity types in a uniform and common framework, including complex types such as heterogeneous data, incomplete and evolving information, and rapidly changing availability of computing, storage, and other computational resources. Abstract data streaming and file-based data movement over the WAN/LAN and on exascale architectures to allow for real-time, collaborative decision making for scientific processes.
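A minimal sketch of the idea that resources publish varying state and function for diverse clients, assuming a simple JSON self-description; every field name here is hypothetical.

```python
# Minimal sketch: a resource self-describes its state and function so
# diverse clients can discover it. All field names are hypothetical.
import json
import time
from dataclasses import dataclass, asdict, field

@dataclass
class ResourceDescription:
    resource_id: str
    kind: str                      # e.g., "data", "software", "computing"
    functions: list[str]           # operations the resource offers
    state: dict = field(default_factory=dict)   # varying availability info
    published_at: float = field(default_factory=time.time)

    def to_json(self) -> str:
        return json.dumps(asdict(self))

desc = ResourceDescription(
    resource_id="site-a/storage-03",
    kind="data",
    functions=["stream", "bulk-read"],
    state={"free_tb": 120, "healthy": True},
)
print(desc.to_json())
```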

M0167 Gaps:
Data volumes are increasing. Shipping disks is clumsy, but there is no other obvious solution. Image processing algorithms remain a very active area of research.

M0172 Gaps:
Computation of the simulation is both compute intensive and data intensive. Moreover, due to the unstructured and irregular nature of graph processing, the problem is not easily decomposable; it is therefore also bandwidth intensive. Hence, a supercomputer is more suitable than cloud-type clusters.

M0173 Gaps:
How can heterogeneous features of hundreds of millions or billions of individuals, and models of cultural variations across countries, be taken into account and assigned to individual agents? How can these large models be validated? Different types of models are needed (e.g., multiple contagions: disease, emotions, behaviors), along with modeling of the different urban infrastructure systems in which humans act. With multiple replicates required to assess stochasticity, large amounts of output data are produced, creating significant storage requirements.
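To illustrate one of the model types mentioned (a contagion process), here is a minimal stochastic SIR simulation over a tiny contact network; the network, rates, and horizon are arbitrary, and a real study would run many replicates to assess stochasticity, which is exactly what drives the storage requirements noted above.

```python
# Minimal stochastic SIR contagion over a hypothetical contact network.
import random

contacts = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
state = {n: "S" for n in contacts}   # S=susceptible, I=infected, R=recovered
state[0] = "I"                       # seed one infection
beta, gamma = 0.5, 0.2               # infection and recovery probabilities

random.seed(7)
for step in range(10):
    new_state = dict(state)          # synchronous update
    for node, s in state.items():
        if s == "I":
            if random.random() < gamma:
                new_state[node] = "R"
            for nbr in contacts[node]:
                if state[nbr] == "S" and random.random() < beta:
                    new_state[nbr] = "I"
    state = new_state

print(state)
```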

M0174 Gaps:
Data is abundant in many areas of medicine. The key issue is that there can be too much data (images, genetic sequences, etc.), which can complicate analysis. The real challenge lies in aligning and merging data from multiple sources into a form useful for combined analysis. Another issue is that sometimes a large amount of data is available about a single subject, but the number of subjects is not very high (i.e., data imbalance). This can result in learning algorithms picking up random correlations between the multiple data types as important features in analysis. Hence, robust learning methods that can faithfully model the data are of paramount importance. Another aspect of data imbalance is the occurrence of positive examples (i.e., cases): the incidence of certain diseases may be rare, making the ratio of cases to controls extremely skewed and making it possible for learning algorithms to model noise instead of examples.
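One standard mitigation for the skewed case/control ratios described here is class weighting, so that the learner does not simply model the majority class; this sketch assumes scikit-learn is available and uses synthetic data.

```python
# Minimal sketch of handling skewed case/control ratios with class weighting.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# 1000 controls, 20 cases: a heavily imbalanced synthetic dataset.
X = np.vstack([rng.normal(0, 1, (1000, 5)), rng.normal(1, 1, (20, 5))])
y = np.array([0] * 1000 + [1] * 20)

# class_weight="balanced" reweights errors inversely to class frequency,
# discouraging the model from simply predicting the majority class.
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)
print(clf.score(X, y))
```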

M0176 Gaps:
HTC at scale for simulation science. Flexible data methods at scale for messy data. Machine learning and knowledge systems that integrate data from publications, experiments, and simulations to advance goal-driven thinking in materials design.

M0177 Gaps:
Overcoming the systematic errors and bias in large-scale, heterogeneous clinical data to support decision-making in research, patient care, and administrative use cases requires complex multistage processing and analytics that demand substantial computing power. Further, the optimal techniques for accurately and effectively deriving knowledge from observational clinical data are nascent.

M0183 Gaps:
Translation across diverse and large datasets that cross domains and scales.

M0184 Gaps:
Translation across diverse datasets that cross domains and scales.

M0186 Gaps:
The rapidly growing size of datasets makes scientific analysis a challenge. The need to write data from simulations is outpacing supercomputers’ ability to accommodate this need.

M0188 Gaps:
The best available tool for dealing with the heterogeneity of biological data is still the relational database management system (RDBMS). Unfortunately, it does not scale to the current volume of data. NoSQL solutions aim to provide an alternative, but they do not always lend themselves to real-time interactive use or rapid, parallel bulk loading, and they sometimes have robustness issues. Our approach is currently ad hoc and custom, relying mainly on the Linux cluster and the file system to supplement the Oracle RDBMS. The custom solution often relies on knowledge of the peculiarities of the data, allowing us to devise horizontal partitioning schemes as well as inversion of data organization when applicable.
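A minimal sketch of the horizontal partitioning idea mentioned above: records are routed to shards by hashing a domain key. The shard count and key format are hypothetical.

```python
# Minimal sketch of horizontal partitioning: rows are routed to shards by a
# stable hash of a domain-specific key.
import hashlib

NUM_SHARDS = 8

def shard_for(key: str) -> int:
    """Stable shard assignment from a record key (e.g., a genome ID)."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

rows = [("genome:ecoli_k12", "..."), ("genome:bsubtilis", "...")]
shards: dict[int, list] = {}
for key, payload in rows:
    shards.setdefault(shard_for(key), []).append((key, payload))

for shard, contents in sorted(shards.items()):
    print(shard, [k for k, _ in contents])
```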

M0191 Gaps:
HTC at scale for simulation science. Flexible data methods at scale for messy data. Machine learning and knowledge systems that drive pixel-based data toward biological objects and models.

M0215 Gaps:
1. Big (or even moderate-size) data over tactical networks. 2. Data currently exists in disparate silos, which must be accessible through a semantically integrated data space. 3. Most critical data is either unstructured or imagery/video, which requires significant processing to extract entities and information.


2. needs to support slow, bursty, and high-throughput data transmission between data sources and computing clusters.
(22: M0078, M0148, M0155, M0157, M0162, M0165, M0167, M0170, M0171, M0172, M0174, M0176, M0177, M0184, M0185, M0186, M0188, M0191, M0209, M0210, M0219, M0223)
M0078 Gaps:
Processing data requires significant computing power, which poses challenges especially to clinical laboratories as they are starting to perform large-scale sequencing. Long-term storage of clinical sequencing data could be expensive. Analysis methods are quickly evolving. Many parts of the genome are challenging to analyze, and systematic errors are difficult to characterize.

M0148 Gaps:
Perform pre-processing and long-term management of large and varied data. Search huge amounts of data. Ensure high relevancy and recall. Data sources may be distributed across different clouds in the future.

M0155 Gaps:
High throughput of data for reduction into higher levels. Discovery of meaningful insights from low-value-density data needs new approaches to deep, complex analysis (e.g., machine learning, statistical modelling, graph algorithms) that go beyond traditional approaches in space physics.

M0157 Gaps:
1. Real-time handling of extremely high volumes of data. 2. Data staging to mirror archives. 3. Integrated data access and discovery. 4. Data processing and analysis.

M0162 Gaps:
1. Establishing materials data repositories beyond the existing ones that focus on fundamental data. 2. Developing internationally accepted data-recording standards that can be used by a very diverse materials community, including developers of materials test standards (such as ASTM and ISO), testing companies, materials producers, and R&D labs. 3. Tools and procedures to help organizations wishing to deposit proprietary materials data in repositories to mask proprietary information while maintaining the usability of the data. 4. Multi-variable materials data visualization tools, in which the number of variables can be quite high.

M0165 Gaps:
Search of the “deep web” (information behind query front ends). Ranking of responses sensitive to intrinsic value (as in PageRank) as well as advertising value. Links to user profiles and social network data.

M0167 Gaps:
Data volumes are increasing. Shipping disks is clumsy, but there is no other obvious solution. Image processing algorithms remain a very active area of research.

M0170 Gaps:
Development of machine learning tools for data exploration, in particular for automated, real-time classification of transient events, given the data sparsity and heterogeneity. Effective visualization of hyper-dimensional parameter spaces remains a major challenge for all of us.

M0171 Gaps:
Analytics needs continued monitoring and improvement.

M0172 Gaps:
Computation of the simulation is both compute intensive and data intensive. Moreover, due to the unstructured and irregular nature of graph processing, the problem is not easily decomposable; it is therefore also bandwidth intensive. Hence, a supercomputer is more suitable than cloud-type clusters.

M0174 Gaps:
Data is abundant in many areas of medicine. The key issue is that there can be too much data (images, genetic sequences, etc.), which can complicate analysis. The real challenge lies in aligning and merging data from multiple sources into a form useful for combined analysis. Another issue is that sometimes a large amount of data is available about a single subject, but the number of subjects is not very high (i.e., data imbalance). This can result in learning algorithms picking up random correlations between the multiple data types as important features in analysis. Hence, robust learning methods that can faithfully model the data are of paramount importance. Another aspect of data imbalance is the occurrence of positive examples (i.e., cases): the incidence of certain diseases may be rare, making the ratio of cases to controls extremely skewed and making it possible for learning algorithms to model noise instead of examples.

M0176 Gaps:
HTC at scale for simulation science. Flexible data methods at scale for messy data. Machine learning and knowledge systems that integrate data from publications, experiments, and simulations to advance goal-driven thinking in materials design.

M0177 Gaps:
Overcoming the systematic errors and bias in large-scale, heterogeneous clinical data to support decision-making in research, patient care, and administrative use cases requires complex multistage processing and analytics that demand substantial computing power. Further, the optimal techniques for accurately and effectively deriving knowledge from observational clinical data are nascent.

M0184 Gaps:
Translation across diverse datasets that cross domains and scales.

M0185 Gaps:
Storage, sharing, and analysis of 10s of PBs of observational and simulated data.

M0186 Gaps:
The rapidly growing size of datasets makes scientific analysis a challenge. The need to write data from simulations is outpacing supercomputers’ ability to accommodate this need.

M0188 Gaps:
The best available tool for dealing with the heterogeneity of biological data is still the relational database management system (RDBMS). Unfortunately, it does not scale to the current volume of data. NoSQL solutions aim to provide an alternative, but they do not always lend themselves to real-time interactive use or rapid, parallel bulk loading, and they sometimes have robustness issues. Our approach is currently ad hoc and custom, relying mainly on the Linux cluster and the file system to supplement the Oracle RDBMS. The custom solution often relies on knowledge of the peculiarities of the data, allowing us to devise horizontal partitioning schemes as well as inversion of data organization when applicable.

M0191 Gaps:
HTC at scale for simulation science. Flexible data methods at scale for messy data. Machine learning and knowledge systems that drive pixel-based data toward biological objects and models.

M0209 Gaps:
New statistical techniques for understanding the limitations of simulation data would be beneficial. Often there is not enough computing time to generate all the simulations one wants, so there is a reliance on emulators to bridge the gaps. Techniques are needed for handling Cholesky decomposition for thousands of simulations with matrices of order 1M on a side.
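For reference, the factor-once/solve-many structure of a Cholesky-based solve looks like the following; the covariance matrix here is small and synthetic, whereas the gap concerns matrices of order 1M on a side, where the same pattern requires distributed implementations.

```python
# Minimal sketch of a Cholesky-based solve of the kind used when fitting
# emulators to simulation outputs; the matrix is small and synthetic.
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(500, 500))
K = A @ A.T + 500 * np.eye(500)    # symmetric positive definite covariance
b = rng.normal(size=500)

# Factor once (O(n^3)), then reuse the triangular factor for cheap solves.
L = np.linalg.cholesky(K)          # K = L @ L.T
y = np.linalg.solve(L, b)          # forward substitution
x = np.linalg.solve(L.T, y)        # back substitution

print(np.allclose(K @ x, b))
```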

M0210 Gaps:
Data movement and bookkeeping (file and event level meta-data).

M0219 Gaps:
Improving recommendation systems that reduce costs and improve quality while providing confidentiality safeguards that are reliable and publicly auditable.

M0223 Gaps:
Scalable real-time analytics over large data streams. Low-latency analytics for operational needs. Federated analytics at utility and microgrid levels. Robust time-series analytics over millions of customers' consumption data. Customer behavior modeling and targeted curtailment requests.
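As a toy illustration of low-latency analytics over a consumption stream, the following rolling z-score detector flags anomalous meter readings; the window, threshold, and data are hypothetical.

```python
# Minimal sketch: rolling z-score anomaly detection over a meter stream.
from collections import deque
import math

def rolling_zscore_alerts(stream, window=24, threshold=3.0):
    buf = deque(maxlen=window)
    for t, value in enumerate(stream):
        if len(buf) == window:
            mean = sum(buf) / window
            var = sum((v - mean) ** 2 for v in buf) / window
            std = math.sqrt(var) or 1.0   # guard against zero variance
            if abs(value - mean) / std > threshold:
                yield t, value            # emit alert as soon as seen
        buf.append(value)

readings = [1.0] * 48 + [9.0] + [1.0] * 10   # synthetic hourly kWh readings
print(list(rolling_zscore_alerts(readings)))
```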


3. needs to support diversified data content, ranging from structured and unstructured text, documents, graphs, web, geospatial, compressed, timed, spatial, multimedia, and simulation data to instrument data.
(28: M0089, M0090, M0140, M0141, M0147, M0148, M0155, M0158, M0160, M0161, M0162, M0165, M0166, M0167, M0171, M0172, M0173, M0177, M0183, M0184, M0186, M0188, M0190, M0191, M0213, M0214, M0215, M0223)
M0089 Gaps:
Extremely large size; multi-dimensional; disease-specific analytics; correlation with other data types (clinical data, -omic data).

M0090 Gaps:
Semantics (interpretation of multiple reanalysis products); data movement; database(s) with optimal structuring for 4-dimensional data mining.

M0140 Gaps:
For individualized cohorts, we will effectively be building a data mart for each patient, since the critical properties and indices will be specific to each patient. Given the number of patients, this becomes an impractical approach. Fundamentally, the paradigm changes from relational row-column lookup to semantic graph traversal.

M0141 Gaps:
Variety, multi-type data: SQL and NoSQL, distributed multi-source data. Visualisation, distributed sensor networks. Data storage and archiving; data exchange and integration; data linkage from the initial observation data to processed and reported/visualised data.

  • Historical unique data
  • Curated (authorized) reference data (e.g., species name lists), algorithms, software code, workflows
  • Processed (secondary) data serving as input for other researchers
  • Provenance (and persistent identification (PID)) control of data, algorithms, and workflows

M0147 Gaps:
Preserve data over long time scales.

M0148 Gaps:
Perform pre-processing and long-term management of large and varied data. Search huge amounts of data. Ensure high relevancy and recall. Data sources may be distributed across different clouds in the future.

M0155 Gaps:
High throughput of data for reduction into higher levels. Discovery of meaningful insights from low-value-density data needs new approaches to deep, complex analysis (e.g., machine learning, statistical modelling, graph algorithms) that go beyond traditional approaches in space physics.

M0158 Gaps:
Parallel algorithms are necessary to analyze massive networks. Unlike many kinds of structured data, network data is difficult to partition. The main difficulty in partitioning a network is that different algorithms require different partitioning schemes for efficient operation. Moreover, most network measures are global in nature and require either (i) extensive duplication of data across the partitions or (ii) very large communication overhead resulting from the required movement of data. These issues become significant challenges for big networks. Computing dynamics over networks is harder still, since the network structure often interacts with the dynamical process being studied. CINET enables a large class of operations across a wide variety of graphs, in terms of both structure and size. Unlike other compute- and data-intensive systems, such as parallel databases or CFD, performance on graph computation is sensitive to the underlying architecture. Hence, a unique challenge in CINET is managing the mapping between a workload (graph type + operation) and a machine whose architecture and runtime are conducive to it.
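The communication overhead described here is driven by a partition's edge cut; this minimal sketch scores a naive random two-way split of a toy graph, the quantity a real partitioner would try to minimize.

```python
# Minimal sketch: score a naive graph partition by its edge cut, i.e. the
# number of edges whose endpoints land in different partitions.
import random

edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (4, 5), (3, 5)]
nodes = {n for e in edges for n in e}

random.seed(0)
part = {n: random.randint(0, 1) for n in nodes}  # naive 2-way partition

cut = sum(1 for u, v in edges if part[u] != part[v])
print(f"edge cut: {cut} of {len(edges)} edges cross partitions")
```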

M0160 Gaps:
Dealing with real-time analysis of large volumes of data. Providing a scalable infrastructure to allocate resources, storage space, etc. on demand as data volume increases over time.

M0161 Gaps:
The database contains ~400M documents (roughly 80M unique) and receives 5-700k new uploads on a weekday. Thus a major challenge is clustering matching documents together in a computationally efficient (scalable and parallelized) way when they are uploaded from different sources and have been slightly modified via third-party annotation tools or publisher watermarks and cover pages.
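One common approach to grouping slightly modified copies is shingle-based similarity: documents are reduced to sets of hashed word 3-shingles and compared by Jaccard similarity. The sketch below is sequential; at the stated scale, an index such as MinHash/LSH would be needed.

```python
# Minimal sketch of near-duplicate detection via hashed word shingles.
def shingles(text: str, k: int = 3) -> set[int]:
    words = text.lower().split()
    return {hash(" ".join(words[i:i + k])) for i in range(len(words) - k + 1)}

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

doc1 = "deep learning for scholarly document clustering at scale"
doc2 = "deep learning for scholarly document clustering at web scale"
doc3 = "an unrelated report on power grid telemetry"

s1, s2, s3 = shingles(doc1), shingles(doc2), shingles(doc3)
print(jaccard(s1, s2), jaccard(s1, s3))  # high vs. near-zero similarity
```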

M0162 Gaps:
1. Establishing materials data repositories beyond the existing ones that focus on fundamental data. 2. Developing internationally accepted data-recording standards that can be used by a very diverse materials community, including developers of materials test standards (such as ASTM and ISO), testing companies, materials producers, and R&D labs. 3. Tools and procedures to help organizations wishing to deposit proprietary materials data in repositories to mask proprietary information while maintaining the usability of the data. 4. Multi-variable materials data visualization tools, in which the number of variables can be quite high.

M0165 Gaps:
Search of the “deep web” (information behind query front ends). Ranking of responses sensitive to intrinsic value (as in PageRank) as well as advertising value. Links to user profiles and social network data.

M0166 Gaps:
The translation of scientific results into new knowledge, solutions, policies, and decisions is foundational to the science mission associated with LHC data analysis and HEP in general. However, while advances in experimental and computational technologies have led to an exponential growth in the volume, velocity, and variety of data available for scientific discovery, advances in technologies to convert this data into actionable knowledge have fallen far short of what the HEP community needs to deliver timely and immediately impactful outcomes. Acceleration of the scientific knowledge discovery process is essential if DOE scientists are to continue making major contributions in HEP. Today’s worldwide analysis engine, serving several thousand scientists, will have to be commensurately extended in the cleverness of its algorithms, the automation of its processes, and the reach (discovery) of its computing to enable scientific understanding of the detailed nature of the Higgs boson. For example, the approximately forty different analysis methods used to investigate the detailed characteristics of the Higgs boson (many using machine learning techniques) must be combined in a mathematically rigorous fashion to arrive at an agreed-upon publishable result.

Specific challenges:
Federated semantic discovery: interfaces, protocols, and environments that support access to, use of, and interoperation across federated sets of resources governed and managed by a mix of different policies and controls, and that interoperate across streaming and “at rest” data sources. These include: models, algorithms, libraries, and reference implementations for a distributed non-hierarchical discovery service; semantics, methods, and interfaces for life-cycle management (subscription, capture, provenance, assessment, validation, rejection) of heterogeneous sets of distributed tools, services, and resources; a global environment that is robust in the face of failures and outages; and flexible high-performance data stores (going beyond schema-driven ones) that scale and are friendly to interactive analytics.

Resource description and understanding:
Distributed methods and implementations that allow resources (people, software, computing, including data) to publish varying state and function for use by diverse clients. Mechanisms to handle arbitrary entity types in a uniform and common framework, including complex types such as heterogeneous data, incomplete and evolving information, and rapidly changing availability of computing, storage, and other computational resources. Abstract data streaming and file-based data movement over the WAN/LAN and on exascale architectures to allow for real-time, collaborative decision making for scientific processes.

M0167 Gaps:
Data volumes are increasing. Shipping disks is clumsy, but there is no other obvious solution. Image processing algorithms remain a very active area of research.

M0171 Gaps:
Analytics needs continued monitoring and improvement.

M0172 Gaps:
Computation of the simulation is both compute intensive and data intensive. Moreover, due to the unstructured and irregular nature of graph processing, the problem is not easily decomposable; it is therefore also bandwidth intensive. Hence, a supercomputer is more suitable than cloud-type clusters.

M0173 Gaps:
How can heterogeneous features of hundreds of millions or billions of individuals, and models of cultural variations across countries, be taken into account and assigned to individual agents? How can these large models be validated? Different types of models are needed (e.g., multiple contagions: disease, emotions, behaviors), along with modeling of the different urban infrastructure systems in which humans act. With multiple replicates required to assess stochasticity, large amounts of output data are produced, creating significant storage requirements.

M0177 Gaps:
Overcoming the systematic errors and bias in large-scale, heterogeneous clinical data to support decision-making in research, patient care, and administrative use cases requires complex multistage processing and analytics that demand substantial computing power. Further, the optimal techniques for accurately and effectively deriving knowledge from observational clinical data are nascent.

M0183 Gaps:
Translation across diverse and large datasets that cross domains and scales.

M0184 Gaps:
Translation across diverse datasets that cross domains and scales.

M0186 Gaps:
The rapidly growing size of datasets makes scientific analysis a challenge. The need to write data from simulations is outpacing supercomputers’ ability to accommodate this need.

M0188 Gaps:
The best available tool for dealing with the heterogeneity of biological data is still the relational database management system (RDBMS). Unfortunately, it does not scale to the current volume of data. NoSQL solutions aim to provide an alternative, but they do not always lend themselves to real-time interactive use or rapid, parallel bulk loading, and they sometimes have robustness issues. Our approach is currently ad hoc and custom, relying mainly on the Linux cluster and the file system to supplement the Oracle RDBMS. The custom solution often relies on knowledge of the peculiarities of the data, allowing us to devise horizontal partitioning schemes as well as inversion of data organization when applicable.

M0190 Gaps:
Scaling ground-truthing to larger data, intrinsic and annotation uncertainty measurement, performance measurement for incompletely annotated data, measuring analytic performance for heterogeneous data and analytic flows involving users.

M0191 Gaps:
HTC at scale for simulation science. Flexible data methods at scale for messy data. Machine learning and knowledge systems that drive pixel-based data toward biological objects and models.

M0213 Gaps:
Indexing, retrieval, and distributed analysis. Visualization generation and transmission.

M0214 Gaps:
Processing the volume of data in near-real time (NRT) to support alerting and situational awareness.

M0215 Gaps:
1. Big (or even moderate-size) data over tactical networks. 2. Data currently exists in disparate silos, which must be accessible through a semantically integrated data space. 3. Most critical data is either unstructured or imagery/video, which requires significant processing to extract entities and information.

M0223 Gaps:
Scalable real-time analytics over large data streams. Low-latency analytics for operational needs. Federated analytics at utility and microgrid levels. Robust time-series analytics over millions of customers' consumption data. Customer behavior modeling and targeted curtailment requests.


Transformation
General Requirement
1. needs to support diversified compute-intensive analytic processing and machine learning techniques
(36: M0078, M0089, M0103, M0127, M0129, M0140, M0141, M0148, M0155, M0157, M0158, M0160, M0161, M0164, M0166, M0167, M0170, M0171, M0172, M0173, M0174, M0176, M0177, M0182, M0185, M0186, M0190, M0191, M0209, M0211, M0213, M0214, M0215, M0219, M0222, M0223)
M0078 Gaps:
Processing data requires significant computing power, which poses challenges especially to clinical laboratories as they are starting to perform large-scale sequencing. Long-term storage of clinical sequencing data could be expensive. Analysis methods are quickly evolving. Many parts of the genome are challenging to analyze, and systematic errors are difficult to characterize.

M0089 Gaps:
Extremely large size; multi-dimensional; disease-specific analytics; correlation with other data types (clinical data, -omic data).

M0103 Gaps:
Provide more rapid assessment of the identity, location, and condition of shipments; provide detailed analytics and locate problems in the system in real time.

M0127 Gaps:
The data processing pipeline requires human inspection and intervention. Downstream data pipelines for custom users are limited. Cloud architectures for distributing entire data product collections to downstream consumers should be investigated and adopted.

M0129 Gaps:
A big question is how to use cloud computing to enable better use of climate science's earthbound compute and data resources. Cloud computing provides a new tier in the data services stack: a cloud-based layer where agile customization occurs and enterprise-level products are transformed to meet the specialized requirements of applications and consumers. It helps close the gap between the world of traditional high-performance computing, which, at least for now, resides in a finely tuned climate modeling environment at the enterprise level, and our new customers, whose expectations and manner of work are increasingly influenced by the smart mobility megatrend.

M0140 Gaps:
For individualized cohorts, we will effectively be building a data mart for each patient, since the critical properties and indices will be specific to each patient. Given the number of patients, this becomes an impractical approach. Fundamentally, the paradigm changes from relational row-column lookup to semantic graph traversal.

M0141 Gaps:
Variety, multi-type data: SQL and NoSQL, distributed multi-source data. Visualisation, distributed sensor networks. Data storage and archiving; data exchange and integration; data linkage from the initial observation data to processed and reported/visualised data.

  • Historical unique data
  • Curated (authorized) reference data (e.g., species name lists), algorithms, software code, workflows
  • Processed (secondary) data serving as input for other researchers
  • Provenance (and persistent identification (PID)) control of data, algorithms, and workflows

M0148 Gaps:
Perform pre-processing and long-term management of large and varied data. Search huge amounts of data. Ensure high relevancy and recall. Data sources may be distributed across different clouds in the future.

M0155 Gaps:
High throughput of data for reduction into higher levels. Discovery of meaningful insights from low-value-density data needs new approaches to deep, complex analysis (e.g., machine learning, statistical modelling, graph algorithms) that go beyond traditional approaches in space physics.

M0157 Gaps:
1. Real-time handling of extremely high volumes of data. 2. Data staging to mirror archives. 3. Integrated data access and discovery. 4. Data processing and analysis.

M0158 Gaps:
Parallel algorithms are necessary to analyze massive networks. Unlike many kinds of structured data, network data is difficult to partition. The main difficulty in partitioning a network is that different algorithms require different partitioning schemes for efficient operation. Moreover, most network measures are global in nature and require either (i) extensive duplication of data across the partitions or (ii) very large communication overhead resulting from the required movement of data. These issues become significant challenges for big networks. Computing dynamics over networks is harder still, since the network structure often interacts with the dynamical process being studied. CINET enables a large class of operations across a wide variety of graphs, in terms of both structure and size. Unlike other compute- and data-intensive systems, such as parallel databases or CFD, performance on graph computation is sensitive to the underlying architecture. Hence, a unique challenge in CINET is managing the mapping between a workload (graph type + operation) and a machine whose architecture and runtime are conducive to it.

M0160 Gaps:
Dealing with real-time analysis of large volumes of data. Providing a scalable infrastructure to allocate resources, storage space, etc. on demand as data volume increases over time.

M0161 Gaps:
The database contains ~400M documents (roughly 80M unique) and receives 5-700k new uploads on a weekday. Thus a major challenge is clustering matching documents together in a computationally efficient (scalable and parallelized) way when they are uploaded from different sources and have been slightly modified via third-party annotation tools or publisher watermarks and cover pages.

M0164 Gaps:
Analytics needs continued monitoring and improvement.

M0166 Gaps:
The translation of scientific results into new knowledge, solutions, policies, and decisions is foundational to the science mission associated with LHC data analysis and HEP in general. However, while advances in experimental and computational technologies have led to an exponential growth in the volume, velocity, and variety of data available for scientific discovery, advances in technologies to convert this data into actionable knowledge have fallen far short of what the HEP community needs to deliver timely and immediately impactful outcomes. Acceleration of the scientific knowledge discovery process is essential if DOE scientists are to continue making major contributions in HEP. Today’s worldwide analysis engine, serving several thousand scientists, will have to be commensurately extended in the cleverness of its algorithms, the automation of its processes, and the reach (discovery) of its computing to enable scientific understanding of the detailed nature of the Higgs boson. For example, the approximately forty different analysis methods used to investigate the detailed characteristics of the Higgs boson (many using machine learning techniques) must be combined in a mathematically rigorous fashion to arrive at an agreed-upon publishable result.

Specific challenges:
Federated semantic discovery: interfaces, protocols, and environments that support access to, use of, and interoperation across federated sets of resources governed and managed by a mix of different policies and controls, and that interoperate across streaming and “at rest” data sources. These include: models, algorithms, libraries, and reference implementations for a distributed non-hierarchical discovery service; semantics, methods, and interfaces for life-cycle management (subscription, capture, provenance, assessment, validation, rejection) of heterogeneous sets of distributed tools, services, and resources; a global environment that is robust in the face of failures and outages; and flexible high-performance data stores (going beyond schema-driven ones) that scale and are friendly to interactive analytics.

Resource description and understanding:
Distributed methods and implementations that allow resources (people, software, computing, including data) to publish varying state and function for use by diverse clients. Mechanisms to handle arbitrary entity types in a uniform and common framework, including complex types such as heterogeneous data, incomplete and evolving information, and rapidly changing availability of computing, storage, and other computational resources. Abstract data streaming and file-based data movement over the WAN/LAN and on exascale architectures to allow for real-time, collaborative decision making for scientific processes.

M0167 Gaps:
Data volumes are increasing. Shipping disks is clumsy, but there is no other obvious solution. Image processing algorithms remain a very active area of research.

M0170 Gaps:
Development of machine learning tools for data exploration, in particular for automated, real-time classification of transient events, given the data sparsity and heterogeneity. Effective visualization of hyper-dimensional parameter spaces remains a major challenge for all of us.

M0171 Gaps:
Analytics needs continued monitoring and improvement.

M0172 Gaps:
Computation of the simulation is both compute intensive and data intensive. Moreover, due to the unstructured and irregular nature of graph processing, the problem is not easily decomposable; it is therefore also bandwidth intensive. Hence, a supercomputer is more suitable than cloud-type clusters.

M0173 Gaps:
How can heterogeneous features of hundreds of millions or billions of individuals, and models of cultural variations across countries, be taken into account and assigned to individual agents? How can these large models be validated? Different types of models are needed (e.g., multiple contagions: disease, emotions, behaviors), along with modeling of the different urban infrastructure systems in which humans act. With multiple replicates required to assess stochasticity, large amounts of output data are produced, creating significant storage requirements.

M0174 Gaps:
Data is abundant in many areas of medicine. The key issue is that there can be too much data (images, genetic sequences, etc.), which can complicate analysis. The real challenge lies in aligning and merging data from multiple sources into a form useful for combined analysis. Another issue is that sometimes a large amount of data is available about a single subject, but the number of subjects is not very high (i.e., data imbalance). This can result in learning algorithms picking up random correlations between the multiple data types as important features in analysis. Hence, robust learning methods that can faithfully model the data are of paramount importance. Another aspect of data imbalance is the occurrence of positive examples (i.e., cases): the incidence of certain diseases may be rare, making the ratio of cases to controls extremely skewed and making it possible for learning algorithms to model noise instead of examples.

M0176 Gaps:
HTC at scale for simulation science. Flexible data methods at scale for messy data. Machine learning and knowledge systems that integrate data from publications, experiments, and simulations to advance goal-driven thinking in materials design.

M0177 Gaps:
Overcoming the systematic errors and bias in large-scale, heterogeneous clinical data to support decision-making in research, patient care, and administrative use cases requires complex multistage processing and analytics that demand substantial computing power. Further, the optimal techniques for accurately and effectively deriving knowledge from observational clinical data are nascent.

M0182 Gaps:
None.

M0185 Gaps:
Storage, sharing, and analysis of 10s of PBs of observational and simulated data.

M0186 Gaps:
The rapidly growing size of datasets makes scientific analysis a challenge. The need to write data from simulations is outpacing supercomputers’ ability to accommodate this need.

M0190 Gaps:
Scaling ground-truthing to larger data, intrinsic and annotation uncertainty measurement, performance measurement for incompletely annotated data, measuring analytic performance for heterogeneous data and analytic flows involving users.

M0191 Gaps:
HTC at scale for simulation science. Flexible data methods at scale for messy data. Machine learning and knowledge systems that drive pixel-based data toward biological objects and models.

M0209 Gaps:
New statistical techniques for understanding the limitations of simulation data would be beneficial. Often there is not enough computing time to generate all the simulations one wants, so there is a reliance on emulators to bridge the gaps. Techniques are needed for handling Cholesky decomposition for thousands of simulations with matrices of order 1M on a side.

M0211 Gaps:
Data management (metadata, provenance information, data identification with PIDs), data curation, and digitising existing audio-video, photo, and document archives.

M0213 Gaps:
Indexing, retrieval, and distributed analysis. Visualization generation and transmission.

M0214 Gaps:
Processing the volume of data in near-real time (NRT) to support alerting and situational awareness.

M0215 Gaps:
1. Big (or even moderate-size) data over tactical networks. 2. Data currently exists in disparate silos, which must be accessible through a semantically integrated data space. 3. Most critical data is either unstructured or imagery/video, which requires significant processing to extract entities and information.

M0219 Gaps:
Improving recommendation systems that reduce costs and improve quality while providing confidentiality safeguards that are reliable and publicly auditable.

M0222 Gaps:
Improving analytic and modeling systems that provide reliable and robust statistical estimates using data from multiple sources, that are scientifically transparent, and that provide confidentiality safeguards that are reliable and publicly auditable.

M0223 Gaps:
Scalable real-time analytics over large data streams. Low-latency analytics for operational needs. Federated analytics at utility and microgrid levels. Robust time-series analytics over millions of customers' consumption data. Customer behavior modeling and targeted curtailment requests.


2. needs to support batch and real-time analytic processing
(7: M0090, M0103, M0141, M0155, M0164, M0165, M0188)
M0090 Gaps:
Semantics (interpretation of multiple reanalysis products); data movement; database(s) with optimal structuring for 4-dimensional data mining.

M0103 Gaps:
Provide more rapid assessment of the identity, location, and condition of shipments; provide detailed analytics and locate problems in the system in real time.

M0141 Gaps:
Variety, multi-type data: SQL and NoSQL, distributed multi-source data. Visualisation, distributed sensor networks. Data storage and archiving; data exchange and integration; data linkage from the initial observation data to processed and reported/visualised data.

  • Historical unique data
  • Curated (authorized) reference data (e.g., species name lists), algorithms, software code, workflows
  • Processed (secondary) data serving as input for other researchers
  • Provenance (and persistent identification (PID)) control of data, algorithms, and workflows

M0155 Gaps:
High throughput of data for reduction into higher levels. Discovery of meaningful insights from low-value-density data needs new approaches to deep, complex analysis (e.g., machine learning, statistical modelling, graph algorithms) that go beyond traditional approaches in space physics.

M0164 Gaps:
Analytics needs continued monitoring and improvement.

M0165 Gaps:
Search of the “deep web” (information behind query front ends). Ranking of responses sensitive to intrinsic value (as in PageRank) as well as advertising value. Links to user profiles and social network data.

M0188 Gaps:
The best available tool for dealing with the heterogeneity of biological data is still the relational database management system (RDBMS). Unfortunately, it does not scale to the current volume of data. NoSQL solutions aim to provide an alternative, but they do not always lend themselves to real-time interactive use or rapid, parallel bulk loading, and they sometimes have robustness issues. Our approach is currently ad hoc and custom, relying mainly on the Linux cluster and the file system to supplement the Oracle RDBMS. The custom solution often relies on knowledge of the peculiarities of the data, allowing us to devise horizontal partitioning schemes as well as inversion of data organization when applicable.


3. needs to support processing of large and diversified data content and modeling
(14: M0078, M0089, M0127, M0140, M0158, M0162, M0165, M0166, M0167, M0171, M0172, M0173, M0176, M0213)
M0078 Gaps:
Processing data requires significant computing power, which poses challenges especially to clinical laboratories as they are starting to perform large-scale sequencing. Long-term storage of clinical sequencing data could be expensive. Analysis methods are quickly evolving. Many parts of the genome are challenging to analyze, and systematic errors are difficult to characterize.

M0089 Gaps:
Extremely large size; multi-dimensional; disease-specific analytics; correlation with other data types (clinical data, -omic data).

M0127 Gaps:
The data processing pipeline requires human inspection and intervention. Downstream data pipelines for custom users are limited. Cloud architectures for distributing entire data product collections to downstream consumers should be investigated and adopted.

M0140 Gaps:
For individualized cohorts, we will effectively be building a data mart for each patient, since the critical properties and indices will be specific to each patient. Given the number of patients, this becomes an impractical approach. Fundamentally, the paradigm changes from relational row-column lookup to semantic graph traversal.

M0158 Gaps:
Parallel algorithms are necessary to analyze massive networks. Unlike many kinds of structured data, network data is difficult to partition. The main difficulty in partitioning a network is that different algorithms require different partitioning schemes for efficient operation. Moreover, most network measures are global in nature and require either (i) extensive duplication of data across the partitions or (ii) very large communication overhead resulting from the required movement of data. These issues become significant challenges for big networks. Computing dynamics over networks is harder still, since the network structure often interacts with the dynamical process being studied. CINET enables a large class of operations across a wide variety of graphs, in terms of both structure and size. Unlike other compute- and data-intensive systems, such as parallel databases or CFD, performance on graph computation is sensitive to the underlying architecture. Hence, a unique challenge in CINET is managing the mapping between a workload (graph type + operation) and a machine whose architecture and runtime are conducive to it.

M0162 Gaps:
1. Establishing materials data repositories beyond the existing ones that focus on fundamental data. 2. Developing internationally accepted data-recording standards that can be used by a very diverse materials community, including developers of materials test standards (such as ASTM and ISO), testing companies, materials producers, and R&D labs. 3. Tools and procedures to help organizations wishing to deposit proprietary materials data in repositories to mask proprietary information while maintaining the usability of the data. 4. Multi-variable materials data visualization tools, in which the number of variables can be quite high.

M0165 Gaps:
Search of the “deep web” (information behind query front ends). Ranking of responses sensitive to intrinsic value (as in PageRank) as well as advertising value. Links to user profiles and social network data.

M0166 Gaps:
The translation of scientific results into new knowledge, solutions, policies, and decisions is foundational to the science mission associated with LHC data analysis and HEP in general. However, while advances in experimental and computational technologies have led to an exponential growth in the volume, velocity, and variety of data available for scientific discovery, advances in technologies to convert this data into actionable knowledge have fallen far short of what the HEP community needs to deliver timely and immediately impactful outcomes. Acceleration of the scientific knowledge discovery process is essential if DOE scientists are to continue making major contributions in HEP. Today’s worldwide analysis engine, serving several thousand scientists, will have to be commensurately extended in the cleverness of its algorithms, the automation of its processes, and the reach (discovery) of its computing to enable scientific understanding of the detailed nature of the Higgs boson. For example, the approximately forty different analysis methods used to investigate the detailed characteristics of the Higgs boson (many using machine learning techniques) must be combined in a mathematically rigorous fashion to arrive at an agreed-upon publishable result.

Specific challenges:
Federated semantic discovery: Interfaces, protocols and environments that support access to, use of, and interoperation across federated sets of resources governed and managed by a mix of different policies and controls that interoperate across streaming and “at rest” data sources. These include: models, algorithms, libraries, and reference implementations for a distributed non-hierarchical discovery service; semantics, methods, interfaces for life-cycle management (subscription, capture, provenance, assessment, validation, rejection) of heterogeneous sets of distributed tools, services and resources; a global environment that is robust in the face of failures and outages; and flexible high-performance data stores (going beyond schema driven) that scale and are friendly to interactive analytics

Resource description and understanding:
Distributed methods and implementations that allow resources (people, software, computing incl. data) to publish varying state and function for use by diverse clients. Mechanisms to handle arbitrary entity types in a uniform and common framework – including complex types such as heterogeneous data, incomplete and evolving information, and rapidly changing availability of computing, storage and other computational resources. Abstract data streaming and file-based data movement over the WAN/LAN and on exascale architectures to allow for real-time, collaborative decision making for scientific processes.

M0167 Gaps:
Data volumes are increasing. Shipping disks is clumsy, but there is no other obvious solution. Image processing algorithms are still a very active research area.

M0171 Gaps:
Analytics needs continued monitoring and improvement.

M0172 Gaps:
Computation of the simulation is both compute intensive and data intensive. Moreover, due to the unstructured and irregular nature of graph processing, the problem is not easily decomposable; it is therefore also bandwidth intensive. Hence, a supercomputer is more suitable than cloud-type clusters.

M0173 Gaps:
How can the heterogeneous features of hundreds of millions or billions of individuals, and models of cultural variations across countries, be assigned to individual agents? How can these large models be validated? Different types of models are needed (e.g., multiple contagions): disease, emotions, behaviors. Modeling of different urban infrastructure systems in which humans act is also required. With multiple replicates required to assess stochasticity, large amounts of output data are produced, increasing storage requirements.

M0176 Gaps:
HTC at scale for simulation science. Flexible data methods at scale for messy data. Machine learning and knowledge systems that integrate data from publications, experiments, and simulations to advance goal-driven thinking in materials design.

M0213 Gaps:
Indexing, retrieval, and distributed analysis. Visualization generation and transmission.


4. needs to support processing data in motion (streaming, fetching new content, tracking, etc.)
(6: M0078, M0090, M0103, M0164, M0165, M0166)
M0078 Gaps:
Processing data requires significant computing power, which poses challenges especially to clinical laboratories as they are starting to perform large-scale sequencing. Long-term storage of clinical sequencing data could be expensive. Analysis methods are quickly evolving. Many parts of the genome are challenging to analyze, and systematic errors are difficult to characterize.

M0090 Gaps:
Semantics (interpretation of multiple reanalysis products); data movement; database(s) with optimal structuring for 4-dimensional data mining.

M0103 Gaps:
Provide more rapid assessment of the identity, location, and condition of shipments, and provide detailed analytics and location of problems in the system in real time.

M0164 Gaps:
Analytics needs continued monitoring and improvement.

M0165 Gaps:
Search of the “deep web” (information behind query front ends). Ranking of responses sensitive to intrinsic value (as in PageRank) as well as advertising value. Linking to user profiles and social network data.

M0166 Gaps:
The translation of scientific results into new knowledge, solutions, policies, and decisions is foundational to the science mission associated with LHC data analysis and HEP in general. However, while advances in experimental and computational technologies have led to an exponential growth in the volume, velocity, and variety of data available for scientific discovery, advances in technologies to convert this data into actionable knowledge have fallen far short of what the HEP community needs to deliver timely and immediately impactful outcomes. Acceleration of the scientific knowledge discovery process is essential if DOE scientists are to continue making major contributions in HEP. Today's worldwide analysis engine, serving several thousand scientists, will have to be commensurately extended in the cleverness of its algorithms, the automation of its processes, and the reach (discovery) of its computing to enable scientific understanding of the detailed nature of the Higgs boson. For example, the approximately forty different analysis methods used to investigate the detailed characteristics of the Higgs boson (many using machine learning techniques) must be combined in a mathematically rigorous fashion to produce an agreed-upon publishable result.

Specific challenges:
Federated semantic discovery: Interfaces, protocols and environments that support access to, use of, and interoperation across federated sets of resources governed and managed by a mix of different policies and controls that interoperate across streaming and “at rest” data sources. These include: models, algorithms, libraries, and reference implementations for a distributed non-hierarchical discovery service; semantics, methods, interfaces for life-cycle management (subscription, capture, provenance, assessment, validation, rejection) of heterogeneous sets of distributed tools, services and resources; a global environment that is robust in the face of failures and outages; and flexible high-performance data stores (going beyond schema driven) that scale and are friendly to interactive analytics

Resource description and understanding:
Distributed methods and implementations that allow resources (people, software, computing incl. data) to publish varying state and function for use by diverse clients. Mechanisms to handle arbitrary entity types in a uniform and common framework – including complex types such as heterogeneous data, incomplete and evolving information, and rapidly changing availability of computing, storage and other computational resources. Abstract data streaming and file-based data movement over the WAN/LAN and on exascale architectures to allow for real-time, collaborative decision making for scientific processes.


Capability
General Requirement
1. needs to support legacy and advanced software packages (subcomponent: SaaS)
(28: M0078, M0089, M0127, M0136, M0140, M0141, M0158, M0160, M0161, M0164, M0166, M0167, M0172, M0173, M0174, M0176, M0177, M0183, M0188, M0191, M0209, M0210, M0212, M0213, M0214, M0215, M0219, M0223)
M0078 Gaps:
Processing data requires significant computing power, which poses challenges especially to clinical laboratories as they are starting to perform large-scale sequencing. Long-term storage of clinical sequencing data could be expensive. Analysis methods are quickly evolving. Many parts of the genome are challenging to analyze, and systematic errors are difficult to characterize.

M0089 Gaps:
Extremely large size; multi-dimensional; disease-specific analytics; correlation with other data types (clinical data, -omic data).

M0127 Gaps:
Data processing pipeline requires human inspection and intervention. Limited downstream data pipelines for custom users. Cloud architectures for distributing entire data product collections to downstream consumers should be investigated and adopted.

M0136 Gaps:
Processing requirements for even modest quantities of data are extreme. Though the trained representations can make use of many terabytes of data, the primary challenge is in processing all of the data during training. Current state-of-the-art deep learning systems are capable of using neural networks with more than 10 billion free parameters (akin to synapses in the brain), and necessitate trillions of floating point operations per training example. Distributing these computations over high-performance infrastructure is a major challenge for which we currently use a largely custom software system.
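
A minimal sketch of the data-parallel pattern involved (our illustration, not the group's custom system; the least-squares model is a tiny stand-in for a neural network): each worker computes a gradient on its shard and the updates are averaged, which is exactly the communication step that becomes hard at 10-billion-parameter scale:

    import numpy as np

    rng = np.random.default_rng(0)
    w = np.zeros(4)                                  # model parameters
    X, y = rng.normal(size=(64, 4)), rng.normal(size=64)

    def shard_gradient(w, Xs, ys):
        """Least-squares gradient computed on one worker's shard."""
        return 2 * Xs.T @ (Xs @ w - ys) / len(ys)

    for step in range(100):
        shards = np.array_split(np.arange(64), 4)          # 4 "workers"
        grads = [shard_gradient(w, X[s], y[s]) for s in shards]
        w -= 0.01 * np.mean(grads, axis=0)                 # all-reduce average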

M0140 Gaps:
For an individualized cohort, we will effectively be building a datamart for each patient, since the critical properties and indices will be specific to each patient. Given the number of patients, this becomes an impractical approach. Fundamentally, the paradigm changes from relational row-column lookup to semantic graph traversal.

M0141 Gaps:
Variety, multi-type data: SQL and no-SQL, distributed multi-source data. Visualisation, distributed sensor networks. Data storage and archiving, data exchange and integration; data linkage: from the initial observation data to processed data and reported/visualised data.

  • Historical unique data
  • Curated (authorized) reference data (i.e. species names lists), algorithms, software code, workflows
  • Processed (secondary) data serving as input for other researchers
  • Provenance (and persistent identification (PID)) control of data, algorithms, and workflows

M0158 Gaps:
Parallel algorithms are necessary to analyze massive networks. Unlike many kinds of structured data, network data is difficult to partition. The main difficulty in partitioning a network is that different algorithms require different partitioning schemes for efficient operation. Moreover, most network measures are global in nature and require either (i) extensive duplication of data across the partitions or (ii) very large communication overhead resulting from the required movement of data. These issues become significant challenges for big networks. Computing dynamics over networks is harder still, since the network structure often interacts with the dynamical process being studied. CINET enables a large class of operations across a wide variety of graphs, in terms of both structure and size. Unlike other compute- and data-intensive systems, such as parallel databases or CFD, performance of graph computations is sensitive to the underlying architecture. Hence, a unique challenge in CINET is managing the mapping of a workload (graph type + operation) to a machine whose architecture and runtime are conducive to it.

M0160 Gaps:
Dealing with real-time analysis of large volumes of data. Providing a scalable infrastructure that can allocate resources, storage space, etc. on demand as data volume increases over time.

M0161 Gaps:
The database contains ~400M documents, roughly 80M of them unique, and receives 5-700k new uploads on a weekday. Thus a major challenge is clustering matching documents together in a computationally efficient (scalable and parallelized) way when they are uploaded from different sources and have been slightly modified via third-party annotation tools or publisher watermarks and cover pages.
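
One standard technique for this kind of near-duplicate clustering is MinHash signatures; the sketch below is our illustration (not necessarily what the submitter runs): similar documents agree on most signature positions even after small modifications such as watermarks, so signatures can be bucketed and clustered in parallel:

    import hashlib

    def minhash(text, num_hashes=32, shingle=5):
        """Signature of a document: per-seed minimum over hashed shingles."""
        shingles = {text[i:i + shingle]
                    for i in range(max(1, len(text) - shingle + 1))}
        return [min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
                    for s in shingles)
                for seed in range(num_hashes)]

    a = minhash("the quick brown fox jumps over the lazy dog")
    b = minhash("the quick brown fox jumps over the lazy dog (watermarked)")
    overlap = sum(x == y for x, y in zip(a, b)) / len(a)
    print(overlap)  # high fraction of matching positions => likely duplicates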

M0164 Gaps:
Analytics needs continued monitoring and improvement.

M0166 Gaps:
The translation of scientific results into new knowledge, solutions, policies, and decisions is foundational to the science mission associated with LHC data analysis and HEP in general. However, while advances in experimental and computational technologies have led to an exponential growth in the volume, velocity, and variety of data available for scientific discovery, advances in technologies to convert this data into actionable knowledge have fallen far short of what the HEP community needs to deliver timely and immediately impactful outcomes. Acceleration of the scientific knowledge discovery process is essential if DOE scientists are to continue making major contributions in HEP. Today's worldwide analysis engine, serving several thousand scientists, will have to be commensurately extended in the cleverness of its algorithms, the automation of its processes, and the reach (discovery) of its computing to enable scientific understanding of the detailed nature of the Higgs boson. For example, the approximately forty different analysis methods used to investigate the detailed characteristics of the Higgs boson (many using machine learning techniques) must be combined in a mathematically rigorous fashion to produce an agreed-upon publishable result.

Specific challenges:
Federated semantic discovery: Interfaces, protocols and environments that support access to, use of, and interoperation across federated sets of resources governed and managed by a mix of different policies and controls that interoperate across streaming and “at rest” data sources. These include: models, algorithms, libraries, and reference implementations for a distributed non-hierarchical discovery service; semantics, methods, interfaces for life-cycle management (subscription, capture, provenance, assessment, validation, rejection) of heterogeneous sets of distributed tools, services and resources; a global environment that is robust in the face of failures and outages; and flexible high-performance data stores (going beyond schema driven) that scale and are friendly to interactive analytics

Resource description and understanding:
Distributed methods and implementations that allow resources (people, software, computing incl. data) to publish varying state and function for use by diverse clients. Mechanisms to handle arbitrary entity types in a uniform and common framework – including complex types such as heterogeneous data, incomplete and evolving information, and rapidly changing availability of computing, storage and other computational resources. Abstract data streaming and file-based data movement over the WAN/LAN and on exascale architectures to allow for real-time, collaborative decision making for scientific processes.

M0167 Gaps:
Data volumes are increasing. Shipping disks is clumsy, but there is no other obvious solution. Image processing algorithms are still a very active research area.

M0172 Gaps:
Computation of the simulation is both compute intensive and data intensive. Moreover, due to the unstructured and irregular nature of graph processing, the problem is not easily decomposable; it is therefore also bandwidth intensive. Hence, a supercomputer is more suitable than cloud-type clusters.

M0173 Gaps:
How can the heterogeneous features of hundreds of millions or billions of individuals, and models of cultural variations across countries, be assigned to individual agents? How can these large models be validated? Different types of models are needed (e.g., multiple contagions): disease, emotions, behaviors. Modeling of different urban infrastructure systems in which humans act is also required. With multiple replicates required to assess stochasticity, large amounts of output data are produced, increasing storage requirements.

M0174 Gaps:
Data is abundant in many areas of medicine. The key issue is that there can be too much data (images, genetic sequences, etc.), which can complicate the analysis. The real challenge lies in aligning the data and merging it from multiple sources into a form that is useful for a combined analysis. The other issue is that sometimes a large amount of data is available about a single subject, while the number of subjects themselves is not very high (i.e., data imbalance). This can result in learning algorithms picking up random correlations between the multiple data types as important features in the analysis. Hence, robust learning methods that can faithfully model the data are of paramount importance. Another aspect of data imbalance is the occurrence of positive examples (i.e., cases). The incidence of certain diseases may be rare, making the ratio of cases to controls extremely skewed and making it possible for the learning algorithms to model noise instead of the examples.
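
A common mitigation for the skewed case/control ratio is to re-weight the rare positives in the training loss; the sketch below (our illustration on synthetic data, not the use case's method) shows the idea for a logistic-regression gradient:

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.normal(size=(1000, 3))
    y = (rng.random(1000) < 0.02).astype(float)   # ~2% cases: heavily skewed

    w = np.zeros(3)
    pos_weight = (len(y) - y.sum()) / y.sum()     # up-weight each rare case

    for _ in range(500):
        p = 1 / (1 + np.exp(-X @ w))              # predicted probabilities
        sample_w = np.where(y == 1, pos_weight, 1.0)
        grad = X.T @ (sample_w * (p - y)) / len(y)  # weighted logistic gradient
        w -= 0.1 * grad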

M0176 Gaps:
HTC at scale for simulation science. Flexible data methods at scale for messy data. Machine learning and knowledge systems that integrate data from publications, experiments, and simulations to advance goal-driven thinking in materials design.

M0177 Gaps:
Overcoming the systematic errors and bias in large-scale, heterogeneous clinical data to support decision-making in research, patient care, and administrative use cases requires complex multistage processing and analytics that demand substantial computing power. Further, the optimal techniques for accurately and effectively deriving knowledge from observational clinical data are nascent.

M0183 Gaps:
Translation across diverse and large datasets that cross domains and scales.

M0188 Gaps:
The best tool for dealing with the heterogeneity of biological data is still the relational database management system (RDBMS). Unfortunately, it does not scale to the current volume of data. NoSQL solutions aim to provide an alternative, but they do not always lend themselves to real-time interactive use or rapid, parallel bulk loading, and they sometimes have issues regarding robustness. Our current approach is ad hoc and custom, relying mainly on a Linux cluster and the file system to supplement the Oracle RDBMS. The custom solutions often rely on knowledge of the peculiarities of the data, allowing us to devise horizontal partitioning schemes as well as inversions of data organization when applicable.
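
A minimal sketch of hash-based horizontal partitioning (our illustration; the record fields such as gene_id are hypothetical), the kind of scheme described above for spreading one logical table across several physical stores so that point queries touch only one partition:

    NUM_PARTITIONS = 4
    partitions = [[] for _ in range(NUM_PARTITIONS)]

    def insert(record):
        """Route a record to a partition by hashing its key."""
        partitions[hash(record["gene_id"]) % NUM_PARTITIONS].append(record)

    def lookup(gene_id):
        """A point query touches exactly one partition."""
        shard = partitions[hash(gene_id) % NUM_PARTITIONS]
        return [r for r in shard if r["gene_id"] == gene_id]

    insert({"gene_id": "BRCA2", "organism": "human"})
    print(lookup("BRCA2"))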

M0191 Gaps:
HTC at scale for simulation science. Flexible data methods at scale for messy data. Machine learning and knowledge systems that drive pixel based data toward biological objects and models.

M0209 Gaps:
New statistical techniques for understanding the limitations in simulation data would be beneficial. Often there is not enough computing time to generate all the simulations one wants, so there is a reliance on emulators to bridge the gaps. Techniques are needed for handling Cholesky decomposition for thousands of simulations with matrices on the order of 1M on a side.
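
At this scale the standard building block is a blocked (right-looking) Cholesky factorization, which processes the matrix panel by panel and is the form used by out-of-core and distributed solvers. A small NumPy sketch of the algorithm (our illustration, demonstrated on a matrix far smaller than 1M on a side):

    import numpy as np

    def blocked_cholesky(A, b=64):
        """Right-looking blocked Cholesky: returns L with A = L @ L.T."""
        n = A.shape[0]
        L = np.zeros_like(A)
        for j in range(0, n, b):
            jb = min(b, n - j)
            # Update the diagonal block using previously computed panels.
            S = A[j:j+jb, j:j+jb] - L[j:j+jb, :j] @ L[j:j+jb, :j].T
            L[j:j+jb, j:j+jb] = np.linalg.cholesky(S)
            # Triangular solve for the panel below the diagonal block.
            if j + jb < n:
                T = A[j+jb:, j:j+jb] - L[j+jb:, :j] @ L[j:j+jb, :j].T
                L[j+jb:, j:j+jb] = np.linalg.solve(L[j:j+jb, j:j+jb], T.T).T
        return L

    rng = np.random.default_rng(0)
    M = rng.normal(size=(256, 256))
    A = M @ M.T + 256 * np.eye(256)   # symmetric positive definite test matrix
    L = blocked_cholesky(A)
    print(np.allclose(L @ L.T, A))    # True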

M0210 Gaps:
Data movement and bookkeeping (file and event level meta-data).

M0212 Gaps:

M0213 Gaps:
Indexing, retrieval, and distributed analysis. Visualization generation and transmission.

M0214 Gaps:
Processing the volume of data in NRT to support alerting and situational awareness.

M0215 Gaps:
1. Big (or even moderate-size) data over tactical networks. 2. Data currently exists in disparate silos, which must be made accessible through a semantically integrated data space. 3. Most critical data is either unstructured or imagery/video, which requires significant processing to extract entities and information.

M0219 Gaps:
Improving recommendation systems that reduce costs and improve quality while providing confidentiality safeguards that are reliable and publicly auditable.

M0223 Gaps:
Scalable real-time analytics over large data streams. Low-latency analytics for operational needs. Federated analytics at utility and microgrid levels. Robust time-series analytics over millions of customers' consumption data. Customer behavior modeling and targeted curtailment requests.
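
A minimal sketch of the low-latency, per-customer streaming pattern (our illustration, not the utility's system; names such as ingest are invented): keep a sliding window of recent readings per customer and flag readings that deviate sharply from the rolling mean:

    from collections import defaultdict, deque

    WINDOW = 96                     # e.g., one day of 15-minute readings
    history = defaultdict(lambda: deque(maxlen=WINDOW))

    def ingest(customer_id, kwh, threshold=3.0):
        """Return True if this reading deviates from the rolling mean."""
        window = history[customer_id]
        alert = len(window) == WINDOW and kwh > threshold * (sum(window) / WINDOW)
        window.append(kwh)
        return alert

    for reading in [1.0] * 96 + [9.0]:
        if ingest("customer-7", reading):
            print("curtailment-check alert:", reading)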


2. needs to support legacy and advanced computing platforms (subcomponent: PaaS)
(16: M0078, M0089, M0127, M0158, M0160, M0161, M0164, M0171, M0172, M0173, M0177, M0182, M0188, M0191, M0209, M0223)
M0078 Gaps:
Processing data requires significant computing power, which poses challenges especially to clinical laboratories as they are starting to perform large-scale sequencing. Long-term storage of clinical sequencing data could be expensive. Analysis methods are quickly evolving. Many parts of the genome are challenging to analyze, and systematic errors are difficult to characterize.

M0089 Gaps:
Extremely large size; multi-dimensional; disease-specific analytics; correlation with other data types (clinical data, -omic data).

M0127 Gaps:
Data processing pipeline requires human inspection and intervention. Limited downstream data pipelines for custom users. Cloud architectures for distributing entire data product collections to downstream consumers should be investigated and adopted.

M0158 Gaps:
Parallel algorithms are necessary to analyze massive networks. Unlike many kinds of structured data, network data is difficult to partition. The main difficulty in partitioning a network is that different algorithms require different partitioning schemes for efficient operation. Moreover, most network measures are global in nature and require either (i) extensive duplication of data across the partitions or (ii) very large communication overhead resulting from the required movement of data. These issues become significant challenges for big networks. Computing dynamics over networks is harder still, since the network structure often interacts with the dynamical process being studied. CINET enables a large class of operations across a wide variety of graphs, in terms of both structure and size. Unlike other compute- and data-intensive systems, such as parallel databases or CFD, performance of graph computations is sensitive to the underlying architecture. Hence, a unique challenge in CINET is managing the mapping of a workload (graph type + operation) to a machine whose architecture and runtime are conducive to it.

M0160 Gaps:
Dealing with real-time analysis of large volumes of data. Providing a scalable infrastructure that can allocate resources, storage space, etc. on demand as data volume increases over time.

M0161 Gaps:
The database contains ~400M documents, roughly 80M of them unique, and receives 5-700k new uploads on a weekday. Thus a major challenge is clustering matching documents together in a computationally efficient (scalable and parallelized) way when they are uploaded from different sources and have been slightly modified via third-party annotation tools or publisher watermarks and cover pages.

M0164 Gaps:
Analytics needs continued monitoring and improvement.

M0171 Gaps:
Analytics needs continued monitoring and improvement.

M0172 Gaps:
Computation of the simulation is both compute intensive and data intensive. Moreover, due to the unstructured and irregular nature of graph processing, the problem is not easily decomposable; it is therefore also bandwidth intensive. Hence, a supercomputer is more suitable than cloud-type clusters.

M0173 Gaps:
How can the heterogeneous features of hundreds of millions or billions of individuals, and models of cultural variations across countries, be assigned to individual agents? How can these large models be validated? Different types of models are needed (e.g., multiple contagions): disease, emotions, behaviors. Modeling of different urban infrastructure systems in which humans act is also required. With multiple replicates required to assess stochasticity, large amounts of output data are produced, increasing storage requirements.

M0177 Gaps:
Overcoming the systematic errors and bias in large-scale, heterogeneous clinical data to support decision-making in research, patient care, and administrative use cases requires complex multistage processing and analytics that demand substantial computing power. Further, the optimal techniques for accurately and effectively deriving knowledge from observational clinical data are nascent.

M0182 Gaps:
None.

M0188 Gaps:
The best tool for dealing with the heterogeneity of biological data is still the relational database management system (RDBMS). Unfortunately, it does not scale to the current volume of data. NoSQL solutions aim to provide an alternative, but they do not always lend themselves to real-time interactive use or rapid, parallel bulk loading, and they sometimes have issues regarding robustness. Our current approach is ad hoc and custom, relying mainly on a Linux cluster and the file system to supplement the Oracle RDBMS. The custom solutions often rely on knowledge of the peculiarities of the data, allowing us to devise horizontal partitioning schemes as well as inversions of data organization when applicable.

M0191 Gaps:
HTC at scale for simulation science. Flexible data methods at scale for messy data. Machine learning and knowledge systems that drive pixel based data toward biological objects and models.

M0209 Gaps:
New statistical techniques for understanding the limitations in simulation data would be beneficial. Often there is not enough computing time to generate all the simulations one wants, so there is a reliance on emulators to bridge the gaps. Techniques are needed for handling Cholesky decomposition for thousands of simulations with matrices on the order of 1M on a side.

M0223 Gaps:
Scalable real-time analytics over large data streams. Low-latency analytics for operational needs. Federated analytics at utility and microgrid levels. Robust time-series analytics over millions of customers' consumption data. Customer behavior modeling and targeted curtailment requests.


3. needs to support legacy and advanced distributed computing clusters, co-processors, and I/O processing (subcomponent: IaaS)
(23: M0015, M0078, M0089, M0090, M0129, M0136, M0140, M0141, M0155, M0158, M0161, M0164, M0166, M0167, M0173, M0174, M0176, M0177, M0185, M0186, M0191, M0214, M0215)
M0015 Gaps:

M0078 Gaps:
Processing data requires significant computing power, which poses challenges especially to clinical laboratories as they are starting to perform large-scale sequencing. Long-term storage of clinical sequencing data could be expensive. Analysis methods are quickly evolving. Many parts of the genome are challenging to analyze, and systematic errors are difficult to characterize.

M0089 Gaps:
Extremely large size; multi-dimensional; disease-specific analytics; correlation with other data types (clinical data, -omic data).

M0090 Gaps:
Semantics (interpretation of multiple reanalysis products); data movement; database(s) with optimal structuring for 4-dimensional data mining.

M0129 Gaps:
A big question is how to use cloud computing to enable better use of climate science's earthbound compute and data resources. Cloud computing provides a new tier in the data services stack: a cloud-based layer where agile customization occurs and enterprise-level products are transformed to meet the specialized requirements of applications and consumers. It helps close the gap between the world of traditional high-performance computing, which, at least for now, resides in a finely tuned climate modeling environment at the enterprise level, and our new customers, whose expectations and manner of work are increasingly influenced by the smart mobility megatrend.

M0136 Gaps:
Processing requirements for even modest quantities of data are extreme. Though the trained representations can make use of many terabytes of data, the primary challenge is in processing all of the data during training. Current state-of-the-art deep learning systems are capable of using neural networks with more than 10 billion free parameters (akin to synapses in the brain), and necessitate trillions of floating point operations per training example. Distributing these computations over high-performance infrastructure is a major challenge for which we currently use a largely custom software system.

M0140 Gaps:
For an individualized cohort, we will effectively be building a datamart for each patient, since the critical properties and indices will be specific to each patient. Given the number of patients, this becomes an impractical approach. Fundamentally, the paradigm changes from relational row-column lookup to semantic graph traversal.

M0141 Gaps:
Variety, multi-type data: SQL and no-SQL, distributed multi-source data. Visualisation, distributed sensor networks. Data storage and archiving, data exchange and integration; data linkage: from the initial observation data to processed data and reported/visualised data.

  • Historical unique data
  • Curated (authorized) reference data (i.e. species names lists), algorithms, software code, workflows
  • Processed (secondary) data serving as input for other researchers
  • Provenance (and persistent identification (PID)) control of data, algorithms, and workflows

M0155 Gaps:
High throughput of data for reduction into higher levels. Discovery of meaningful insights from low-value-density data needs new approaches to deep, complex analysis, e.g., using machine learning, statistical modelling, graph algorithms, etc., which go beyond traditional approaches to space physics.

M0158 Gaps:
Parallel algorithms are necessary to analyze massive networks. Unlike many kinds of structured data, network data is difficult to partition. The main difficulty in partitioning a network is that different algorithms require different partitioning schemes for efficient operation. Moreover, most network measures are global in nature and require either (i) extensive duplication of data across the partitions or (ii) very large communication overhead resulting from the required movement of data. These issues become significant challenges for big networks. Computing dynamics over networks is harder still, since the network structure often interacts with the dynamical process being studied. CINET enables a large class of operations across a wide variety of graphs, in terms of both structure and size. Unlike other compute- and data-intensive systems, such as parallel databases or CFD, performance of graph computations is sensitive to the underlying architecture. Hence, a unique challenge in CINET is managing the mapping of a workload (graph type + operation) to a machine whose architecture and runtime are conducive to it.

M0161 Gaps:
The database contains ~400M documents, roughly 80M of them unique, and receives 5-700k new uploads on a weekday. Thus a major challenge is clustering matching documents together in a computationally efficient (scalable and parallelized) way when they are uploaded from different sources and have been slightly modified via third-party annotation tools or publisher watermarks and cover pages.

M0164 Gaps:
Analytics needs continued monitoring and improvement.

M0166 Gaps:
The translation of scientific results into new knowledge, solutions, policies, and decisions is foundational to the science mission associated with LHC data analysis and HEP in general. However, while advances in experimental and computational technologies have led to an exponential growth in the volume, velocity, and variety of data available for scientific discovery, advances in technologies to convert this data into actionable knowledge have fallen far short of what the HEP community needs to deliver timely and immediately impactful outcomes. Acceleration of the scientific knowledge discovery process is essential if DOE scientists are to continue making major contributions in HEP. Today's worldwide analysis engine, serving several thousand scientists, will have to be commensurately extended in the cleverness of its algorithms, the automation of its processes, and the reach (discovery) of its computing to enable scientific understanding of the detailed nature of the Higgs boson. For example, the approximately forty different analysis methods used to investigate the detailed characteristics of the Higgs boson (many using machine learning techniques) must be combined in a mathematically rigorous fashion to produce an agreed-upon publishable result.

Specific challenges:
Federated semantic discovery: Interfaces, protocols and environments that support access to, use of, and interoperation across federated sets of resources governed and managed by a mix of different policies and controls that interoperate across streaming and “at rest” data sources. These include: models, algorithms, libraries, and reference implementations for a distributed non-hierarchical discovery service; semantics, methods, interfaces for life-cycle management (subscription, capture, provenance, assessment, validation, rejection) of heterogeneous sets of distributed tools, services and resources; a global environment that is robust in the face of failures and outages; and flexible high-performance data stores (going beyond schema driven) that scale and are friendly to interactive analytics

Resource description and understanding:
Distributed methods and implementations that allow resources (people, software, computing incl. data) to publish varying state and function for use by diverse clients. Mechanisms to handle arbitrary entity types in a uniform and common framework – including complex types such as heterogeneous data, incomplete and evolving information, and rapidly changing availability of computing, storage and other computational resources. Abstract data streaming and file-based data movement over the WAN/LAN and on exascale architectures to allow for real-time, collaborative decision making for scientific processes.

M0167 Gaps:
Data volumes are increasing. Shipping disks is clumsy, but there is no other obvious solution. Image processing algorithms are still a very active research area.

M0173 Gaps:
How can the heterogeneous features of hundreds of millions or billions of individuals, and models of cultural variations across countries, be assigned to individual agents? How can these large models be validated? Different types of models are needed (e.g., multiple contagions): disease, emotions, behaviors. Modeling of different urban infrastructure systems in which humans act is also required. With multiple replicates required to assess stochasticity, large amounts of output data are produced, increasing storage requirements.

M0174 Gaps:
Data is abundant in many areas of medicine. The key issue is that there can be too much data (images, genetic sequences, etc.), which can complicate the analysis. The real challenge lies in aligning the data and merging it from multiple sources into a form that is useful for a combined analysis. The other issue is that sometimes a large amount of data is available about a single subject, while the number of subjects themselves is not very high (i.e., data imbalance). This can result in learning algorithms picking up random correlations between the multiple data types as important features in the analysis. Hence, robust learning methods that can faithfully model the data are of paramount importance. Another aspect of data imbalance is the occurrence of positive examples (i.e., cases). The incidence of certain diseases may be rare, making the ratio of cases to controls extremely skewed and making it possible for the learning algorithms to model noise instead of the examples.

M0176 Gaps:
HTC at scale for simulation science. Flexible data methods at scale for messy data. Machine learning and knowledge systems that integrate data from publications, experiments, and simulations to advance goal-driven thinking in materials design.

M0177 Gaps:
Overcoming the systematic errors and bias in large-scale, heterogeneous clinical data to support decision-making in research, patient care, and administrative use cases requires complex multistage processing and analytics that demand substantial computing power. Further, the optimal techniques for accurately and effectively deriving knowledge from observational clinical data are nascent.

M0185 Gaps:
Storage, sharing, and analysis of 10s of PBs of observational and simulated data.

M0186 Gaps:
The rapidly growing size of datasets makes scientific analysis a challenge. The need to write data from simulations is outpacing supercomputers’ ability to accommodate this need.

M0191 Gaps:
HTC at scale for simulation science. Flexible data methods at scale for messy data. Machine learning and knowledge systems that drive pixel based data toward biological objects and models.

M0214 Gaps:
Processing the volume of data in NRT to support alerting and situational awareness.

M0215 Gaps:
1. Big (or even moderate-size) data over tactical networks. 2. Data currently exists in disparate silos, which must be made accessible through a semantically integrated data space. 3. Most critical data is either unstructured or imagery/video, which requires significant processing to extract entities and information.


4. needs to support elastic data transmission (subcomponent: networking)
(14: M0089, M0090, M0103, M0136, M0141, M0158, M0160, M0172, M0173, M0176, M0191, M0210, M0214, M0215)
M0089 Gaps:
Extremely large size; multi-dimensional; disease-specific analytics; correlation with other data types (clinical data, -omic data).

M0090 Gaps:
Semantics (interpretation of multiple reanalysis products); data movement; database(s) with optimal structuring for 4-dimensional data mining.

M0103 Gaps:
Provide more rapid assessment of the identity, location, and condition of shipments, and provide detailed analytics and location of problems in the system in real time.

M0136 Gaps:
Processing requirements for even modest quantities of data are extreme. Though the trained representations can make use of many terabytes of data, the primary challenge is in processing all of the data during training. Current state-of-the-art deep learning systems are capable of using neural networks with more than 10 billion free parameters (akin to synapses in the brain), and necessitate trillions of floating point operations per training example. Distributing these computations over high-performance infrastructure is a major challenge for which we currently use a largely custom software system.

M0141 Gaps:
Variety, multi-type data: SQL and no-SQL, distributed multi-source data. Visualisation, distributed sensor networks. Data storage and archiving, data exchange and integration; data linkage: from the initial observation data to processed data and reported/visualised data.

  • Historical unique data
  • Curated (authorized) reference data (i.e. species names lists), algorithms, software code, workflows
  • Processed (secondary) data serving as input for other researchers
  • Provenance (and persistent identification (PID)) control of data, algorithms, and workflows

M0158 Gaps:
Parallel algorithms are necessary to analyze massive networks. Unlike many kinds of structured data, network data is difficult to partition. The main difficulty in partitioning a network is that different algorithms require different partitioning schemes for efficient operation. Moreover, most network measures are global in nature and require either (i) extensive duplication of data across the partitions or (ii) very large communication overhead resulting from the required movement of data. These issues become significant challenges for big networks. Computing dynamics over networks is harder still, since the network structure often interacts with the dynamical process being studied. CINET enables a large class of operations across a wide variety of graphs, in terms of both structure and size. Unlike other compute- and data-intensive systems, such as parallel databases or CFD, performance of graph computations is sensitive to the underlying architecture. Hence, a unique challenge in CINET is managing the mapping of a workload (graph type + operation) to a machine whose architecture and runtime are conducive to it.

M0160 Gaps:
Dealing with real-time analysis of large volumes of data. Providing a scalable infrastructure that can allocate resources, storage space, etc. on demand as data volume increases over time.

M0172 Gaps:
Computation of the simulation is both compute intensive and data intensive. Moreover, due to the unstructured and irregular nature of graph processing, the problem is not easily decomposable; it is therefore also bandwidth intensive. Hence, a supercomputer is more suitable than cloud-type clusters.

M0173 Gaps:
How can the heterogeneous features of hundreds of millions or billions of individuals, and models of cultural variations across countries, be assigned to individual agents? How can these large models be validated? Different types of models are needed (e.g., multiple contagions): disease, emotions, behaviors. Modeling of different urban infrastructure systems in which humans act is also required. With multiple replicates required to assess stochasticity, large amounts of output data are produced, increasing storage requirements.

M0176 Gaps:
HTC at scale for simulation science. Flexible data methods at scale for messy data. Machine learning and knowledge systems that integrate data from publications, experiments, and simulations to advance goal-driven thinking in materials design.

M0191 Gaps:
HTC at scale for simulation science. Flexible data methods at scale for messy data. Machine learning and knowledge systems that drive pixel based data toward biological objects and models.

M0210 Gaps:
Data movement and bookkeeping (file and event level meta-data).

M0214 Gaps:
Processing the volume of data in NRT to support alerting and situational awareness.

M0215 Gaps:
1. Big (or even moderate-size) data over tactical networks. 2. Data currently exists in disparate silos, which must be made accessible through a semantically integrated data space. 3. Most critical data is either unstructured or imagery/video, which requires significant processing to extract entities and information.


5. needs to support legacy, large, and advanced distributed data storage (subcomponent: storage)
(28: M0078, M0089, M0127, M0140, M0147, M0148, M0155, M0157, M0158, M0160, M0161, M0164, M0165, M0166, M0167, M0170, M0171, M0172, M0173, M0174, M0176, M0182, M0185, M0188, M0209, M0210, M0215, M0219)
M0078 Gaps:
Processing data requires significant computing power, which poses challenges especially to clinical laboratories as they are starting to perform large-scale sequencing. Long-term storage of clinical sequencing data could be expensive. Analysis methods are quickly evolving. Many parts of the genome are challenging to analyze, and systematic errors are difficult to characterize.

M0089 Gaps:
Extremely large size; multi-dimensional; disease-specific analytics; correlation with other data types (clinical data, -omic data).

M0127 Gaps:
Data processing pipeline requires human inspection and intervention. Limited downstream data pipelines for custom users. Cloud architectures for distributing entire data product collections to downstream consumers should be investigated and adopted.

M0140 Gaps:
For an individualized cohort, we will effectively be building a datamart for each patient, since the critical properties and indices will be specific to each patient. Given the number of patients, this becomes an impractical approach. Fundamentally, the paradigm changes from relational row-column lookup to semantic graph traversal.

M0147 Gaps:
Preserve data for a long time scale.

M0148 Gaps:
Perform pre-processing and long-term management of large and varied data. Search huge amounts of data. Ensure high relevancy and recall. Data sources may be distributed in different clouds in the future.
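
As a reminder of what relevancy and recall measure for the search workload above, a minimal worked example (ours; the document IDs are invented):

    # Precision and recall for one search query over the archive.
    retrieved = {"doc1", "doc2", "doc3", "doc4"}   # what the search returned
    relevant  = {"doc2", "doc4", "doc7"}           # what it should have found

    true_pos = len(retrieved & relevant)
    precision = true_pos / len(retrieved)   # 2/4 = 0.50
    recall    = true_pos / len(relevant)    # 2/3 ~ 0.67
    print(precision, recall)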

M0155 Gaps:
High throughput of data for reduction into higher levels. Discovery of meaningful insights from low-value-density data needs new approaches to deep, complex analysis, e.g., using machine learning, statistical modelling, graph algorithms, etc., which go beyond traditional approaches to space physics.

M0157 Gaps:
1. Real-time handling of extremely high volumes of data. 2. Data staging to mirror archives. 3. Integrated data access and discovery. 4. Data processing and analysis.

M0158 Gaps:
Parallel algorithms are necessary to analyze massive networks. Unlike many kinds of structured data, network data is difficult to partition. The main difficulty in partitioning a network is that different algorithms require different partitioning schemes for efficient operation. Moreover, most network measures are global in nature and require either (i) extensive duplication of data across the partitions or (ii) very large communication overhead resulting from the required movement of data. These issues become significant challenges for big networks. Computing dynamics over networks is harder still, since the network structure often interacts with the dynamical process being studied. CINET enables a large class of operations across a wide variety of graphs, in terms of both structure and size. Unlike other compute- and data-intensive systems, such as parallel databases or CFD, performance of graph computations is sensitive to the underlying architecture. Hence, a unique challenge in CINET is managing the mapping of a workload (graph type + operation) to a machine whose architecture and runtime are conducive to it.

M0160 Gaps:
Dealing with real-time analysis of large volumes of data. Providing a scalable infrastructure that can allocate resources, storage space, etc. on demand as data volume increases over time.

M0161 Gaps:
The database contains ~400M documents, roughly 80M of them unique, and receives 5-700k new uploads on a weekday. Thus a major challenge is clustering matching documents together in a computationally efficient (scalable and parallelized) way when they are uploaded from different sources and have been slightly modified via third-party annotation tools or publisher watermarks and cover pages.

M0164 Gaps:
Analytics needs continued monitoring and improvement.

M0165 Gaps:
Search of the “deep web” (information behind query front ends). Ranking of responses sensitive to intrinsic value (as in PageRank) as well as advertising value. Linking to user profiles and social network data.

M0166 Gaps:
The translation of scientific results into new knowledge, solutions, policies, and decisions is foundational to the science mission associated with LHC data analysis and HEP in general. However, while advances in experimental and computational technologies have led to an exponential growth in the volume, velocity, and variety of data available for scientific discovery, advances in technologies to convert this data into actionable knowledge have fallen far short of what the HEP community needs to deliver timely and immediately impactful outcomes. Acceleration of the scientific knowledge discovery process is essential if DOE scientists are to continue making major contributions in HEP. Today's worldwide analysis engine, serving several thousand scientists, will have to be commensurately extended in the cleverness of its algorithms, the automation of its processes, and the reach (discovery) of its computing to enable scientific understanding of the detailed nature of the Higgs boson. For example, the approximately forty different analysis methods used to investigate the detailed characteristics of the Higgs boson (many using machine learning techniques) must be combined in a mathematically rigorous fashion to produce an agreed-upon publishable result.

Specific challenges:
Federated semantic discovery: Interfaces, protocols and environments that support access to, use of, and interoperation across federated sets of resources governed and managed by a mix of different policies and controls that interoperate across streaming and “at rest” data sources. These include: models, algorithms, libraries, and reference implementations for a distributed non-hierarchical discovery service; semantics, methods, interfaces for life-cycle management (subscription, capture, provenance, assessment, validation, rejection) of heterogeneous sets of distributed tools, services and resources; a global environment that is robust in the face of failures and outages; and flexible high-performance data stores (going beyond schema driven) that scale and are friendly to interactive analytics

Resource description and understanding:
Distributed methods and implementations that allow resources (people, software, computing incl. data) to publish varying state and function for use by diverse clients. Mechanisms to handle arbitrary entity types in a uniform and common framework – including complex types such as heterogeneous data, incomplete and evolving information, and rapidly changing availability of computing, storage and other computational resources. Abstract data streaming and file-based data movement over the WAN/LAN and on exascale architectures to allow for real-time, collaborative decision making for scientific processes.

M0167 Gaps:
Data volumes are increasing. Shipping disks is clumsy, but there is no other obvious solution. Image processing algorithms are still a very active research area.

M0170 Gaps:
Development of machine learning tools for data exploration, and in particular for an automated, real-time classification of transient events, given the data sparsity and heterogeneity. Effective visualization of hyper-dimensional parameter spaces is a major challenge for all of us.

M0171 Gaps:
Analytics needs continued monitoring and improvement.

M0172 Gaps:
Computation of the simulation is both compute intensive and data intensive. Moreover, due to the unstructured and irregular nature of graph processing, the problem is not easily decomposable; it is therefore also bandwidth intensive. Hence, a supercomputer is more suitable than cloud-type clusters.

M0173 Gaps:
How can the heterogeneous features of hundreds of millions or billions of individuals, and models of cultural variations across countries, be assigned to individual agents? How can these large models be validated? Different types of models are needed (e.g., multiple contagions): disease, emotions, behaviors. Modeling of different urban infrastructure systems in which humans act is also required. With multiple replicates required to assess stochasticity, large amounts of output data are produced, increasing storage requirements.

M0174 Gaps:
Data is abundant in many areas of medicine. The key issue is that there can be too much data (images, genetic sequences, etc.), which can complicate the analysis. The real challenge lies in aligning the data and merging it from multiple sources into a form that is useful for a combined analysis. The other issue is that sometimes a large amount of data is available about a single subject, while the number of subjects themselves is not very high (i.e., data imbalance). This can result in learning algorithms picking up random correlations between the multiple data types as important features in the analysis. Hence, robust learning methods that can faithfully model the data are of paramount importance. Another aspect of data imbalance is the occurrence of positive examples (i.e., cases). The incidence of certain diseases may be rare, making the ratio of cases to controls extremely skewed and making it possible for the learning algorithms to model noise instead of the examples.

M0176 Gaps:
HTC at scale for simulation science. Flexible data methods at scale for messy data. Machine learning and knowledge systems that integrate data from publications, experiments, and simulations to advance goal-driven thinking in materials design.

M0182 Gaps:
None.

M0185 Gaps:
Storage, sharing, and analysis of 10s of PBs of observational and simulated data.

M0188 Gaps:
The best tool for dealing with the heterogeneity of biological data is still the relational database management system (RDBMS). Unfortunately, it does not scale to the current volume of data. NoSQL solutions aim to provide an alternative, but they do not always lend themselves to real-time interactive use or rapid, parallel bulk loading, and they sometimes have issues regarding robustness. Our current approach is ad hoc and custom, relying mainly on a Linux cluster and the file system to supplement the Oracle RDBMS. The custom solutions often rely on knowledge of the peculiarities of the data, allowing us to devise horizontal partitioning schemes as well as inversions of data organization when applicable.

M0209 Gaps:
New statistical techniques for understanding the limitations in simulation data would be beneficial. Often there is not enough computing time to generate all the simulations one wants, so there is a reliance on emulators to bridge the gaps. Techniques are needed for handling Cholesky decomposition for thousands of simulations with matrices on the order of 1M on a side.

M0210 Gaps:
Data movement and bookkeeping (file and event level meta-data).

M0215 Gaps:
1. Big (or even moderate-size) data over tactical networks. 2. Data currently exists in disparate silos, which must be made accessible through a semantically integrated data space. 3. Most critical data is either unstructured or imagery/video, which requires significant processing to extract entities and information.

M0219 Gaps:
Improving recommendation systems that reduce costs and improve quality while providing confidentiality safeguards that are reliable and publicly auditable.


6. needs to support legacy and advanced programming executables, applications, tools, utilities, and libraries
(13: M0078, M0089, M0140, M0164, M0166, M0167, M0174, M0176, M0184, M0185, M0190, M0214, M0215)
M0078 Gaps:
Processing data requires significant computing power, which poses challenges especially to clinical laboratories as they are starting to perform large-scale sequencing. Long-term storage of clinical sequencing data could be expensive. Analysis methods are quickly evolving. Many parts of the genome are challenging to analyze, and systematic errors are difficult to characterize.

M0089 Gaps:
Extremely large size; multi-dimensional; disease-specific analytics; correlation with other data types (clinical data, -omic data).

M0140 Gaps:
For an individualized cohort, we will effectively be building a data mart for each patient, since the critical properties and indices will be specific to each patient. Due to the number of patients, this becomes an impractical approach. Fundamentally, the paradigm changes from relational row-column lookup to semantic graph traversal.
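
To make the contrast above concrete, the stdlib-only sketch below performs the same retrieval both ways: a fixed row-column lookup versus a traversal over (subject, predicate, object) edges. The identifiers and relations are hypothetical.

    # relational style: one wide row per patient, fixed columns
    rows = {"patient42": {"age": 67, "dx": "CHF", "ef": 0.35}}
    print(rows["patient42"]["ef"])          # row-column lookup

    # graph style: (subject, predicate, object) edges, traversed per patient
    edges = [
        ("patient42", "has_dx", "CHF"),
        ("CHF", "monitored_by", "echo"),
        ("echo", "yields", "ejection_fraction"),
    ]

    def neighbors(node):
        return [(p, o) for s, p, o in edges if s == node]

    # walk outward from the patient instead of indexing fixed columns
    frontier = ["patient42"]
    while frontier:
        node = frontier.pop()
        for pred, obj in neighbors(node):
            print(node, "-", pred, "->", obj)
            frontier.append(obj)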

M0164 Gaps:
Analytics needs continued monitoring and improvement.

M0166 Gaps:
The translation of scientific results into new knowledge, solutions, policies and decisions is foundational to the science mission associated with LHC data analysis and HEP in general. However, while advances in experimental and computational technologies have led to an exponential growth in the volume, velocity, and variety of data available for scientific discovery, advances in technologies to convert this data into actionable knowledge have fallen far short of what the HEP community needs to deliver timely and immediately impactful outcomes. Acceleration of the scientific knowledge discovery process is essential if DOE scientists are to continue making major contributions in HEP. Today's worldwide analysis engine, serving several thousand scientists, will have to be commensurately extended in the cleverness of its algorithms, the automation of the processes, and the reach (discovery) of the computing to enable scientific understanding of the detailed nature of the Higgs boson. For example, the approximately forty different analysis methods used to investigate the detailed characteristics of the Higgs boson (many using machine learning techniques) must be combined in a mathematically rigorous fashion to produce an agreed-upon publishable result.

Specific challenges:
Federated semantic discovery: Interfaces, protocols and environments that support access to, use of, and interoperation across federated sets of resources governed and managed by a mix of different policies and controls that interoperate across streaming and “at rest” data sources. These include: models, algorithms, libraries, and reference implementations for a distributed non-hierarchical discovery service; semantics, methods, interfaces for life-cycle management (subscription, capture, provenance, assessment, validation, rejection) of heterogeneous sets of distributed tools, services and resources; a global environment that is robust in the face of failures and outages; and flexible high-performance data stores (going beyond schema-driven) that scale and are friendly to interactive analytics.

Resource description and understanding:
Distributed methods and implementations that allow resources (people, software, computing incl. data) to publish varying state and function for use by diverse clients. Mechanisms to handle arbitrary entity types in a uniform and common framework – including complex types such as heterogeneous data, incomplete and evolving information, and rapidly changing availability of computing, storage and other computational resources. Abstract data streaming and file-based data movement over the WAN/LAN and on exascale architectures to allow for real-time, collaborative decision making for scientific processes.

M0167 Gaps:
Data volumes are increasing. Shipping disks is clumsy, but there is no other obvious solution. Image processing algorithms remain an area of very active research.

M0174 Gaps:
Data is abundant in many areas of medicine. One key issue is that there can be too much data per subject (e.g., images, genetic sequences), which can complicate the analysis. The real challenge lies in aligning the data and merging it from multiple sources into a form useful for a combined analysis. Another issue is that a large amount of data may be available about a single subject while the number of subjects themselves is not very high (i.e., data imbalance). This can result in learning algorithms picking up random correlations between the multiple data types as important features in the analysis. Hence, robust learning methods that can faithfully model the data are of paramount importance. Another aspect of data imbalance is the scarcity of positive examples (i.e., cases): the incidence of certain diseases may be rare, making the ratio of cases to controls extremely skewed and making it possible for learning algorithms to model noise instead of the cases.

M0176 Gaps:
HTC at scale for simulation science. Flexible data methods at scale for messy data. Machine learning and knowledge systems that integrate data from publications, experiments, and simulations to advance goal-driven thinking in materials design.

M0184 Gaps:
Translation across diverse datasets that cross domains and scales.

M0185 Gaps:
Storage, sharing, and analysis of tens of PBs of observational and simulated data.

M0190 Gaps:
Scaling ground-truthing to larger data, intrinsic and annotation uncertainty measurement, performance measurement for incompletely annotated data, measuring analytic performance for heterogeneous data and analytic flows involving users.

M0214 Gaps:
Processing the volume of data in near-real time (NRT) to support alerting and situational awareness.

M0215 Gaps:
1. Moving big (or even moderate-size) data over tactical networks. 2. Data currently exists in disparate silos and must be made accessible through a semantically integrated data space. 3. Most critical data is either unstructured or imagery/video, which requires significant processing to extract entities and information.


Data Consumer
General Requirement
1. needs to support fast search (~0.1 seconds) of processed data with high relevancy, accuracy, and recall
(4: M0148, M0160, M0165, M0176)
M0148 Gaps:
Perform pre-processing and long-term management of large and varied data. Search huge amounts of data. Ensure high relevancy and recall. Data sources may be distributed across different clouds in the future.
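
The core structure behind sub-second search over pre-processed data is an inverted index from term to document ids; the stdlib-only sketch below builds one and checks recall against known relevant documents. Documents, queries, and the relevance judgment are hypothetical.

    from collections import defaultdict

    docs = {
        1: "sensor drift in ocean buoy data",
        2: "buoy telemetry archive format",
        3: "drift correction for satellite imagery",
    }

    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.split():
            index[term].add(doc_id)

    def search(query):
        # intersect posting lists; lookup is O(1) per term, not a scan
        sets = [index[t] for t in query.split()]
        return set.intersection(*sets) if sets else set()

    hits = search("buoy drift")
    relevant = {1}                       # ground truth for this query
    recall = len(hits & relevant) / len(relevant)
    print(hits, "recall:", recall)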

M0160 Gaps:
Dealing with real-time analysis of large volumes of data. Providing a scalable infrastructure that can allocate resources, storage space, etc. on demand as data volume increases over time.

M0165 Gaps:
Search of the “deep web” (information behind query front ends). Ranking of responses sensitive to intrinsic value (as in PageRank) as well as advertising value. Linking to user profiles and social network data.
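
Since the gap names PageRank as the model of intrinsic-value ranking, a compact power-iteration sketch is shown below; the link graph is hypothetical.

    import numpy as np

    links = {0: [1, 2], 1: [2], 2: [0], 3: [2]}   # page -> outgoing links
    n, d = 4, 0.85                                 # pages, damping factor

    # column-stochastic transition matrix
    M = np.zeros((n, n))
    for src, outs in links.items():
        for dst in outs:
            M[dst, src] = 1.0 / len(outs)

    r = np.full(n, 1.0 / n)
    for _ in range(100):
        r = (1 - d) / n + d * M @ r   # power iteration on the damped chain

    print(r / r.sum())   # page 2 ranks highest: everything points to it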

M0176 Gaps:
HTC at scale for simulation science. Flexible data methods at scale for messy data. Machine learning and knowledge systems that integrate data from publications, experiments, and simulations to advance goal-driven thinking in materials design.


2. needs to support diversified output file formats for visualization, rendering, and reporting
(13: M0078, M0089, M0090, M0157, M0161, M0164, M0165, M0166, M0167, M0174, M0177, M0213, M0214)
M0078 Gaps:
Processing data requires significant computing power, which poses challenges especially to clinical laboratories as they are starting to perform large-scale sequencing. Long-term storage of clinical sequencing data could be expensive. Analysis methods are quickly evolving. Many parts of the genome are challenging to analyze, and systematic errors are difficult to characterize.

M0089 Gaps:
Extremely large size; multi-dimensional; disease-specific analytics; correlation with other data types (clinical data, -omic data).

M0090 Gaps:
Semantics (interpretation of multiple reanalysis products); data movement; database(s) with optimal structuring for 4-dimensional data mining.

M0157 Gaps:
1. Real-time handling of extremely high volumes of data. 2. Data staging to mirror archives. 3. Integrated data access and discovery. 4. Data processing and analysis.

M0161 Gaps:
The database contains ~400M documents, roughly 80M unique, and receives 5-700k new uploads on a weekday. Thus a major challenge is clustering matching documents together in a computationally efficient way (scalable and parallelized) when they are uploaded from different sources and have been slightly modified via third-party annotation tools or publisher watermarks and cover pages.
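
One standard, parallelizable approach to matching slightly modified uploads is shingling plus MinHash, whose signature match rate approximates Jaccard similarity; the stdlib-only sketch below illustrates it on two hypothetical texts. It is a sketch of the general technique, not the system's actual pipeline.

    import hashlib

    def shingles(text, k=5):
        text = " ".join(text.lower().split())
        return {text[i:i + k] for i in range(len(text) - k + 1)}

    def minhash(sh, n_hashes=64):
        # one minimum per seeded hash; equal minima across docs estimate Jaccard
        sig = []
        for seed in range(n_hashes):
            sig.append(min(
                int.from_bytes(hashlib.sha1(f"{seed}:{s}".encode()).digest()[:8], "big")
                for s in sh))
        return sig

    a = "Deep learning for protein structure prediction. (c) Publisher 2013"
    b = "Deep learning for protein structure prediction."   # watermark stripped

    sa, sb = minhash(shingles(a)), minhash(shingles(b))
    similarity = sum(x == y for x, y in zip(sa, sb)) / len(sa)
    print("estimated Jaccard:", similarity)   # high despite the modification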

M0164 Gaps:
Analytics needs continued monitoring and improvement.

M0165 Gaps:
Search of the “deep web” (information behind query front ends). Ranking of responses sensitive to intrinsic value (as in PageRank) as well as advertising value. Linking to user profiles and social network data.

M0166 Gaps:
The translation of scientific results into new knowledge, solutions, policies and decisions is foundational to the science mission associated with LHC data analysis and HEP in general. However, while advances in experimental and computational technologies have led to an exponential growth in the volume, velocity, and variety of data available for scientific discovery, advances in technologies to convert this data into actionable knowledge have fallen far short of what the HEP community needs to deliver timely and immediately impactful outcomes. Acceleration of the scientific knowledge discovery process is essential if DOE scientists are to continue making major contributions in HEP. Today's worldwide analysis engine, serving several thousand scientists, will have to be commensurately extended in the cleverness of its algorithms, the automation of the processes, and the reach (discovery) of the computing to enable scientific understanding of the detailed nature of the Higgs boson. For example, the approximately forty different analysis methods used to investigate the detailed characteristics of the Higgs boson (many using machine learning techniques) must be combined in a mathematically rigorous fashion to produce an agreed-upon publishable result.

Specific challenges:
Federated semantic discovery: Interfaces, protocols and environments that support access to, use of, and interoperation across federated sets of resources governed and managed by a mix of different policies and controls that interoperate across streaming and “at rest” data sources. These include: models, algorithms, libraries, and reference implementations for a distributed non-hierarchical discovery service; semantics, methods, interfaces for life-cycle management (subscription, capture, provenance, assessment, validation, rejection) of heterogeneous sets of distributed tools, services and resources; a global environment that is robust in the face of failures and outages; and flexible high-performance data stores (going beyond schema-driven) that scale and are friendly to interactive analytics.

Resource description and understanding:
Distributed methods and implementations that allow resources (people, software, computing incl. data) to publish varying state and function for use by diverse clients. Mechanisms to handle arbitrary entity types in a uniform and common framework – including complex types such as heterogeneous data, incomplete and evolving information, and rapidly changing availability of computing, storage and other computational resources. Abstract data streaming and file-based data movement over the WAN/LAN and on exascale architectures to allow for real-time, collaborative decision making for scientific processes.

M0167 Gaps:
Data volumes are increasing. Shipping disks is clumsy, but there is no other obvious solution. Image processing algorithms remain an area of very active research.

M0174 Gaps:
Data is abundant in many areas of medicine. One key issue is that there can be too much data per subject (e.g., images, genetic sequences), which can complicate the analysis. The real challenge lies in aligning the data and merging it from multiple sources into a form useful for a combined analysis. Another issue is that a large amount of data may be available about a single subject while the number of subjects themselves is not very high (i.e., data imbalance). This can result in learning algorithms picking up random correlations between the multiple data types as important features in the analysis. Hence, robust learning methods that can faithfully model the data are of paramount importance. Another aspect of data imbalance is the scarcity of positive examples (i.e., cases): the incidence of certain diseases may be rare, making the ratio of cases to controls extremely skewed and making it possible for learning algorithms to model noise instead of the cases.

M0177 Gaps:
Overcoming the systematic errors and bias in large-scale, heterogeneous clinical data to support decision-making in research, patient care, and administrative use cases requires complex multistage processing and analytics that demand substantial computing power. Further, the optimal techniques for accurately and effectively deriving knowledge from observational clinical data are nascent.

M0213 Gaps:
Indexing, retrieval, and distributed analysis. Visualization generation and transmission.

M0214 Gaps:
Processing the volume of data in near-real time (NRT) to support alerting and situational awareness.


3. needs to support visual layout for results presentation
(2: M0165, M0167)
M0165 Gaps:
Search of the “deep web” (information behind query front ends). Ranking of responses sensitive to intrinsic value (as in PageRank) as well as advertising value. Linking to user profiles and social network data.

M0167 Gaps:
Data volumes are increasing. Shipping disks is clumsy, but there is no other obvious solution. Image processing algorithms remain an area of very active research.


4. needs to support rich user interfaces for access using browsers and visualization tools
(10: M0089, M0127, M0157, M0160, M0162, M0167, M0183, M0184, M0188, M0190)
M0089 Gaps:
Extremely large size; multi-dimensional; disease-specific analytics; correlation with other data types (clinical data, -omic data).

M0127 Gaps:
The data processing pipeline requires human inspection and intervention. Downstream data pipelines for custom users are limited. Cloud architectures for distributing entire data product collections to downstream consumers should be investigated and adopted.

M0157 Gaps:
1. Real-time handling of extremely high volumes of data. 2. Data staging to mirror archives. 3. Integrated data access and discovery. 4. Data processing and analysis.

M0160 Gaps:
Dealing with real-time analysis of large volumes of data. Providing a scalable infrastructure that can allocate resources, storage space, etc. on demand as data volume increases over time.

M0162 Gaps:
1. Establishing materials data repositories beyond the existing ones, which focus on fundamental data. 2. Developing internationally accepted data recording standards that can be used by a very diverse materials community, including developers of materials test standards (such as ASTM and ISO), testing companies, materials producers, and R&D labs. 3. Tools and procedures to help organizations wishing to deposit proprietary materials data in repositories to mask proprietary information, yet maintain the usability of the data. 4. Multi-variable materials data visualization tools, in which the number of variables can be quite high.

M0167 Gaps:
Data volumes are increasing. Shipping disks is clumsy, but there is no other obvious solution. Image processing algorithms remain an area of very active research.

M0183 Gaps:
Translation across diverse and large datasets that cross domains and scales.

M0184 Gaps:
Translation across diverse datasets that cross domains and scales.

M0188 Gaps:
The workhorse for dealing with the heterogeneity of biological data is still the relational database management system (RDBMS). Unfortunately, it does not scale to the current volume of data. NoSQL solutions aim to provide an alternative, but they do not always lend themselves to real-time interactive use or rapid, parallel bulk loading, and they sometimes have issues regarding robustness. Our approach is currently ad hoc and custom, relying mainly on the Linux cluster and the file system to supplement the Oracle RDBMS. The custom solution often relies on knowledge of the peculiarities of the data, allowing us to devise horizontal partitioning schemes as well as inversion of data organization when applicable.

M0190 Gaps:
Scaling ground-truthing to larger data, intrinsic and annotation uncertainty measurement, performance measurement for incompletely annotated data, measuring analytic performance for heterogeneous data and analytic flows involving users.


5. needs to support high-resolution, multi-dimensional layers of data visualization
(20: M0129, M0155, M0158, M0161, M0162, M0171, M0172, M0173, M0177, M0179, M0182, M0185, M0186, M0188, M0191, M0213, M0214, M0215, M0219, M0222)
M0129 Gaps:
A big question is how to use cloud computing to enable better use of climate science's earthbound compute and data resources. Cloud computing provides a new tier in the data services stack: a cloud-based layer where agile customization occurs and enterprise-level products are transformed to meet the specialized requirements of applications and consumers. It helps close the gap between the world of traditional high-performance computing, which, at least for now, resides in a finely tuned climate modeling environment at the enterprise level, and our new customers, whose expectations and manner of work are increasingly influenced by the smart-mobility megatrend.

M0155 Gaps:
High throughput of data for reduction into higher levels. Discovery of meaningful insights from low-value-density data needs new approaches to deep, complex analysis (e.g., using machine learning, statistical modelling, graph algorithms) that go beyond traditional approaches in space physics.

M0158 Gaps:
Parallel algorithms are necessary to analyze massive networks. Unlike much structured data, network data is difficult to partition. The main difficulty in partitioning a network is that different algorithms require different partitioning schemes for efficient operation. Moreover, most network measures are global in nature and require either (i) large amounts of duplicated data in the partitions or (ii) very large communication overhead resulting from the required movement of data. These issues become significant challenges for big networks. Computing dynamics over networks is harder, since the network structure often interacts with the dynamical process being studied. CINET enables a large class of operations across a wide variety of graphs, in terms of both structure and size. Unlike other compute- and data-intensive systems, such as parallel databases or CFD, performance on graph computation is sensitive to the underlying architecture. Hence, a unique challenge in CINET is managing the mapping between a workload (graph type + operation) and a machine whose architecture and runtime are conducive to it.
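
To make the partitioning trade-off above concrete, the stdlib-only sketch below counts cut edges, a simple proxy for cross-partition communication, for two ways of splitting the same hypothetical graph.

    edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2), (4, 5), (5, 6), (6, 4)]

    def cut_size(assign):
        # edges whose endpoints land in different partitions need messages
        return sum(assign[u] != assign[v] for u, v in edges)

    # partition A: split along the two natural communities
    part_a = {0: 0, 1: 0, 2: 0, 3: 0, 4: 1, 5: 1, 6: 1}
    # partition B: split by even/odd node id, ignoring structure
    part_b = {v: v % 2 for v in range(7)}

    print("structure-aware cut:", cut_size(part_a))   # 0 cut edges
    print("naive cut:", cut_size(part_b))             # 6 cut edges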

M0161 Gaps:
The database contains ~400M documents, roughly 80M unique, and receives 5-700k new uploads on a weekday. Thus a major challenge is clustering matching documents together in a computationally efficient way (scalable and parallelized) when they are uploaded from different sources and have been slightly modified via third-party annotation tools or publisher watermarks and cover pages.

M0162 Gaps:
1. Establishing materials data repositories beyond the existing ones, which focus on fundamental data. 2. Developing internationally accepted data recording standards that can be used by a very diverse materials community, including developers of materials test standards (such as ASTM and ISO), testing companies, materials producers, and R&D labs. 3. Tools and procedures to help organizations wishing to deposit proprietary materials data in repositories to mask proprietary information, yet maintain the usability of the data. 4. Multi-variable materials data visualization tools, in which the number of variables can be quite high.

M0171 Gaps:
Analytics needs continued monitoring and improvement.

M0172 Gaps:
Computation of the simulation is both compute intensive and data intensive. Moreover, due to the unstructured and irregular nature of graph processing, the problem is not easily decomposable, and it is also bandwidth intensive. Hence, a supercomputer is more suitable than cloud-type clusters.

M0173 Gaps:
How can the heterogeneous features of hundreds of millions or billions of individuals be taken into account, along with models of cultural variations across countries that are assigned to individual agents? How can these large models be validated? Different types of models (e.g., multiple contagions) cover disease, emotions, and behaviors, together with modeling of the different urban infrastructure systems in which humans act. With multiple replicates required to assess stochasticity, large amounts of output data are produced, which raises storage requirements.
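
A minimal agent-based contagion sketch of the kind described above is shown below: agents on a contact network, per-step stochastic transmission, and multiple replicates to expose stochastic variation. The network and all parameters are hypothetical.

    import random

    # contact network: node -> neighbors
    contacts = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3]}

    def run(p_transmit=0.3, t_recover=3, steps=20, seed=0):
        rng = random.Random(seed)
        state = {v: "S" for v in contacts}      # states: S, I, R
        state[0] = "I"
        infected_for = {0: 0}
        for _ in range(steps):
            newly = []
            for v in list(contacts):
                if state[v] != "I":
                    continue
                infected_for[v] += 1
                for u in contacts[v]:
                    if state[u] == "S" and rng.random() < p_transmit:
                        newly.append(u)
                if infected_for[v] >= t_recover:
                    state[v] = "R"
            for u in newly:
                if state[u] == "S":
                    state[u] = "I"
                    infected_for[u] = 0
        return sum(s != "S" for s in state.values())   # attack size

    # replicates to assess stochasticity; this is where output volume grows
    print([run(seed=s) for s in range(10)])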

M0177 Gaps:
Overcoming the systematic errors and bias in large-scale, heterogeneous clinical data to support decision-making in research, patient care, and administrative use cases requires complex multistage processing and analytics that demand substantial computing power. Further, the optimal techniques for accurately and effectively deriving knowledge from observational clinical data are nascent.

M0179 Gaps:

M0182 Gaps:
None.

M0185 Gaps:
Storage, sharing, and analysis of tens of PBs of observational and simulated data.

M0186 Gaps:
The rapidly growing size of datasets makes scientific analysis a challenge. The need to write data from simulations is outpacing supercomputers’ ability to accommodate this need.

M0188 Gaps:
The workhorse for dealing with the heterogeneity of biological data is still the relational database management system (RDBMS). Unfortunately, it does not scale to the current volume of data. NoSQL solutions aim to provide an alternative, but they do not always lend themselves to real-time interactive use or rapid, parallel bulk loading, and they sometimes have issues regarding robustness. Our approach is currently ad hoc and custom, relying mainly on the Linux cluster and the file system to supplement the Oracle RDBMS. The custom solution often relies on knowledge of the peculiarities of the data, allowing us to devise horizontal partitioning schemes as well as inversion of data organization when applicable.

M0191 Gaps:
HTC at scale for simulation science. Flexible data methods at scale for messy data. Machine learning and knowledge systems that drive pixel based data toward biological objects and models.

M0213 Gaps:
Indexing, retrieval, and distributed analysis. Visualization generation and transmission.

M0214 Gaps:
Processing the volume of data in near-real time (NRT) to support alerting and situational awareness.

M0215 Gaps:
1. Moving big (or even moderate-size) data over tactical networks. 2. Data currently exists in disparate silos and must be made accessible through a semantically integrated data space. 3. Most critical data is either unstructured or imagery/video, which requires significant processing to extract entities and information.

M0219 Gaps:
Improving recommendation systems that reduce costs and improve quality while providing confidentiality safeguards that are reliable and publicly auditable.

M0222 Gaps:
Improving analytic and modeling systems that provide reliable and robust statistical estimates using data from multiple sources, are scientifically transparent, and provide confidentiality safeguards that are reliable and publicly auditable.
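
One well-studied family of auditable confidentiality safeguards of the kind named above is differential privacy; the sketch below shows the Laplace mechanism for a counting query. The counts and epsilon are illustrative only, and this is a sketch of the general technique, not the agency's actual method.

    import math
    import random

    def laplace_count(true_count, epsilon, sensitivity=1.0, rng=None):
        # a counting query changes by at most `sensitivity` when one
        # individual is added or removed, so Laplace noise with scale
        # sensitivity/epsilon yields epsilon-differential privacy
        rng = rng or random.Random(0)
        scale = sensitivity / epsilon
        u = rng.random() - 0.5                  # uniform on [-0.5, 0.5)
        noise = -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))
        return true_count + noise

    print(laplace_count(true_count=1234, epsilon=0.5))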


6. needs to support streaming results to clients
(1: M0164)
M0164 Gaps:
Analytics needs continued monitoring and improvement.


Security & Privacy
General Requirement
1. needs to support the protection and preservation of security and privacy for sensitive data
(30: M0078, M0089, M0103, M0140, M0141, M0147, M0148, M0157, M0160, M0162, M0164, M0165, M0166, M0167, M0171, M0172, M0173, M0174, M0176, M0177, M0190, M0191, M0210, M0211, M0213, M0214, M0215, M0219, M0222, M0223)
M0078 Gaps:
Processing data requires significant computing power, which poses challenges especially to clinical laboratories as they are starting to perform large-scale sequencing. Long-term storage of clinical sequencing data could be expensive. Analysis methods are quickly evolving. Many parts of the genome are challenging to analyze, and systematic errors are difficult to characterize.

M0089 Gaps:
Extremely large size; multi-dimensional; disease-specific analytics; correlation with other data types (clinical data, -omic data).

M0103 Gaps:
Provide more rapid assessment of the identity, location, and condition of shipments; provide detailed analytics and locate problems in the system in real time.

M0140 Gaps:
For an individualized cohort, we will effectively be building a data mart for each patient, since the critical properties and indices will be specific to each patient. Due to the number of patients, this becomes an impractical approach. Fundamentally, the paradigm changes from relational row-column lookup to semantic graph traversal.

M0141 Gaps:
Variety of multi-type data: SQL and NoSQL, distributed multi-source data. Visualisation of distributed sensor networks. Data storage and archiving; data exchange and integration; data linkage from the initial observation data to processed data and reported/visualised data.

  • Historical unique data
  • Curated (authorized) reference data (i.e. species names lists), algorithms, software code, workflows
  • Processed (secondary) data serving as input for other researchers
  • Provenance (and persistent identification (PID)) control of data, algorithms, and workflows

M0147 Gaps:
Preserve data for a long time scale.

M0148 Gaps:
Perform pre-processing and long-term management of large and varied data. Search huge amounts of data. Ensure high relevancy and recall. Data sources may be distributed across different clouds in the future.

M0157 Gaps:
1. Real-time handling of extremely high volumes of data. 2. Data staging to mirror archives. 3. Integrated data access and discovery. 4. Data processing and analysis.

M0160 Gaps:
Dealing with real-time analysis of large volumes of data. Providing a scalable infrastructure that can allocate resources, storage space, etc. on demand as data volume increases over time.

M0162 Gaps:
1. Establishing materials data repositories beyond the existing ones, which focus on fundamental data. 2. Developing internationally accepted data recording standards that can be used by a very diverse materials community, including developers of materials test standards (such as ASTM and ISO), testing companies, materials producers, and R&D labs. 3. Tools and procedures to help organizations wishing to deposit proprietary materials data in repositories to mask proprietary information, yet maintain the usability of the data. 4. Multi-variable materials data visualization tools, in which the number of variables can be quite high.

M0164 Gaps:
Analytics needs continued monitoring and improvement.

M0165 Gaps:
Search of the “deep web” (information behind query front ends). Ranking of responses sensitive to intrinsic value (as in PageRank) as well as advertising value. Linking to user profiles and social network data.

M0166 Gaps:
The translation of scientific results into new knowledge, solutions, policies and decisions is foundational to the science mission associated with LHC data analysis and HEP in general. However, while advances in experimental and computational technologies have led to an exponential growth in the volume, velocity, and variety of data available for scientific discovery, advances in technologies to convert this data into actionable knowledge have fallen far short of what the HEP community needs to deliver timely and immediately impactful outcomes. Acceleration of the scientific knowledge discovery process is essential if DOE scientists are to continue making major contributions in HEP. Today's worldwide analysis engine, serving several thousand scientists, will have to be commensurately extended in the cleverness of its algorithms, the automation of the processes, and the reach (discovery) of the computing to enable scientific understanding of the detailed nature of the Higgs boson. For example, the approximately forty different analysis methods used to investigate the detailed characteristics of the Higgs boson (many using machine learning techniques) must be combined in a mathematically rigorous fashion to produce an agreed-upon publishable result.

Specific challenges:
Federated semantic discovery: Interfaces, protocols and environments that support access to, use of, and interoperation across federated sets of resources governed and managed by a mix of different policies and controls that interoperate across streaming and “at rest” data sources. These include: models, algorithms, libraries, and reference implementations for a distributed non-hierarchical discovery service; semantics, methods, interfaces for life-cycle management (subscription, capture, provenance, assessment, validation, rejection) of heterogeneous sets of distributed tools, services and resources; a global environment that is robust in the face of failures and outages; and flexible high-performance data stores (going beyond schema-driven) that scale and are friendly to interactive analytics.

Resource description and understanding:
Distributed methods and implementations that allow resources (people, software, computing incl. data) to publish varying state and function for use by diverse clients. Mechanisms to handle arbitrary entity types in a uniform and common framework – including complex types such as heterogeneous data, incomplete and evolving information, and rapidly changing availability of computing, storage and other computational resources. Abstract data streaming and file-based data movement over the WAN/LAN and on exascale architectures to allow for real-time, collaborative decision making for scientific processes.

M0167 Gaps:
Data volumes are increasing. Shipping disks is clumsy, but there is no other obvious solution. Image processing algorithms remain an area of very active research.

M0171 Gaps:
Analytics needs continued monitoring and improvement.

M0172 Gaps:
Computation of the simulation is both compute intensive and data intensive. Moreover, due to the unstructured and irregular nature of graph processing, the problem is not easily decomposable, and it is also bandwidth intensive. Hence, a supercomputer is more suitable than cloud-type clusters.

M0173 Gaps:
How can the heterogeneous features of hundreds of millions or billions of individuals be taken into account, along with models of cultural variations across countries that are assigned to individual agents? How can these large models be validated? Different types of models (e.g., multiple contagions) cover disease, emotions, and behaviors, together with modeling of the different urban infrastructure systems in which humans act. With multiple replicates required to assess stochasticity, large amounts of output data are produced, which raises storage requirements.

M0174 Gaps:
Data is abundant in many areas of medicine. One key issue is that there can be too much data per subject (e.g., images, genetic sequences), which can complicate the analysis. The real challenge lies in aligning the data and merging it from multiple sources into a form useful for a combined analysis. Another issue is that a large amount of data may be available about a single subject while the number of subjects themselves is not very high (i.e., data imbalance). This can result in learning algorithms picking up random correlations between the multiple data types as important features in the analysis. Hence, robust learning methods that can faithfully model the data are of paramount importance. Another aspect of data imbalance is the scarcity of positive examples (i.e., cases): the incidence of certain diseases may be rare, making the ratio of cases to controls extremely skewed and making it possible for learning algorithms to model noise instead of the cases.

M0176 Gaps:
HTC at scale for simulation science. Flexible data methods at scale for messy data. Machine learning and knowledge systems that integrate data from publications, experiments, and simulations to advance goal-driven thinking in materials design.

M0177 Gaps:
Overcoming the systematic errors and bias in large-scale, heterogeneous clinical data to support decision-making in research, patient care, and administrative use cases requires complex multistage processing and analytics that demand substantial computing power. Further, the optimal techniques for accurately and effectively deriving knowledge from observational clinical data are nascent.

M0190 Gaps:
Scaling ground-truthing to larger data, intrinsic and annotation uncertainty measurement, performance measurement for incompletely annotated data, measuring analytic performance for heterogeneous data and analytic flows involving users.

M0191 Gaps:
HTC at scale for simulation science. Flexible data methods at scale for messy data. Machine learning and knowledge systems that drive pixel based data toward biological objects and models.

M0210 Gaps:
Data movement and bookkeeping (file and event level meta-data).

M0211 Gaps:
Data management (metadata, provenance information, data identification with PIDs); data curation; digitising existing audio-video, photo, and document archives.

M0213 Gaps:
Indexing, retrieval, and distributed analysis. Visualization generation and transmission.

M0214 Gaps:
Processing the volume of data in near-real time (NRT) to support alerting and situational awareness.

M0215 Gaps:
1. Moving big (or even moderate-size) data over tactical networks. 2. Data currently exists in disparate silos and must be made accessible through a semantically integrated data space. 3. Most critical data is either unstructured or imagery/video, which requires significant processing to extract entities and information.

M0219 Gaps:
Improving recommendation systems that reduce costs and improve quality while providing confidentiality safeguards that are reliable and publicly auditable.

M0222 Gaps:
Improving analytic and modeling systems that provide reliable and robust statistical estimates using data from multiple sources, are scientifically transparent, and provide confidentiality safeguards that are reliable and publicly auditable.

M0223 Gaps:
Scalable real-time analytics over large data streams. Low-latency analytics for operational needs. Federated analytics at the utility and microgrid levels. Robust time-series analytics over millions of customers' consumption data. Customer behavior modeling and targeted curtailment requests.
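
To illustrate the low-latency, streaming style of analytics named above, the stdlib-only sketch below runs a rolling z-score over a single meter's consumption readings and raises an alert on anomalous spikes. The window, threshold, and readings are synthetic assumptions.

    from collections import deque
    import math

    class SpikeDetector:
        def __init__(self, window=24, threshold=3.0):
            self.buf = deque(maxlen=window)
            self.threshold = threshold

        def update(self, kwh):
            # compare the new reading against the rolling window, then append
            alert = False
            if len(self.buf) == self.buf.maxlen:
                mean = sum(self.buf) / len(self.buf)
                var = sum((x - mean) ** 2 for x in self.buf) / len(self.buf)
                std = math.sqrt(var) or 1e-9
                alert = abs(kwh - mean) / std > self.threshold
            self.buf.append(kwh)
            return alert

    det = SpikeDetector()
    readings = [1.0 + 0.05 * (i % 3) for i in range(48)] + [9.0]
    print([i for i, r in enumerate(readings) if det.update(r)])   # flags the spike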


2. needs to support multi-level, policy-driven access control, sandboxing, and authentication on protected data
(13: M0006, M0078, M0089, M0103, M0140, M0161, M0165, M0167, M0176, M0177, M0188, M0210, M0211)
M0006 Gaps:

M0078 Gaps:
Processing data requires significant computing power, which poses challenges especially to clinical laboratories as they are starting to perform large-scale sequencing. Long-term storage of clinical sequencing data could be expensive. Analysis methods are quickly evolving. Many parts of the genome are challenging to analyze, and systematic errors are difficult to characterize.

M0089 Gaps:
Extremely large size; multi-dimensional; disease-specific analytics; correlation with other data types (clinical data, -omic data).

M0103 Gaps:
Provide more rapid assessment of the identity, location, and condition of shipments; provide detailed analytics and locate problems in the system in real time.

M0140 Gaps:
For an individualized cohort, we will effectively be building a data mart for each patient, since the critical properties and indices will be specific to each patient. Due to the number of patients, this becomes an impractical approach. Fundamentally, the paradigm changes from relational row-column lookup to semantic graph traversal.

M0161 Gaps:
The database contains ~400M documents, roughly 80M unique, and receives 5-700k new uploads on a weekday. Thus a major challenge is clustering matching documents together in a computationally efficient way (scalable and parallelized) when they are uploaded from different sources and have been slightly modified via third-party annotation tools or publisher watermarks and cover pages.

M0165 Gaps:
Search of the “deep web” (information behind query front ends). Ranking of responses sensitive to intrinsic value (as in PageRank) as well as advertising value. Linking to user profiles and social network data.

M0167 Gaps:
Data volumes are increasing. Shipping disks is clumsy, but there is no other obvious solution. Image processing algorithms remain an area of very active research.

M0176 Gaps:
HTC at scale for simulation science. Flexible data methods at scale for messy data. Machine learning and knowledge systems that integrate data from publications, experiments, and simulations to advance goal-driven thinking in materials design.

M0177 Gaps:
Overcoming the systematic errors and bias in large-scale, heterogeneous clinical data to support decision-making in research, patient care, and administrative use cases requires complex multistage processing and analytics that demand substantial computing power. Further, the optimal techniques for accurately and effectively deriving knowledge from observational clinical data are nascent.

M0188 Gaps:
The workhorse for dealing with the heterogeneity of biological data is still the relational database management system (RDBMS). Unfortunately, it does not scale to the current volume of data. NoSQL solutions aim to provide an alternative, but they do not always lend themselves to real-time interactive use or rapid, parallel bulk loading, and they sometimes have issues regarding robustness. Our approach is currently ad hoc and custom, relying mainly on the Linux cluster and the file system to supplement the Oracle RDBMS. The custom solution often relies on knowledge of the peculiarities of the data, allowing us to devise horizontal partitioning schemes as well as inversion of data organization when applicable.

M0210 Gaps:
Data movement and bookkeeping (file and event level meta-data).

M0211 Gaps:
Data management (metadata, provenance information, data identification with PIDs); data curation; digitising existing audio-video, photo, and document archives.


Lifecycle
General Requirement
1. needs to support data quality curation including pre-processing, data clustering, classification, reduction, and format transformation
(20: M0141, M0147, M0148, M0157, M0160, M0161, M0162, M0165, M0166, M0167, M0172, M0173, M0174, M0177, M0188, M0191, M0214, M0215, M0219, M0222)
M0141 Gaps:
Variety of multi-type data: SQL and NoSQL, distributed multi-source data. Visualisation of distributed sensor networks. Data storage and archiving; data exchange and integration; data linkage from the initial observation data to processed data and reported/visualised data.
  • Historical unique data
  • Curated (authorized) reference data (i.e. species names lists), algorithms, software code, workflows
  • Processed (secondary) data serving as input for other researchers
  • Provenance (and persistent identification (PID)) control of data, algorithms, and workflows

M0147 Gaps:
Preserve data for a long time scale.

M0148 Gaps:
Perform pre-processing and long-term management of large and varied data. Search huge amounts of data. Ensure high relevancy and recall. Data sources may be distributed across different clouds in the future.

M0157 Gaps:
1. Real-time handling of extremely high volumes of data. 2. Data staging to mirror archives. 3. Integrated data access and discovery. 4. Data processing and analysis.

M0160 Gaps:
Dealing with real-time analysis of large volumes of data. Providing a scalable infrastructure that can allocate resources, storage space, etc. on demand as data volume increases over time.

M0161 Gaps:
The database contains ~400M documents, roughly 80M unique, and receives 5-700k new uploads on a weekday. Thus a major challenge is clustering matching documents together in a computationally efficient way (scalable and parallelized) when they are uploaded from different sources and have been slightly modified via third-party annotation tools or publisher watermarks and cover pages.

M0162 Gaps:
1. Establishing materials data repositories beyond the existing ones, which focus on fundamental data. 2. Developing internationally accepted data recording standards that can be used by a very diverse materials community, including developers of materials test standards (such as ASTM and ISO), testing companies, materials producers, and R&D labs. 3. Tools and procedures to help organizations wishing to deposit proprietary materials data in repositories to mask proprietary information, yet maintain the usability of the data. 4. Multi-variable materials data visualization tools, in which the number of variables can be quite high.

M0165 Gaps:
Search of the “deep web” (information behind query front ends). Ranking of responses sensitive to intrinsic value (as in PageRank) as well as advertising value. Linking to user profiles and social network data.

M0166 Gaps:
The translation of scientific results into new knowledge, solutions, policies and decisions is foundational to the science mission associated with LHC data analysis and HEP in general. However, while advances in experimental and computational technologies have led to an exponential growth in the volume, velocity, and variety of data available for scientific discovery, advances in technologies to convert this data into actionable knowledge have fallen far short of what the HEP community needs to deliver timely and immediately impactful outcomes. Acceleration of the scientific knowledge discovery process is essential if DOE scientists are to continue making major contributions in HEP. Today's worldwide analysis engine, serving several thousand scientists, will have to be commensurately extended in the cleverness of its algorithms, the automation of the processes, and the reach (discovery) of the computing to enable scientific understanding of the detailed nature of the Higgs boson. For example, the approximately forty different analysis methods used to investigate the detailed characteristics of the Higgs boson (many using machine learning techniques) must be combined in a mathematically rigorous fashion to produce an agreed-upon publishable result.

Specific challenges:
Federated semantic discovery: Interfaces, protocols and environments that support access to, use of, and interoperation across federated sets of resources governed and managed by a mix of different policies and controls that interoperate across streaming and “at rest” data sources. These include: models, algorithms, libraries, and reference implementations for a distributed non-hierarchical discovery service; semantics, methods, interfaces for life-cycle management (subscription, capture, provenance, assessment, validation, rejection) of heterogeneous sets of distributed tools, services and resources; a global environment that is robust in the face of failures and outages; and flexible high-performance data stores (going beyond schema-driven) that scale and are friendly to interactive analytics.

Resource description and understanding:
Distributed methods and implementations that allow resources (people, software, computing incl. data) to publish varying state and function for use by diverse clients. Mechanisms to handle arbitrary entity types in a uniform and common framework – including complex types such as heterogeneous data, incomplete and evolving information, and rapidly changing availability of computing, storage and other computational resources. Abstract data streaming and file-based data movement over the WAN/LAN and on exascale architectures to allow for real-time, collaborative decision making for scientific processes.

M0167 Gaps:
Data volumes are increasing. Shipping disks is clumsy, but there is no other obvious solution. Image processing algorithms remain an area of very active research.

M0172 Gaps:
Computation of the simulation is both compute intensive and data intensive. Moreover, due to the unstructured and irregular nature of graph processing, the problem is not easily decomposable, and it is also bandwidth intensive. Hence, a supercomputer is more suitable than cloud-type clusters.

M0173 Gaps:
How can the heterogeneous features of hundreds of millions or billions of individuals be taken into account, along with models of cultural variations across countries that are assigned to individual agents? How can these large models be validated? Different types of models (e.g., multiple contagions) cover disease, emotions, and behaviors, together with modeling of the different urban infrastructure systems in which humans act. With multiple replicates required to assess stochasticity, large amounts of output data are produced, which raises storage requirements.

M0174 Gaps:
Data is abundant in many areas of medicine. One key issue is that there can be too much data per subject (e.g., images, genetic sequences), which can complicate the analysis. The real challenge lies in aligning the data and merging it from multiple sources into a form useful for a combined analysis. Another issue is that a large amount of data may be available about a single subject while the number of subjects themselves is not very high (i.e., data imbalance). This can result in learning algorithms picking up random correlations between the multiple data types as important features in the analysis. Hence, robust learning methods that can faithfully model the data are of paramount importance. Another aspect of data imbalance is the scarcity of positive examples (i.e., cases): the incidence of certain diseases may be rare, making the ratio of cases to controls extremely skewed and making it possible for learning algorithms to model noise instead of the cases.

M0177 Gaps:
Overcoming the systematic errors and bias in large-scale, heterogeneous clinical data to support decision-making in research, patient care, and administrative use cases requires complex multistage processing and analytics that demand substantial computing power. Further, the optimal techniques for accurately and effectively deriving knowledge from observational clinical data are nascent.

M0188 Gaps:
The workhorse for dealing with the heterogeneity of biological data is still the relational database management system (RDBMS). Unfortunately, it does not scale to the current volume of data. NoSQL solutions aim to provide an alternative, but they do not always lend themselves to real-time interactive use or rapid, parallel bulk loading, and they sometimes have issues regarding robustness. Our approach is currently ad hoc and custom, relying mainly on the Linux cluster and the file system to supplement the Oracle RDBMS. The custom solution often relies on knowledge of the peculiarities of the data, allowing us to devise horizontal partitioning schemes as well as inversion of data organization when applicable.

M0191 Gaps:
HTC at scale for simulation science. Flexible data methods at scale for messy data. Machine learning and knowledge systems that drive pixel based data toward biological objects and models.

M0214 Gaps:
Processing the volume of data in near-real time (NRT) to support alerting and situational awareness.

M0215 Gaps:
1. Moving big (or even moderate-size) data over tactical networks. 2. Data currently exists in disparate silos and must be made accessible through a semantically integrated data space. 3. Most critical data is either unstructured or imagery/video, which requires significant processing to extract entities and information.

M0219 Gaps:
Improving recommendation systems that reduce costs and improve quality while providing confidentiality safeguards that are reliable and publicly auditable.

M0222 Gaps:
Improving analytic and modeling systems that provide reliable and robust statistical estimates using data from multiple sources, are scientifically transparent, and provide confidentiality safeguards that are reliable and publicly auditable.


2. needs to support dynamic updates on data, user profiles, and links
(2: M0164, M0209)
M0164 Gaps:
Analytics needs continued monitoring and improvement.

M0209 Gaps:
New statistical techniques for understanding the limitations of simulation data would be beneficial. Often there is not enough computing time to generate all the simulations one wants, so there is a reliance on emulators to bridge the gaps. Techniques are also needed for handling Cholesky decomposition for thousands of simulations with matrices on the order of 1M on a side.


3. needs to support data lifecycle and long-term preservation policy including data provenance
(6: M0141, M0147, M0155, M0163, M0164, M0165)
M0141 Gaps:
Variety of multi-type data: SQL and NoSQL, distributed multi-source data. Visualisation of distributed sensor networks. Data storage and archiving; data exchange and integration; data linkage from the initial observation data to processed data and reported/visualised data.

  • Historical unique data
  • Curated (authorized) reference data (i.e. species names lists), algorithms, software code, workflows
  • Processed (secondary) data serving as input for other researchers
  • Provenance (and persistent identification (PID)) control of data, algorithms, and workflows (a minimal sketch of such a record follows below)
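
A minimal sketch of the provenance and PID control listed above: each derived dataset records a persistent identifier, a content checksum, and the inputs and workflow that produced it. The field names and the PID scheme are hypothetical placeholders.

    import hashlib
    import json
    import uuid

    def provenance_record(payload: bytes, inputs, workflow):
        return {
            "pid": f"hdl:20.5000.demo/{uuid.uuid4()}",   # placeholder PID scheme
            "sha256": hashlib.sha256(payload).hexdigest(),
            "derived_from": inputs,                      # PIDs of source data
            "workflow": workflow,                        # code/version that ran
        }

    rec = provenance_record(
        payload=b"species,count\nParus major,12\n",
        inputs=["hdl:20.5000.demo/raw-obs-001"],
        workflow="cleaning-pipeline v2.1",
    )
    print(json.dumps(rec, indent=2))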

M0147 Gaps:
Preserve data for a long time scale.

M0155 Gaps:
High throughput of data for reduction into higher levels. Discovery of meaningful insights from low-value-density data needs new approaches to deep, complex analysis (e.g., using machine learning, statistical modelling, graph algorithms) that go beyond traditional approaches in space physics.

M0163 Gaps:
Our goal is to contribute to the Big 2 Metadata challenge by systematically reconciling metadata across many complexity levels, with ongoing input from researchers throughout the research process. The current relationship with Richeact is aimed at reaching the interdisciplinary model, in which the meta-grammar itself is experimented with and fully proven able to efficiently bridge the gap between complexity levels as remote as the semantic and the most elementary (big) signals. One example involves cosmological models versus many levels of intermediary models (particles, gases, galactic, nuclear, geometries); others involve computational versus semantic levels.

M0164 Gaps:
Analytics needs continued monitoring and improvement.

M0165 Gaps:
Search of the “deep web” (information behind query front ends); ranking of responses sensitive to intrinsic value (as in PageRank) as well as advertising value; linking to user profiles and social network data.


4. needs to support data validation
(4: M0090, M0161, M0174, M0175)
M0090 Gaps:
Semantics (interpretation of multiple reanalysis products); data movement; database(s) with optimal structuring for 4-dimensional data mining.

M0161 Gaps:
The database contains ~400M documents, roughly 80M unique documents, and receives 5-700k new uploads on a weekday. A major challenge is therefore clustering matching documents together in a computationally efficient way (scalable and parallelized) when they are uploaded from different sources and have been slightly modified via third-party annotation tools or publisher watermarks and cover pages.
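As a hedged illustration of this clustering challenge, the snippet below scores near-duplicate documents with Jaccard similarity over word shingles, the building block such systems typically scale up with MinHash/LSH; the documents and threshold are invented:

    def shingles(text: str, k: int = 3) -> set:
        """k-word shingles of a document, lowercased."""
        words = text.lower().split()
        return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

    def jaccard(a: set, b: set) -> float:
        return len(a & b) / len(a | b) if a | b else 0.0

    doc1 = "Deep learning for big data analytics in genomics"
    doc2 = "Deep learning for big data analytics in genomics (author copy)"
    sim = jaccard(shingles(doc1), shingles(doc2))
    print(f"similarity = {sim:.2f}; cluster together if above, say, 0.7")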

M0174 Gaps:
Data is abundant in many areas of medicine. The key issue is that there can be too much data (e.g., images, genetic sequences) that complicates analysis. The real challenge lies in aligning and merging data from multiple sources into a form useful for combined analysis. Another issue is that sometimes a large amount of data is available about a single subject, but the number of subjects is not very high (i.e., data imbalance). This can result in learning algorithms picking up random correlations between the multiple data types as important features, so robust learning methods that faithfully model the data are of paramount importance. Another aspect of data imbalance is the occurrence of positive examples (i.e., cases): the incidence of certain diseases may be rare, making the ratio of cases to controls extremely skewed and making it possible for learning algorithms to model noise instead of the cases.
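One standard mitigation for the skewed case/control ratio described above is class reweighting; the sketch below computes balanced class weights by hand (a common remedy chosen for illustration, not the use case's prescribed method):

    import numpy as np

    # Hypothetical skewed labels: 1 = case (rare disease), 0 = control.
    y = np.array([0] * 990 + [1] * 10)

    # Balanced weights: n_samples / (n_classes * class_count), so rare
    # positives contribute as much total loss as abundant negatives.
    classes, counts = np.unique(y, return_counts=True)
    weights = {c: len(y) / (len(classes) * n) for c, n in zip(classes, counts)}
    print(weights)  # {0: ~0.505, 1: 50.0}

    # Per-sample weights to pass to a learner's fit(..., sample_weight=...)
    sample_w = np.array([weights[label] for label in y])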

M0175 Gaps:
Currently, the areas of concern associated with BD/FI in a Cloud Eco-system include the aggregating and storing of data (sensitive, toxic, and otherwise) from multiple sources, which can and does create administrative and management problems related to the following:

  • Access control
  • Management/administration
  • Data entitlement
  • Data ownership

However, based upon current analysis, these concerns and issues are widely known and are being addressed via the R&D (Research and Development) SDLC/HDLC (Software Development Life Cycle / Hardware Development Life Cycle).


5. needs to support human annotation for data validation
(4: M0089, M0127, M0140, M0188)
M0089 Gaps:
Extremely large size; multi-dimensional; disease-specific analytics; correlation with other data types (clinical data, -omic data).

M0127 Gaps:
Data processing pipeline requires human inspection and intervention. Limited downstream data pipelines for custom users. Cloud architectures for distributing entire data product collections to downstream consumers should be investigated and adopted.

M0140 Gaps:
For an individualized cohort, we will effectively be building a datamart for each patient, since the critical properties and indices will be specific to each patient. Due to the number of patients, this becomes an impractical approach. Fundamentally, the paradigm changes from relational row-column lookup to semantic graph traversal.

M0188 Gaps:
The workhorse for dealing with the heterogeneity of biological data is still the relational database management system (RDBMS). Unfortunately, it does not scale to the current volume of data. NoSQL solutions aim to provide an alternative, but they do not always lend themselves to real-time interactive use or rapid, parallel bulk loading, and they sometimes have robustness issues. Our current approach is ad hoc and custom, relying mainly on a Linux cluster and the file system to supplement the Oracle RDBMS. The custom solutions often rely on knowledge of the peculiarities of the data, allowing us to devise horizontal partitioning schemes, as well as inversion of data organization where applicable.


6. needs to support prevention of data loss or corruption
(3: M0147, M0155, M0173)
M0147 Gaps:
Preserve data for a long time scale.

M0155 Gaps:
High throughput of data for reduction into higher levels. Discovery of meaningful insights from low-value-density data needs new approaches to deep, complex analysis (e.g., machine learning, statistical modelling, graph algorithms) that go beyond traditional approaches to space physics.

M0173 Gaps:
How to take into account heterogeneous features of hundreds of millions or billions of individuals, and models of cultural variations across countries that are assigned to individual agents? How to validate these large models? Different types of models (e.g., multiple contagions): disease, emotions, behaviors. Modeling of different urban infrastructure systems in which humans act. With multiple replicates required to assess stochasticity, large amounts of output data are produced, raising storage requirements.
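To make the storage pressure from replicates concrete, a back-of-envelope sizing (every figure below is a hypothetical assumption):

    # Hypothetical sizing for stochastic replicates of an agent-based model.
    agents       = 300e6   # individuals simulated
    state_bytes  = 64      # bytes of output state per agent per snapshot
    snapshots    = 180     # e.g., daily output over six months
    replicates   = 25      # runs needed to characterize stochasticity

    total_bytes = agents * state_bytes * snapshots * replicates
    print(f"{total_bytes / 1e12:.1f} TB of raw output")  # ~86.4 TB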


7. needs to support multi-site archival
(1: M0157)
M0157 Gaps:
1. Real-time handling of extremely high volumes of data. 2. Data staging to mirror archives. 3. Integrated data access and discovery. 4. Data processing and analysis.
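A minimal sketch of item 2 (data staging to mirror archives) with an integrity check against silent corruption; the paths and the whole-file checksum approach are simplifying assumptions (large files would be hashed in chunks):

    import hashlib
    import shutil
    from pathlib import Path

    def stage_to_mirror(src: Path, mirror_dir: Path) -> Path:
        """Copy a file to a mirror archive and verify its checksum."""
        mirror_dir.mkdir(parents=True, exist_ok=True)
        dst = mirror_dir / src.name
        shutil.copy2(src, dst)
        # Whole-file hashes for brevity; chunked hashing scales better.
        if (hashlib.sha256(src.read_bytes()).hexdigest()
                != hashlib.sha256(dst.read_bytes()).hexdigest()):
            dst.unlink()
            raise IOError(f"checksum mismatch staging {src} -> {dst}")
        return dst

    # stage_to_mirror(Path("/archive/site_a/scan001.dat"),
    #                 Path("/archive/site_b/"))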


8. needs to support persistent identifier and data traceability
(2: M0140, M0161)
M0140 Gaps:
For an individualized cohort, we will effectively be building a datamart for each patient, since the critical properties and indices will be specific to each patient. Due to the number of patients, this becomes an impractical approach. Fundamentally, the paradigm changes from relational row-column lookup to semantic graph traversal.

M0161 Gaps:
The database contains ~400M documents, roughly 80M unique documents, and receives 5-700k new uploads on a weekday. A major challenge is therefore clustering matching documents together in a computationally efficient way (scalable and parallelized) when they are uploaded from different sources and have been slightly modified via third-party annotation tools or publisher watermarks and cover pages.


9. needs to support standardization, aggregation, and normalization of data from disparate sources
(1: M0177)
M0177 Gaps:
Overcoming the systematic errors and bias in large-scale, heterogeneous clinical data to support decision-making in research, patient care, and administrative use cases requires complex, multistage processing and analytics that demand substantial computing power. Further, the optimal techniques for accurately and effectively deriving knowledge from observational clinical data are nascent.
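As a toy illustration of this requirement, the sketch below maps two hypothetical clinical record layouts onto one normalized schema with SI units; every field name here is invented:

    # Two hypothetical clinical feeds with different layouts and units.
    site_a = {"patient": "P1", "weight_lb": 154.0, "sys_bp": 120}
    site_b = {"pid": "P2", "weight_kg": 70.0, "systolic": 118}

    def normalize(rec: dict) -> dict:
        """Map either layout onto one canonical schema (SI units)."""
        return {
            "patient_id": rec.get("patient") or rec.get("pid"),
            "weight_kg": rec.get("weight_kg",
                                 rec.get("weight_lb", 0.0) * 0.45359237),
            "systolic_bp": rec.get("sys_bp") or rec.get("systolic"),
        }

    combined = [normalize(r) for r in (site_a, site_b)]
    print(combined)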


Others
General Requirement
1. needs to support rich user interface from mobile platforms to access processed results
(6: M0078, M0127, M0129, M0148, M0160, M0164)
M0078 Gaps:
Processing data requires significant computing power, which poses challenges especially to clinical laboratories as they are starting to perform large-scale sequencing. Long-term storage of clinical sequencing data could be expensive. Analysis methods are quickly evolving. Many parts of the genome are challenging to analyze, and systematic errors are difficult to characterize.

M0127 Gaps:
Data processing pipeline requires human inspection and intervention. Limited downstream data pipelines for custom users. Cloud architectures for distributing entire data product collections to downstream consumers should be investigated and adopted.

M0129 Gaps:
A big question is how to use cloud computing to enable better use of climate science's earthbound compute and data resources. Cloud computing provides a new tier in the data services stack: a cloud-based layer where agile customization occurs and enterprise-level products are transformed to meet the specialized requirements of applications and consumers. It helps close the gap between the world of traditional high-performance computing, which, at least for now, resides in a finely tuned climate modeling environment at the enterprise level, and our new customers, whose expectations and manner of work are increasingly influenced by the smart mobility megatrend.

M0148 Gaps:
Perform pre-processing and long-term management of large and varied data. Search huge amounts of data. Ensure high relevancy and recall. Data sources may be distributed across different clouds in the future.

M0160 Gaps:
Dealing with real-time analysis of large volumes of data. Providing a scalable infrastructure to allocate resources, storage space, etc. on demand as data volume grows over time.

M0164 Gaps:
Analytics needs continued monitoring and improvement.


2. needs to support performance monitoring on analytic processing from mobile platforms
(2: M0155, M0167)
M0155 Gaps:
High throughput of data for reduction into higher levels. Discovery of meaningful insights from low-value-density data needs new approaches to deep, complex analysis (e.g., machine learning, statistical modelling, graph algorithms) that go beyond traditional approaches to space physics.

M0167 Gaps:
Data volumes are increasing. Shipping disks is clumsy, but there is no other obvious solution. Image processing algorithms remain a very active research area.


3. needs to support rich visual content search and rendering from mobile platforms
(13: M0078, M0089, M0161, M0164, M0165, M0166, M0176, M0177, M0183, M0184, M0186, M0219, M0223)
M0078 Gaps:
Processing data requires significant computing power, which poses challenges especially to clinical laboratories as they are starting to perform large-scale sequencing. Long-term storage of clinical sequencing data could be expensive. Analysis methods are quickly evolving. Many parts of the genome are challenging to analyze, and systematic errors are difficult to characterize.

M0089 Gaps:
Extremely large size; multi-dimensional; disease-specific analytics; correlation with other data types (clinical data, -omic data).

M0161 Gaps:
The database contains ~400M documents, roughly 80M unique documents, and receives 5-700k new uploads on a weekday. A major challenge is therefore clustering matching documents together in a computationally efficient way (scalable and parallelized) when they are uploaded from different sources and have been slightly modified via third-party annotation tools or publisher watermarks and cover pages.

M0164 Gaps:
Analytics needs continued monitoring and improvement.

M0165 Gaps:
Search of the “deep web” (information behind query front ends); ranking of responses sensitive to intrinsic value (as in PageRank) as well as advertising value; linking to user profiles and social network data.
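Since the gap names PageRank-style ranking by intrinsic value, here is a minimal power-iteration sketch on a toy link graph (the graph is invented; 0.85 is the conventional damping factor):

    import numpy as np

    # Toy web graph: links[i] = pages that page i links to.
    links = {0: [1, 2], 1: [2], 2: [0], 3: [2]}
    n, d = len(links), 0.85

    # Column-stochastic transition matrix from the link structure.
    M = np.zeros((n, n))
    for src, outs in links.items():
        for dst in outs:
            M[dst, src] = 1.0 / len(outs)

    # Power iteration: r = d*M*r + (1-d)/n until convergence.
    r = np.full(n, 1.0 / n)
    for _ in range(100):
        r_next = d * (M @ r) + (1.0 - d) / n
        if np.abs(r_next - r).sum() < 1e-9:
            break
        r = r_next
    print(r / r.sum())  # page 2 accumulates the most rank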

M0166 Gaps:
The translation of scientific results into new knowledge, solutions, policies, and decisions is foundational to the science mission associated with LHC data analysis and HEP in general. However, while advances in experimental and computational technologies have led to exponential growth in the volume, velocity, and variety of data available for scientific discovery, advances in technologies to convert these data into actionable knowledge have fallen far short of what the HEP community needs to deliver timely, high-impact outcomes. Acceleration of the scientific knowledge discovery process is essential if DOE scientists are to continue making major contributions in HEP. Today's worldwide analysis engine, serving several thousand scientists, will have to be commensurately extended in the cleverness of its algorithms, the automation of its processes, and the reach (discovery) of its computing to enable scientific understanding of the detailed nature of the Higgs boson. For example, the approximately forty different analysis methods used to investigate the detailed characteristics of the Higgs boson (many using machine learning techniques) must be combined in a mathematically rigorous fashion to produce an agreed-upon publishable result; a toy illustration of such a combination appears after the specific challenges below.

Specific challenges:
Federated semantic discovery: interfaces, protocols, and environments that support access to, use of, and interoperation across federated sets of resources governed and managed by a mix of different policies and controls, interoperating across streaming and “at rest” data sources. These include: models, algorithms, libraries, and reference implementations for a distributed non-hierarchical discovery service; semantics, methods, and interfaces for life-cycle management (subscription, capture, provenance, assessment, validation, rejection) of heterogeneous sets of distributed tools, services, and resources; a global environment that is robust in the face of failures and outages; and flexible, high-performance data stores (going beyond schema-driven) that scale and are friendly to interactive analytics.

Resource description and understanding:
Distributed methods and implementations that allow resources (people, software, computing, and data) to publish varying state and function for use by diverse clients. Mechanisms to handle arbitrary entity types in a uniform and common framework, including complex types such as heterogeneous data, incomplete and evolving information, and rapidly changing availability of computing, storage, and other computational resources. Abstract data streaming and file-based data movement over the WAN/LAN and on exascale architectures to allow real-time, collaborative decision making for scientific processes.
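Since this gap calls for combining many analysis methods "in a mathematically rigorous fashion", here is a minimal sketch of the simplest such combination, inverse-variance weighting of independent measurements (all numbers invented for illustration):

    import math

    # Hypothetical independent measurements of one quantity: (value, sigma).
    measurements = [(125.1, 0.4), (125.4, 0.3), (124.9, 0.5)]

    # Weight each measurement by 1/sigma^2; the combined uncertainty shrinks.
    w = [1.0 / s**2 for _, s in measurements]
    combined = sum(wi * v for wi, (v, _) in zip(w, measurements)) / sum(w)
    sigma = math.sqrt(1.0 / sum(w))
    print(f"combined = {combined:.2f} +/- {sigma:.2f}")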

M0176 Gaps:
High-throughput computing (HTC) at scale for simulation science. Flexible data methods at scale for messy data. Machine learning and knowledge systems that integrate data from publications, experiments, and simulations to advance goal-driven thinking in materials design.

M0177 Gaps:
Overcoming the systematic errors and bias in large-scale, heterogeneous clinical data to support decision-making in research, patient care, and administrative use cases requires complex, multistage processing and analytics that demand substantial computing power. Further, the optimal techniques for accurately and effectively deriving knowledge from observational clinical data are nascent.

M0183 Gaps:
Translation across diverse and large datasets that cross domains and scales.

M0184 Gaps:
Translation across diverse datasets that cross domains and scales.

M0186 Gaps:
The rapidly growing size of datasets makes scientific analysis a challenge. The need to write data from simulations is outpacing supercomputers’ ability to accommodate this need.

M0219 Gaps:
Improving recommendation systems that reduce costs and improve quality, while providing confidentiality safeguards that are reliable and publicly auditable.

M0223 Gaps:
Scalable real-time analytics over large data streams. Low-latency analytics for operational needs. Federated analytics at utility and microgrid levels. Robust time-series analytics over millions of customers' consumption data. Customer behavior modeling and targeted curtailment requests.
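As a hedged sketch of low-latency analytics over customer consumption streams, the snippet below flags readings more than k·σ from a rolling-window mean on a synthetic meter series (window size, threshold, and data are all assumptions):

    from collections import deque
    import math
    import random

    class RollingAnomalyFlag:
        """Flag readings more than k sigma from a rolling-window mean."""
        def __init__(self, window: int = 96, k: float = 3.0):
            self.buf = deque(maxlen=window)  # 96 = one day of 15-min reads
            self.k = k

        def update(self, x: float) -> bool:
            anomalous = False
            if len(self.buf) >= 8:           # need some history first
                m = sum(self.buf) / len(self.buf)
                var = sum((v - m) ** 2 for v in self.buf) / len(self.buf)
                if var > 0:
                    anomalous = abs(x - m) > self.k * math.sqrt(var)
            self.buf.append(x)
            return anomalous

    random.seed(1)
    flagger = RollingAnomalyFlag()
    for t in range(200):
        reading = random.gauss(1.2, 0.1) + (5.0 if t == 150 else 0.0)  # spike
        if flagger.update(reading):
            print(f"t={t}: anomalous reading {reading:.2f} kWh")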


4. needs to support mobile device data acquisition
(1: M0157)
M0157 Gaps:
1. Real-time handling of extremely high volumes of data. 2. Data staging to mirror archives. 3. Integrated data access and discovery. 4. Data processing and analysis.


5. needs to support security across mobile devices
(1: M0177)
M0177 Gaps:
Overcoming the systematic errors and bias in large-scale, heterogeneous clinical data to support decision-making in research, patient care, and administrative use cases requires complex, multistage processing and analytics that demand substantial computing power. Further, the optimal techniques for accurately and effectively deriving knowledge from observational clinical data are nascent.


 
