NIST Big Data Program

 

Welcome to NIST Big Data Public Working Group (NBD-PWG)!
Points of Contact
Wo Chang, NIST / ITL, Digital Data Advisor
James St Pierre, NIST / ITL, Deputy Director

Use Cases and Requirements -- Summary
Use Cases V1.0 Submission [Requirements: UseCase | Summary | General | Gen+Ref | Gen+Ref+Gaps | Gen+Detail ]
(see M0180 to download the full package and M0203 for the full high-level use case descriptions)


Each use case below is summarized by the following fields: No., Use Case, Volume, Velocity, Variety, Software, Analytics, Data Sources, Transformation, Capabilities, Data Consumer, Security & Privacy, Lifecycle Management, Others. A "--" indicates that no requirement was identified for that field.
1. M0147 — Census 2010 and 2000
Volume: 380 TB
Velocity: Static for 75 years
Variety: Scanned documents
Software: Robust archival storage
Analytics: None for 75 years
Data Sources: 1. large document format from a centralized storage
Transformation: --
Capabilities: 1. large centralized storage (storage)
Data Consumer: --
Security & Privacy: 1. Title 13 data
Lifecycle Management: 1. long-term preservation of data as-is for 75 years; 2. long-term preservation at the bit level; 3. curation process including format transformation; 4. access and analytics processing after 75 years; 5. no data loss
Others: --
2. M0148 — NARA: Search, Retrieve, Preservation
Volume: Hundreds of terabytes, and growing
Velocity: Data loaded in batches, so bursty
Variety: Unstructured and structured data: textual documents, emails, photos, scanned documents, multimedia, social networks, web sites, databases, etc.
Software: Custom software, commercial search products, commercial databases
Analytics: Crawl/index; search; ranking; predictive search; data categorization (sensitive, confidential, etc.); Personally Identifiable Information (PII) detection and flagging (see the sketch after this entry)
Data Sources: 1. distributed data sources; 2. large data storage; 3. bursty data ranging from GB to hundreds of TB; 4. wide variety of data formats including unstructured and structured data; 5. distributed data sources in different clouds
Transformation: 1. crawl and index from distributed data sources; 2. various analytics processing including ranking, data categorization, and PII detection; 3. pre-processing of data; 4. long-term preservation management of large varied datasets; 5. huge amount of data with high relevancy and recall
Capabilities: 1. large data storage; 2. various storage systems such as NetApp, Hitachi, magnetic tape
Data Consumer: 1. high relevancy and high recall from search; 2. high accuracy from categorization of records; 3. various storage systems such as NetApp, Hitachi, magnetic tape
Security & Privacy: 1. security policy
Lifecycle Management: 1. pre-process for virus scan; 2. file format identification; 3. indexing; 4. categorize records
Others: 1. mobile search with similar interfaces/results as desktop
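
The PII detection and flagging requirement above lends itself to a short illustration. The following is a minimal Python sketch, assuming simple US-style SSN, email, and phone number patterns; NARA's actual detection rules, categories, and tooling are not described in this summary.

# Illustrative only: a minimal regex-based PII flagger with assumed patterns.
import re

PII_PATTERNS = {
    "ssn":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "phone": re.compile(r"\b\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}\b"),
}

def flag_pii(text):
    """Return a dict of PII type -> list of matches found in a document."""
    return {kind: rx.findall(text) for kind, rx in PII_PATTERNS.items()
            if rx.search(text)}

if __name__ == "__main__":
    doc = "Contact John at john.doe@example.gov or 301-555-0100; SSN 123-45-6789."
    print(flag_pii(doc))   # flags the email, phone number, and SSN above
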
3. M0219 — Statistical Survey Response Improvement
Volume: Approximately 1 PB
Velocity: Variable; field data streamed continuously, Census was ~150 million records transmitted
Variety: Strings and numerical data
Software: Hadoop, Spark, Hive, R, SAS, Mahout, Allegrograph, MySQL, Oracle, Storm, BigMemory, Cassandra, Pig
Analytics: Recommendation systems, continued monitoring
Data Sources: 1. data size approximately one petabyte
Transformation: 1. analytics are required for recommendation systems, continued monitoring and general survey improvement
Capabilities: 1. software includes Hadoop, Spark, Hive, R, SAS, Mahout, Allegrograph, MySQL, Oracle, Storm, BigMemory, Cassandra, Pig
Data Consumer: 1. data visualization for data review, operational activity and general analysis; continues to evolve
Security & Privacy: 1. improving recommendation systems that reduce costs and improve quality while providing confidentiality safeguards that are reliable and publicly auditable; 2. keep all data confidential and secure; all processes must be auditable for security and confidentiality as required by various legal statutes
Lifecycle Management: 1. high veracity on data, and systems must be very robust; the semantic integrity of conceptual metadata concerning what exactly is measured and the resulting limits of inference remains a challenge
Others: 1. mobile access
4. M0222 — Non-Traditional Data in Statistical Survey Response Improvement
Volume: N/A
Velocity: N/A
Variety: Survey data, other government administrative data, web-scraped data, wireless data, e-transaction data, potentially social media data and positioning data from various sources
Software: Hadoop, Spark, Hive, R, SAS, Mahout, Allegrograph, MySQL, Oracle, Storm, BigMemory, Cassandra, Pig
Analytics: New analytics to create reliable information from non-traditional disparate sources
Data Sources: --
Transformation: 1. analytics to create reliable estimates using data from traditional survey sources, government administrative data sources and non-traditional sources from the digital economy
Capabilities: 1. software includes Hadoop, Spark, Hive, R, SAS, Mahout, Allegrograph, MySQL, Oracle, Storm, BigMemory, Cassandra, Pig
Data Consumer: 1. data visualization for data review, operational activity and general analysis; continues to evolve
Security & Privacy: 1. keep all data confidential and secure; all processes must be auditable for security and confidentiality as required by various legal statutes
Lifecycle Management: 1. high veracity on data, and systems must be very robust; the semantic integrity of conceptual metadata concerning what exactly is measured and the resulting limits of inference remains a challenge
Others: --
5. M0175 — Cloud Eco-System
Volume: N/A
Velocity: Real time
Variety: No
Software: Hadoop, RDBMS, XBRL
Analytics: Fraud detection
Data Sources: 1. real-time ingestion of data
Transformation: 1. real-time analytics essential
Capabilities: --
Data Consumer: --
Security & Privacy: 1. strong security and privacy constraints
Lifecycle Management: --
Others: 1. mobile access
6. M0161 — Mendeley
Volume: 15 TB presently, growing about 1 TB/month
Velocity: Currently Hadoop batch jobs are scheduled daily; future is real-time recommendation
Variety: PDF documents and log files of social network and client activities
Software: Hadoop, Scribe, Hive, Mahout, Python
Analytics: Standard libraries for machine learning and analytics, LDA, custom-built reporting tools for aggregating readership and social activities per document
Data Sources: 1. file-based documents with constant new uploads; 2. variety of file types such as PDF, social network log files, client activity images, spreadsheets, presentation files
Transformation: 1. standard machine learning and analytics libraries; 2. scalable, parallelized and efficient way of matching between documents; 3. third-party annotation tools or publisher watermarks and cover pages
Capabilities: 1. EC2 with HDFS (infrastructure); 2. S3 (storage); 3. Hadoop (platform); 4. Scribe, Hive, Mahout, Python (language); 5. moderate storage (15 TB, growing 1 TB/month); 6. needs both batch and real-time processing
Data Consumer: 1. custom-built reporting tools; 2. visualization tools such as network graphs, scatterplots, etc.
Security & Privacy: 1. access controls for who is reading what content
Lifecycle Management: 1. metadata management from PDF extraction; 2. identification of document duplication; 3. persistent identifiers; 4. metadata correlation between data repositories such as CrossRef, PubMed and arXiv
Others: 1. Windows, Android and iOS mobile devices for content deliverables from Windows desktops
7. M0164 — Netflix Movie Service
Volume: Summer 2012: 25 million subscribers; 4 million ratings per day; 3 million searches per day; 1 billion hours streamed in June 2012; cloud storage 2 petabytes (June 2013)
Velocity: Media (video and properties) and rankings continually updated
Variety: Data varies from digital media to user rankings, user profiles and media properties for content-based recommendations
Software: Hadoop and Pig; Cassandra; Teradata
Analytics: Personalized recommender systems using logistic/linear regression, elastic nets, matrix factorization (see the sketch after this entry), clustering, latent Dirichlet allocation, association rules, gradient boosted decision trees and others; also streaming video delivery
Data Sources: 1. user profiles and ranking info
Transformation: 1. streaming video content to multiple clients; 2. analytic processing for matching clients' interest in movie selection; 3. various analytic processing techniques for consumer personalization; 4. robust learning algorithms; 5. continued analytic processing based on monitoring and performance results
Capabilities: 1. Hadoop (platform); 2. Pig (language); 3. Cassandra and Hive; 4. huge numbers of subscribers, ratings, and searches per day (DB); 5. huge storage (2 PB); 6. I/O intensive processing
Data Consumer: 1. streaming and rendering media
Security & Privacy: 1. preservation of users' privacy and digital rights for media
Lifecycle Management: 1. continued ranking and updating based on user profile and analytic results
Others: 1. smart interface accessing movie content on mobile platforms
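
Matrix factorization is one of the recommender techniques listed above. The following toy sketch factors a small synthetic ratings matrix with alternating least squares; it is only an illustration of the technique, not Netflix's production algorithm, and every rating value in it is invented.

# Toy matrix-factorization recommender (alternating least squares); synthetic data.
import numpy as np

def als(R, k=2, reg=0.1, iters=20):
    """Factor a (users x items) rating matrix R ~= U @ V.T, zeros = unrated."""
    n_users, n_items = R.shape
    rng = np.random.default_rng(0)
    U, V = rng.normal(size=(n_users, k)), rng.normal(size=(n_items, k))
    mask = R > 0
    for _ in range(iters):
        for u in range(n_users):          # solve for each user's latent factors
            idx = mask[u]
            A = V[idx].T @ V[idx] + reg * np.eye(k)
            U[u] = np.linalg.solve(A, V[idx].T @ R[u, idx])
        for i in range(n_items):          # solve for each item's latent factors
            idx = mask[:, i]
            A = U[idx].T @ U[idx] + reg * np.eye(k)
            V[i] = np.linalg.solve(A, U[idx].T @ R[idx, i])
    return U, V

R = np.array([[5, 4, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [0, 1, 5, 4]], dtype=float)
U, V = als(R)
print(np.round(U @ V.T, 1))   # predicted ratings, including the unrated cells
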
8. M0165 — Web Search
Volume: 45B web pages total, 500M photos uploaded each day, 100 hours of video uploaded to YouTube each minute
Velocity: Real-time updating and real-time response to queries
Variety: Multiple media
Software: MapReduce + Bigtable; Dryad + Cosmos; PageRank (see the sketch after this entry); final step essentially a recommender engine
Analytics: Crawling; searching including topic-based search; ranking; recommending
Data Sources: 1. needs to support distributed data sources; 2. needs to support streaming data; 3. needs to support multimedia content
Transformation: 1. dynamic fetching of content over the network; 2. linking user profiles and social network data
Capabilities: 1. petabytes of text and rich media (storage)
Data Consumer: 1. search time in ~0.1 seconds; 2. top 10 ranked results; 3. page layout (visual)
Security & Privacy: 1. access control; 2. needs to protect sensitive content
Lifecycle Management: 1. purge data after a certain time interval (few months); 2. data cleaning
Others: 1. mobile search and rendering
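
PageRank, named above, can be shown at toy scale. The sketch below runs a plain power iteration over a hypothetical four-page link graph; web-scale ranking applies the same idea over MapReduce/Bigtable-class infrastructure with many additional signals.

# Minimal PageRank power iteration over a tiny, invented link graph.
import numpy as np

def pagerank(links, damping=0.85, tol=1e-9):
    """links[i] = list of page indices that page i points to."""
    n = len(links)
    rank = np.full(n, 1.0 / n)
    while True:
        new = np.full(n, (1.0 - damping) / n)
        for i, outs in enumerate(links):
            if outs:
                new[np.array(outs)] += damping * rank[i] / len(outs)
            else:                         # dangling page: spread rank evenly
                new += damping * rank[i] / n
        if np.abs(new - rank).sum() < tol:
            return new
        rank = new

links = [[1, 2], [2], [0], [0, 2]]        # a 4-page toy web graph
print(np.round(pagerank(links), 3))
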
9. M0137 — BC/DR (Business Continuity/Disaster Recovery) Within a Cloud Eco-System
Volume: Terabytes up to petabytes
Velocity: Can be real time for recent changes
Variety: Must work for all data
Software: Hadoop, MapReduce, open-source, and/or vendor proprietary such as AWS (Amazon Web Services), Google Cloud Services, and Microsoft
Analytics: Robust backup
Data Sources: --
Transformation: 1. robust backup algorithm; 2. replicate recent changes
Capabilities: 1. Hadoop; 2. commercial cloud services
Data Consumer: --
Security & Privacy: 1. strong security for many applications
Lifecycle Management: --
Others: --
10. M0103 — Cargo Shipping
Volume: N/A
Velocity: Needs to become real-time; currently updated at events
Variety: Event based
Software: N/A
Analytics: Distributed event analysis identifying problems
Data Sources: 1. centralized and real-time distributed sites/sensors
Transformation: 1. tracking items based on unique identification with sensor information and GPS coordinates; 2. real-time updates on tracked items
Capabilities: 1. Internet connectivity
Data Consumer: --
Security & Privacy: 1. security policy
Lifecycle Management: --
Others: --
11. M0162 — Materials Data for Manufacturing
Volume: 500,000 material types in the 1980's; much growth since then
Velocity: Ongoing increase in new materials
Variety: Many datasets with no standards
Software: National programs (Japan, Korea, and China), application areas (EU nuclear program), proprietary systems (Granta, etc.)
Analytics: No broadly applicable analytics
Data Sources: 1. distributed data repositories for more than 500,000 commercial materials; 2. many varieties of datasets; 3. text, graphics, and images
Transformation: 1. hundreds of independent variables, collected to create robust datasets
Capabilities: --
Data Consumer: 1. visualization for materials discovery from many independent variables; 2. visualization tools for multi-variable materials
Security & Privacy: 1. protection of proprietary or sensitive data; 2. tools to mask proprietary information
Lifecycle Management: 1. handling data whose quality is poor or unknown
Others: --
12. M0176 — Simulation-Driven Materials Genomics
Volume: 100 TB (current), 500 TB within 5 years; scalable key-value and object store databases needed
Velocity: Simulations add data regularly
Variety: Varied data and simulation results
Software: MongoDB, GPFS, PyMatGen, FireWorks, VASP, ABINIT, NWChem, BerkeleyGW, varied community codes
Analytics: MapReduce and search that join simulation and experimental data (see the sketch after this entry)
Data Sources: 1. data streams from peta/exascale centralized simulation systems; 2. distributed web dataflows from central gateway to users
Transformation: 1. high-throughput computing real-time data analysis for web-like responsiveness; 2. mashup of simulation outputs across codes; 3. search and crowd-driven analysis with a computation backend flexible enough for new targets; 4. MapReduce and search to join simulation and experimental data
Capabilities: 1. massive (150K cores) legacy infrastructure (infrastructure); 2. GPFS (General Parallel File System) (storage); 3. MongoDB systems (platform); 4. 10 Gb networking; 5. various analytic tools such as PyMatGen, FireWorks, VASP, ABINIT, NWChem, BerkeleyGW, varied community codes; 6. large storage (storage); 7. scalable key-value and object store (platform); 8. data streams from peta/exascale centralized simulation systems
Data Consumer: 1. browser-based search of growing material data
Security & Privacy: 1. sandboxes as independent working areas between different data stakeholders; 2. policy-driven federation of datasets
Lifecycle Management: 1. validation and UQ of simulation with experimental data; 2. UQ in results from multiple datasets
Others: 1. mobile apps to access materials genomics information
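
The "MapReduce and search that join simulation and experimental data" requirement can be illustrated with a pure-Python map/shuffle/reduce join keyed by material ID. The material IDs, property names, and values below are invented for illustration; the use case does not specify its schemas.

# Illustrative map/reduce-style join of simulation and experimental records.
from collections import defaultdict

simulated = [("mp-149", {"band_gap_sim": 0.61}), ("mp-804", {"band_gap_sim": 3.4})]
measured  = [("mp-149", {"band_gap_exp": 1.12}), ("mp-804", {"band_gap_exp": 3.5})]

def map_phase(records, source):
    for material_id, payload in records:
        yield material_id, (source, payload)

def reduce_phase(mapped):
    grouped = defaultdict(dict)
    for material_id, (source, payload) in mapped:   # shuffle/group by key
        grouped[material_id].update(payload)
    return dict(grouped)

mapped = list(map_phase(simulated, "sim")) + list(map_phase(measured, "exp"))
print(reduce_phase(mapped))
# e.g. {'mp-149': {'band_gap_sim': 0.61, 'band_gap_exp': 1.12}, ...}
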
13. M0213 — Large-Scale Geospatial Analysis and Visualization
Volume: Imagery: 100s of terabytes; vector data: 10s of gigabytes but billions of points
Velocity: Vectors transmitted in near real time
Variety: Imagery; vector data (various formats: shape files, KML, text streams) and many object structures
Software: Geospatially enabled RDBMS, ESRI ArcServer, Geoserver
Analytics: Closest point of approach, deviation from route, point density over time, Principal Component Analysis (PCA) and Independent Component Analysis (ICA)
Data Sources: 1. geospatial data requires unique approaches to indexing and distributed analysis
Transformation: 1. analytics include closest point of approach, deviation from route, point density over time, PCA and ICA; 2. geospatial data requires unique approaches to indexing and distributed analysis
Capabilities: 1. software includes geospatially enabled RDBMS, geospatial server/analysis software (ESRI ArcServer, Geoserver)
Data Consumer: 1. visualization with GIS at high and low network bandwidths and on dedicated facilities and handhelds
Security & Privacy: 1. data is sensitive and must be completely secure in transit and at rest (particularly on handhelds)
Lifecycle Management: --
Others: --
14. M0214 — Object Identification and Tracking
Volume: FMV: 30-60 frames per second at full-color 1080p resolution; WALF: 1-10 frames per second at 10K x 10K full-color resolution
Velocity: Real time
Variety: A few standard imagery or video formats
Software: Custom software and tools including traditional RDBMSs and display tools
Analytics: Visualization as overlays on a GIS; analytics are basic object detection and integration with sophisticated situation awareness tools with data fusion
Data Sources: 1. real-time data: FMV at 30-60 frames per second at full-color 1080p resolution and WALF at 1-10 frames per second at 10K x 10K full-color resolution
Transformation: 1. rich analytics with object identification, pattern recognition, crowd behavior, economic activity and data fusion
Capabilities: 1. wide range of custom software and tools including traditional RDBMSs and display tools; 2. several network requirements; 3. GPU usage important
Data Consumer: 1. visualization of extracted outputs will typically be as overlays on a geospatial display; overlay objects should link back to the originating image/video segment; 2. output in the form of OGC-compliant web features or standard geospatial files (shape files, KML)
Security & Privacy: 1. significant security and privacy; sources and methods cannot be compromised; the enemy should not be able to know what we see
Lifecycle Management: 1. veracity of extracted objects
Others: --
15. M0215 — Intelligence Data Processing and Analysis
Volume: 10s of terabytes to 100s of petabytes; individual warfighters (first responders) would have at most 1-100s of gigabytes
Velocity: Much real time; an imagery intelligence device can gather a petabyte in a few hours
Variety: Text files, raw media, imagery, video, audio, electronic data, human-generated data
Software: Hadoop, Accumulo (Big Table), Solr, Natural Language Processing, Puppet (for deployment and security) and Storm; GIS
Analytics: Near-real-time alerts based on patterns and baseline changes, link analysis, geospatial analysis, text analytics (sentiment, entity extraction, etc.)
Data Sources: 1. much of the data is real-time, with processing at worst near real time; 2. data currently exists in disparate silos which must be accessible through a semantically integrated data space; 3. diverse data includes text files, raw media, imagery, video, audio, electronic data, human-generated data
Transformation: 1. analytics include NRT alerts based on patterns and baseline changes
Capabilities: 1. tolerance of unreliable networks to warfighters and remote sensors; 2. up to 100s of PBs of data supported by modest to large clusters and clouds; 3. software includes Hadoop, Accumulo (Big Table), Solr, NLP (several variants), Puppet (for deployment and security), Storm, custom applications and visualization tools
Data Consumer: 1. primary visualizations will be geospatial overlays (GIS) and network diagrams
Security & Privacy: 1. data must be protected against unauthorized access, disclosure and tampering
Lifecycle Management: 1. data provenance (e.g., tracking of all transfers and transformations) must be maintained over the life of the data
Others: --
16. M0177 — EMR Data
Volume: 12 million patients, more than 4 billion discrete clinical observations; >20 TB raw data
Velocity: 0.5-1.5 million new real-time clinical transactions added per day
Variety: Broad variety of data from doctors, nurses, laboratories and instruments
Software: Teradata, PostgreSQL, MongoDB, Hadoop, Hive, R
Analytics: Information retrieval methods (TF-IDF; see the sketch after this entry), natural language processing, maximum likelihood estimators and Bayesian networks
Data Sources: 1. heterogeneous, high-volume, diverse data sources; 2. volume: >12 million entities (patients), >4 billion records or data points (discrete clinical observations), aggregate of >20 TB raw data; 3. velocity: 500,000-1.5 million new transactions per day; 4. variety: formats include numeric, structured numeric, free text, structured text, discrete nominal, discrete ordinal, discrete structured, binary large blobs (images and video); 5. data evolves over time in a highly variable fashion; 6. a comprehensive and consistent view of data across sources and over time
Transformation: 1. a comprehensive and consistent view of data across sources and over time; 2. analytic techniques: information retrieval, natural language processing, machine learning decision models, maximum likelihood estimators and Bayesian networks
Capabilities: 1. Hadoop, Hive, R, Unix-based; 2. Cray supercomputer; 3. Teradata, PostgreSQL, MongoDB; 4. various, with significant I/O intensive processing
Data Consumer: 1. needs to provide results of analytics for use by data consumers/stakeholders, i.e., those who did not actually perform the analysis; specific visualization techniques
Security & Privacy: 1. data consumers may access data directly and refer to the results of analytics performed by informatics research scientists and health service researchers; 2. all health data is protected in compliance with governmental regulations; 3. protection of data in accordance with data providers' policies; 4. security and privacy policies may be unique to a subset of the data; 5. robust security to prevent data breaches
Lifecycle Management: 1. standardize, aggregate, and normalize data from disparate sources; 2. reduce errors and bias; 3. common nomenclature and classification of content across disparate sources; this is particularly challenging in the health IT space, as taxonomies continue to evolve (SNOMED, ICD-9 and future ICD-10, etc.)
Others: 1. security across mobile devices
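
TF-IDF retrieval, listed among the analytics above, is illustrated below with scikit-learn on three made-up clinical note snippets; it shows only the generic technique, not the use case's actual pipeline or data.

# Small TF-IDF retrieval sketch over invented clinical note snippets.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

notes = [
    "patient reports chest pain and shortness of breath",
    "type 2 diabetes mellitus, metformin dose increased",
    "follow-up for hypertension, blood pressure controlled",
]
vectorizer = TfidfVectorizer(stop_words="english")
doc_matrix = vectorizer.fit_transform(notes)          # TF-IDF weight per term/doc

query_vec = vectorizer.transform(["shortness of breath"])
scores = cosine_similarity(query_vec, doc_matrix).ravel()
best = scores.argmax()
print(best, round(float(scores[best]), 3), notes[best])
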
17. M0089 — Pathology Imaging
Volume: 1 GB raw image data + 1.5 GB analytical results per 2D image; 1 TB raw image data + 1 TB analytical results per 3D image; 1 PB data per moderated hospital per year
Velocity: Once generated, data will not be changed
Variety: Images
Software: MPI for image analysis; MapReduce + Hive with spatial extension
Analytics: Image analysis, spatial queries and analytics, feature clustering and classification
Data Sources: 1. high-resolution spatial digitized pathology images; 2. various image quality analysis algorithms; 3. various image data formats, especially BigTIFF with structured data for analytical results; 4. image analysis, spatial queries and analytics, feature clustering and classification
Transformation: 1. high-performance image analysis to extract spatial information; 2. spatial queries and analytics, and feature clustering and classification; 3. analytic processing on huge multi-dimensional datasets with the ability to correlate with other data types such as clinical data and -omic data
Capabilities: 1. legacy system and cloud (computing cluster); 2. huge legacy and new storage such as SAN or HDFS (storage); 3. high-throughput network link (networking); 4. MPI image analysis, MapReduce, Hive with spatial extension (sw pkgs)
Data Consumer: 1. visualization for validation and training
Security & Privacy: 1. security and privacy protection for protected health information
Lifecycle Management: 1. human annotations for validation
Others: 1. 3D visualization and rendering on mobile platforms
18. M0191 — Computational Bioimaging
Volume: Medical diagnostic imaging is around 70 PB annually; a single scan on emerging machines is 32 TB
Velocity: Volume of data acquisition requires HPC back end
Variety: Multi-modal imaging with disparate channels of data
Software: Scalable key-value and object store databases needed; ImageJ, OMERO, VolRover, advanced segmentation and feature detection methods
Analytics: Machine learning (SVM and RF) for classification and recommendation services
Data Sources: 1. distributed multi-modal high-resolution experimental sources of bioimages (instruments); 2. 50 TB of data; data formats include images
Transformation: 1. high-throughput computing with responsive analysis; 2. segmentation of regions of interest, crowd-based selection and extraction of features, object classification, organization, and search; 3. advance biosciences discovery through big data techniques and extreme-scale computing: in-database processing and analytics, machine learning (SVM and RF) for classification and recommendation services, advanced algorithms for massive image analysis, high-performance computational solutions; 4. massive data analysis toward massive imaging data sets
Capabilities: 1. ImageJ, OMERO, VolRover, advanced segmentation and feature detection methods from applied math researchers; scalable key-value and object store databases needed; 2. NERSC's Hopper infrastructure; 3. database and image collections; 4. 10 Gb and future 100 Gb plus advanced networking (SDN)
Data Consumer: 1. 3D structural modeling
Security & Privacy: 1. significant but optional security and privacy, including secure servers and anonymization
Lifecycle Management: 1. workflow components include data acquisition, storage, enhancement, noise minimization
Others: --
19. M0078 — Genomic Measurements
Volume: >100 TB in 1-2 years at NIST; the healthcare community will have many PBs
Velocity: DNA sequencers can generate ~300 GB of compressed data per day
Variety: File formats not well-standardized, though some standards exist; generally structured data
Software: Open-source sequencing bioinformatics software from academic groups
Analytics: Processing of raw data to produce variant calls; clinical interpretation of variants
Data Sources: 1. high-throughput compressed data (300 GB/day) from various DNA sequencers; 2. distributed data sources (sequencers); 3. various file formats with both structured and unstructured data
Transformation: 1. processing of raw data into variant calls; 2. machine learning for complex analysis of systematic errors from sequencing technologies, which are hard to characterize
Capabilities: 1. legacy computing cluster and other PaaS and IaaS (computing cluster); 2. huge data storage in the PB range (storage); 3. Unix-based legacy sequencing bioinformatics software (sw pkg)
Data Consumer: 1. data format for genome browsers
Security & Privacy: 1. security and privacy protection on health records and clinical research databases
Lifecycle Management: --
Others: 1. mobile platforms for physicians accessing genomic data (mobile device)
20. M0188 — Comparative Analysis
Volume: 50 TB
Velocity: New sequencers stream in data at a growing rate
Variety: Biological data is inherently heterogeneous, complex, structural, and hierarchical; besides core genomic data, new types of 'omics' data such as transcriptomics, methylomics, and proteomics
Software: Standard bioinformatics tools (BLAST, HMMER, multiple alignment and phylogenetic tools, gene callers, sequence feature predictors), Perl/Python wrapper scripts
Analytics: Descriptive statistics, statistical significance in hypothesis testing, data clustering and classification
Data Sources: 1. multiple centralized data sources; 2. proteins and their structural features, core genomic data, new types of 'omics' data such as transcriptomics, methylomics, and proteomics describing gene expression; 3. front end: real-time interactive Web UI; back-end data loading and processing must keep up with exponential growth of sequence data due to the rapid drop in cost of sequencing technology; 4. heterogeneous, complex, structural, and hierarchical biological data; 5. metagenomic samples can vary by several orders of magnitude, such as several hundred thousand genes to a billion genes
Transformation: 1. sequencing and comparative analysis techniques for highly complex data; 2. descriptive statistics
Capabilities: 1. huge data storage; 2. scalable RDBMS for heterogeneous biological data; 3. real-time rapid and parallel bulk loading; 4. Oracle RDBMS, SQLite files, flat text files, Lucy (a version of Lucene) for keyword searches, BLAST databases, USEARCH databases; 5. Linux cluster, Oracle RDBMS server, large memory machines, standard Linux interactive hosts
Data Consumer: 1. real-time interactive parallel bulk loading capability; 2. interactive Web UI, backend precomputations, batch job computation submission from the UI; 3. download of assembled and annotated datasets for offline analysis; 4. ability to query and browse data via interactive Web UI; 5. visualization of data structure at different levels of resolution; ability to view abstract representations of highly similar data
Security & Privacy: 1. login security (username and password); 2. creation of user accounts to submit and access datasets via the web interface; 3. single sign-on (SSO) capability
Lifecycle Management: 1. methods to improve data quality required; 2. data clustering, classification, reduction; 3. integrate new data/content into the system's data store; annotate data
Others: --
21. M0140 — Individualized Diabetes Management
Volume: 5 million patients
Velocity: Not real-time, but updated periodically
Variety: Each patient typically has 100 controlled vocabulary values and 1000 continuous values; most values are time stamped
Software: HDFS supplementing Mayo internal data warehouse called Enterprise Data Trust (EDT)
Analytics: Integrating data into a semantic graph, using graph traversal to replace SQL joins (see the sketch after this entry); developing semantic graph mining algorithms to identify graph patterns, index graphs, and search graphs; indexed HBase; custom code to develop new patient properties from stored data
Data Sources: 1. distributed EHR data; 2. over 5 million patients with thousands of properties each, and many more derived from primary values; 3. each record has a range of 100 to 100,000 data property values: an average of 100 controlled vocabulary values and 1000 continuous values; 4. not real-time, but data updated periodically; data is timestamped with the time of observation (the time the value is recorded); 5. structured data about a patient falls into two main categories: data with controlled vocabulary (CV) property values, and continuous property values (which are recorded/captured more frequently); 6. data consists of text and continuous numerical values
Transformation: 1. data integration using ontological annotation and taxonomies; 2. parallel retrieval algorithms for both indexed and custom searches to identify data of interest: patient cohorts, patients meeting certain criteria, patients sharing similar characteristics; 3. distributed graph mining algorithms, pattern analysis and graph indexing, pattern searching on RDF triple graphs; 4. robust statistical analysis tools to manage false discovery rate, determine true subgraph significance, validate results, and eliminate false positive/false negative results; 5. semantic graph mining algorithms to identify graph patterns, index and search graphs; 6. semantic graph traversal
Capabilities: 1. data warehouse, open-source indexed HBase; 2. supercomputers, cloud and parallel computing; 3. I/O intensive processing; 4. HDFS storage; 5. custom code to develop new properties from stored data
Data Consumer: --
Security & Privacy: 1. protection of health data in accordance with legal requirements (e.g., HIPAA) and privacy policies; 2. security policies for different user roles
Lifecycle Management: 1. data annotated based on domain ontologies or taxonomies; 2. traceability of data from origin (initial point of collection) through to use; 3. conversion of data from existing data warehouse into RDF triples
Others: 1. mobile access
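
The idea of replacing SQL joins with semantic graph traversal can be sketched over RDF-style (subject, predicate, object) triples. The predicates and values below are hypothetical and far simpler than the domain ontologies the use case describes.

# Illustrative traversal over RDF-style triples instead of a table join.
from collections import defaultdict

triples = [
    ("patient:1", "hasDiagnosis", "dx:diabetes_type2"),
    ("patient:1", "hasMedication", "rx:metformin"),
    ("patient:2", "hasDiagnosis", "dx:diabetes_type2"),
    ("patient:2", "hasMedication", "rx:insulin"),
    ("patient:3", "hasDiagnosis", "dx:hypertension"),
]

# Index triples by (subject, predicate) so traversal is a dictionary walk.
index = defaultdict(list)
for s, p, o in triples:
    index[(s, p)].append(o)

def objects(subject, predicate):
    return index[(subject, predicate)]

# "Patients diagnosed with type 2 diabetes, and the medications they take."
cohort = {s for s, p, o in triples
          if p == "hasDiagnosis" and o == "dx:diabetes_type2"}
for patient in sorted(cohort):
    print(patient, objects(patient, "hasMedication"))
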
22. M0174 — Statistical Relational Artificial Intelligence for Health Care
Volume: 100s of GBs for a single cohort of a few hundred people; when dealing with millions of patients, this can be on the order of 1 petabyte
Velocity: Electronic Health Records can be constantly updated; in other controlled studies, the data often comes in batches at regular intervals
Variety: A critical feature; data is typically in multiple tables that need to be merged in order to perform the analysis
Software: Mainly Java-based, in-house tools are used to process the data
Analytics: Relational probabilistic models (Statistical Relational AI) learned from multiple data types
Data Sources: 1. centralized data, with some data retrieved from internet sources; 2. range from 100s of GB for a sample size to 1 petabyte for very large studies; 3. both constant updates/additions (to a subset of data) and scheduled batch inputs; 4. large, multi-modal, longitudinal data; 5. rich relational data comprised of multiple tables, and different data types such as imaging, EHR, demographic, genetic and natural language data requiring rich representation; 6. unpredictable arrival rates; in many cases data arrive in real time
Transformation: 1. relational probabilistic models / probability theory; the software learns models from multiple data types and can possibly integrate the information and reason about complex queries; 2. robust and accurate learning methods to account for 'data imbalance' (where a large amount of data is available for a small number of subjects); 3. learning algorithms to identify skews in data, so as to not incorrectly model 'noise'; 4. learned models can be generalized and refined in order to be applied to diverse sets of data; 5. challenging: must accept data in different modalities (and from disparate sources)
Capabilities: 1. Java, some in-house tools, (relational) database and NoSQL stores; 2. cloud and parallel computing; 3. high-performance computer, 48 GB RAM (to perform analysis for a moderate sample size); 4. clusters for large datasets; 5. 200 GB - 1 TB hard drive for test data
Data Consumer: 1. visualization of subsets of very large data
Security & Privacy: 1. secure handling and processing of data is of crucial importance in medical domains
Lifecycle Management: 1. merging multiple tables before analysis; 2. methods to validate data to minimize errors
Others: --
23. M0172 — World Population Scale Epidemiological Study
Volume: 100 TB
Velocity: Data feeding into the simulation is small, but real-time data generated by the simulation is massive
Variety: Can be rich, with various population activities and geographical, socio-economic, and cultural variations
Software: Charm++, MPI
Analytics: Simulations on a synthetic population
Data Sources: 1. file-based synthetic population at either centralized or distributed sites; 2. large volume of real-time output data; 3. variety of output datasets depending on the complexity of the model
Transformation: 1. compute-intensive and data-intensive computation, like a supercomputer's performance; 2. unstructured and irregular nature of graph processing; 3. summary of various runs of the simulation
Capabilities: 1. movement of very large amounts of data for visualization (networking); 2. distributed MPI-based simulation system (platform); 3. Charm++ on multiple nodes (software); 4. network file system (storage); 5. Infiniband network (networking)
Data Consumer: 1. visualization
Security & Privacy: 1. protection of PII on individuals used in modeling; 2. data protection and secure platform for computation
Lifecycle Management: 1. data quality, with the ability to capture the traceability of quality from computation
Others: --
24. M0173 — Social Contagion Modeling for Planning
Volume: 10s of TB per year
Velocity: During social unrest events, human interactions and mobility lead to rapid changes in data, e.g., who follows whom on Twitter
Variety: Data fusion is a big issue: how to combine data from different sources and how to deal with missing or incomplete data
Software: Specialized simulators, open-source software, and proprietary modeling environments; databases
Analytics: Models of behavior of humans and hard infrastructures, and their interactions; visualization of results
Data Sources: 1. traditional and new architecture for dynamic distributed processing on commodity clusters; 2. fine-resolution models and datasets to support Twitter network traffic; 3. huge data storage per year
Transformation: 1. large-scale modeling for various events (disease, emotions, behaviors, etc.); 2. scalable fusion between combined datasets; 3. multi-level analysis while generating sufficient results quickly
Capabilities: 1. computing infrastructure which can capture human-to-human interactions on various social events via the Internet (infrastructure); 2. file servers and databases (platform); 3. Ethernet and Infiniband networking (networking); 4. specialized simulators, open-source software, and proprietary modeling (application); 5. huge numbers of user accounts across country boundaries (networking)
Data Consumer: 1. multi-level detail network representations; 2. visualization with interactions
Security & Privacy: 1. protection of PII on individuals used in modeling; 2. data protection and secure platform for computation
Lifecycle Management: 1. data fusion from a variety of data sources; 2. data consistency and no corruption; 3. preprocessing of raw data
Others: 1. efficient method of moving data
25. M0141 — Biodiversity and LifeWatch
Volume: N/A
Velocity: Real-time processing and analysis in case of natural or industrial disaster
Variety: Rich variety and number of involved databases and observation data
Software: RDBMS
Analytics: Requires advanced and rich visualization
Data Sources: 1. special dedicated or overlay sensor network; 2. distributed storage, historical and trend data archiving; 3. distributed data sources, including observation and monitoring facilities, sensor networks, and satellites; 4. wide variety of data, including satellite images/information, climate and weather data, photos, video, sound recordings, etc.; 5. multi-type data combination and linkage; potentially unlimited data variety; 6. data streaming
Transformation: 1. data analysed incrementally and/or in real time at varying rates due to variations in source processes; 2. a variety of data, analytical and modeling tools to support analytics for diverse scientific communities; 3. parallel data streams and streaming analytics; 4. access and integration of multiple distributed databases
Capabilities: 1. expandable on-demand storage resource for global users; 2. cloud community resource required
Data Consumer: 1. advanced/rich/high-definition visualization; 2. 4D visualization
Security & Privacy: 1. federated identity management for mobile researchers and mobile sensors; 2. access control and accounting
Lifecycle Management: 1. data storage and archiving, data exchange and integration; 2. data lifecycle management: data provenance, referral integrity and identification, traceability back to initial observational data; 3. in addition to original source data, processed (secondary) data may be stored for future uses; 4. provenance (and persistent identification, PID) control of data, algorithms, and workflows; 5. curated (authorized) reference data (i.e., species name lists), algorithms, software code, workflows
Others: 1. access by mobile users
26. M0136 — Large-Scale Deep Learning
Volume: Current datasets typically 1 to 10 TB; training a self-driving car could take 100 million images
Velocity: Much faster than real-time processing is required; for autonomous driving, 1000s of high-resolution (6 megapixels or more) images must be processed per second
Variety: Neural net is very heterogeneous as it learns many different features
Software: In-house GPU kernels and MPI-based communication developed by Stanford; C++/Python source
Analytics: Small degree of batch statistical pre-processing; all other data analysis is performed by the learning algorithm itself
Data Sources: --
Transformation: --
Capabilities: 1. GPU; 2. high-performance MPI and HPC Infiniband cluster; 3. libraries for single-machine or single-GPU computation are available (e.g., BLAS, CuBLAS, MAGMA, etc.), but distributed computation of dense BLAS-like or LAPACK-like operations on GPUs remains poorly developed; existing solutions (e.g., ScaLAPACK for CPUs) are not well-integrated with higher-level languages and require low-level programming, which lengthens experiment and development time
Data Consumer: --
Security & Privacy: --
Lifecycle Management: --
Others: --
27. M0171 — Organizing Large-Scale, Unstructured Collections of Consumer Photos
Volume: 500+ billion photos on Facebook, 5+ billion photos on Flickr
Velocity: Over 500M images uploaded to Facebook each day
Variety: Images and metadata including EXIF tags (focal distance, camera type, etc.)
Software: Hadoop MapReduce, simple hand-written multithreaded tools (ssh and sockets for communication)
Analytics: Robust non-linear least squares optimization problem; Support Vector Machine (see the sketch after this entry)
Data Sources: 1. over 500M images uploaded to social media sites each day
Transformation: 1. classifier (e.g., a Support Vector Machine), a process that is often hard to parallelize; 2. features seen in many large-scale image processing problems
Capabilities: 1. Hadoop or enhanced MapReduce
Data Consumer: 1. visualize large-scale 3D reconstructions and navigate large-scale collections of images that have been aligned to maps
Security & Privacy: 1. preserve privacy for users and digital rights for media
Lifecycle Management: --
Others: --
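
The Support Vector Machine step named above is illustrated with scikit-learn on randomly generated feature vectors; real photo features, labels, and the parallelization issues the entry mentions are out of scope for this sketch.

# Toy SVM classifier on synthetic image-feature vectors (random data).
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Pretend each photo is summarized by a 64-dimensional feature vector.
features = rng.normal(size=(500, 64))
labels = (features[:, 0] + features[:, 1] > 0).astype(int)   # synthetic labels

X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=0)

clf = LinearSVC(C=1.0)
clf.fit(X_train, y_train)
print("held-out accuracy:", round(clf.score(X_test, y_test), 3))
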
28. M0160 — Truthy
Volume: 30 TB/year compressed data
Velocity: Near real-time data storage, querying and analysis
Variety: Schema provided by social media data source; currently using Twitter only, with plans to expand to Google+ and Facebook
Software: Hadoop IndexedHBase and HDFS; Hadoop, Hive, Redis for data management; Python (SciPy, NumPy) and MPI for data analysis
Analytics: Anomaly detection (see the sketch after this entry), stream clustering, signal classification and online learning; information diffusion, clustering, and dynamic network visualization
Data Sources: 1. distributed data sources; 2. large-volume real-time streaming; 3. raw data in compressed formats; 4. fully structured data in JSON, user metadata, geo-location data; 5. multiple data schemas
Transformation: 1. various real-time data analyses for anomaly detection, stream clustering, and signal classification on multi-dimensional time series, and online learning
Capabilities: 1. Hadoop and HDFS (platform); 2. IndexedHBase, Hive, SciPy, NumPy (software); 3. in-memory database, MPI (platform); 4. high-speed Infiniband network (networking)
Data Consumer: 1. data retrieval and dynamic visualization; 2. data-driven interactive web interfaces; 3. API for data query
Security & Privacy: 1. security and privacy policy
Lifecycle Management: 1. standardized data structures/formats with extremely high data quality
Others: 1. low-level data storage infrastructure for efficient mobile access to data
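
Stream anomaly detection, listed among the analytics above, is sketched below as a simple online detector that flags values far from an exponentially weighted running mean; this is a generic illustration, not Truthy's actual algorithms, and the counts are invented.

# Simple online anomaly detector for a one-dimensional stream (e.g., tweets/minute).
def detect_anomalies(stream, alpha=0.1, threshold=3.0):
    mean, var, flags = None, 1.0, []
    for x in stream:
        if mean is None:
            mean = x
            flags.append(False)
            continue
        z = abs(x - mean) / (var ** 0.5 + 1e-9)      # distance in running std-devs
        flags.append(z > threshold)
        mean = alpha * x + (1 - alpha) * mean        # EWMA updates
        var = alpha * (x - mean) ** 2 + (1 - alpha) * var
    return flags

counts = [12, 14, 13, 15, 14, 13, 90, 14, 12, 13]    # burst at index 6
print(detect_anomalies(counts))
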
29. M0211 — Crowd Sourcing
Volume: Gigabytes (text, surveys, experiment values) to hundreds of terabytes (multimedia)
Velocity: Data continuously updated and analyzed incrementally
Variety: So far mostly homogeneous small data sets; large distributed heterogeneous datasets expected
Software: XML technology, traditional relational databases
Analytics: Pattern recognition (e.g., speech recognition, automatic A&V analysis, cultural patterns), identification of structures (lexical units, linguistic rules, etc.)
Data Sources: --
Transformation: 1. digitize existing audio-video, photo and document archives; 2. analytics include pattern recognition of all kinds (e.g., speech recognition, automatic A&V analysis, cultural patterns) and identification of structures (lexical units, linguistic rules, etc.)
Capabilities: --
Data Consumer: --
Security & Privacy: 1. privacy issues in preserving anonymity of responses in spite of computer recording of access ID and reverse engineering of unusual user responses
Lifecycle Management: --
Others: --
30. M0158 — CINET
Volume: Can be hundreds of GB for a single network; 1000-5000 networks and methods
Velocity: Dynamic networks; network collection growing
Variety: Many types of networks
Software: Graph libraries (Galib, NetworkX); distributed workflow management (Simfrastructure, databases, semantic web tools)
Analytics: Network visualization
Data Sources: 1. a set of network topology files to study graph-theoretic properties and behaviors of various algorithms; 2. asynchronous and real-time synchronous distributed computing
Transformation: 1. environments to run various network and graph analysis tools; 2. dynamic growth of the networks; 3. asynchronous and real-time synchronous distributed computing; 4. different parallel algorithms for different partitioning schemes for efficient operation
Capabilities: 1. large file system (storage); 2. various network connectivity (networking); 3. existing computing cluster; 4. EC2 computing cluster; 5. various graph libraries, management tools, databases, semantic web tools
Data Consumer: 1. client-side visualization
Security & Privacy: --
Lifecycle Management: --
Others: --
31. M0190 — NIST IAD
Volume: >900M web pages occupying 30 TB of storage, 100M tweets, 100M ground-truthed biometric images, 100,000s of partially ground-truthed video clips, and terabytes of smaller fully ground-truthed test collections
Velocity: Most legacy evaluations are focused on retrospective analytics; newer evaluations are focusing on simulations of real-time analytic challenges from multiple data streams
Variety: Wide variety of data types including textual search/extraction, machine translation, speech recognition, image and voice biometrics, object and person recognition and tracking, document analysis, human-computer dialogue, and multimedia search/extraction
Software: Perl, Python, C/C++, Matlab, R development tools; create ground-up test and measurement applications
Analytics: Information extraction, filtering, search, and summarization; image and voice biometrics; speech recognition and understanding; machine translation; video person/object detection and tracking; event detection; imagery/document matching; novelty detection; structural semantic temporal analytics
Data Sources: 1. large amounts of semi-annotated web pages, tweets, images, video; 2. scaling ground-truthing to larger data, intrinsic and annotation uncertainty measurement, performance measurement for incompletely annotated data, measuring analytic performance for heterogeneous data and analytic flows involving users
Transformation: 1. analytic algorithms working with written language, speech, human imagery, etc. must generally be tested against real or realistic data; it is extremely challenging to engineer artificial data that sufficiently captures the variability of real data involving humans
Capabilities: 1. Perl, Python, C/C++, Matlab, R development tools; create ground-up test and measurement applications
Data Consumer: 1. analytic flows involving users
Security & Privacy: 1. security requirements for protecting sensitive data while enabling meaningful developmental performance evaluation; shared evaluation testbeds must protect the intellectual property of analytic algorithm developers
Lifecycle Management: --
Others: --
32. M0130 — DataNet
Volume: Petabytes, hundreds of millions of files
Velocity: Real time and batch
Variety: Rich
Software: Integrated Rule-Oriented Data System (iRODS)
Analytics: Supports general analysis workflows
Data Sources: 1. process key format types NetCDF, HDF5, DICOM; 2. real-time and batch data
Transformation: 1. needs to provide general analytics workflows
Capabilities: 1. iRODS data management software; 2. interoperability across storage and network protocol types
Data Consumer: 1. general visualization workflows
Security & Privacy: 1. federation across existing authentication environments through Generic Security Service API and Pluggable Authentication Modules (GSI, Kerberos, InCommon, Shibboleth); 2. access controls on files independent of the storage location
Lifecycle Management: --
Others: --
33. M0163 — The Discinnet Process
Volume: Small as metadata to Big Data
Velocity: Real time
Variety: Can tackle arbitrary Big Data
Software: Symfony-PHP, Linux, MySQL
Analytics: X
Data Sources: 1. integration of metadata approaches across disciplines
Transformation: --
Capabilities: 1. software: Symfony-PHP, Linux, MySQL
Data Consumer: --
Security & Privacy: 1. significant but optional security and privacy, including secure servers and anonymization
Lifecycle Management: 1. integration of metadata approaches across disciplines
Others: --
34. M0131 — Semantic Graph-Search
Volume: Few terabytes
Velocity: Evolving in time
Variety: Rich
Software: Database
Analytics: Data graph processing
Data Sources: 1. all data types, image to text, structures to protein sequence
Transformation: 1. data graph processing; 2. RDBMS
Capabilities: 1. cloud community resource required
Data Consumer: 1. efficient data-graph-based visualization is needed
Security & Privacy: --
Lifecycle Management: --
Others: --
35. M0189 — Light Source Beamlines
Volume: 50-400 GB per day; total ~400 TB
Velocity: Continuous stream of data, but analysis need not be real time
Variety: Images
Software: Octopus for tomographic reconstruction, Avizo (http://vsg3d.com) and FIJI (a distribution of ImageJ)
Analytics: Volume reconstruction, feature identification, etc.
Data Sources: 1. multiple streams of real-time data to be stored and analyzed later; 2. sample data to be analyzed in real time
Transformation: 1. standard bioinformatics tools (BLAST, HMMER, multiple alignment and phylogenetic tools, gene callers, sequence feature predictors, etc.), Perl/Python wrapper scripts, Linux cluster scheduling
Capabilities: 1. high-volume data transfer to remote batch processing resource
Data Consumer: --
Security & Privacy: 1. multiple security and privacy requirements to be satisfied
Lifecycle Management: --
Others: --
36. M0170 — Catalina Real-Time Transient Survey
Volume: ~100 TB total, increasing by 0.1 TB a night, accessing PBs of base astronomy data; successor LSST will take 30 TB a night in the 2020's
Velocity: Nightly update runs processed in real time
Variety: Images, spectra, time series, catalogs
Software: Custom data processing pipeline and data analysis software
Analytics: Detection of rare events and relation to existing diverse data
Data Sources: 1. ~0.1 TB per day at present, will increase by a factor of 100
Transformation: 1. a wide variety of the existing astronomical data analysis tools, plus a large amount of custom-developed tools and software, some of it a research project in itself; 2. automated classification with machine learning tools given the very sparse and heterogeneous data, dynamically evolving in time as more data come in, with follow-up decision making reflecting limited follow-up resources
Capabilities: --
Data Consumer: 1. visualization mechanisms for highly dimensional data parameter spaces
Security & Privacy: --
Lifecycle Management: --
Others: --
37. M0185 — DOE Extreme Data
Volume: Several petabytes from the Dark Energy Survey and the Zwicky Transient Facility; simulations > 10 PB
Velocity: Analysis done in batch mode, with data from observations and simulations updated daily
Variety: Image and simulation data
Software: MPI, FFTW, viz packages, numpy, Boost, OpenMP, ScaLAPACK, PSQL and MySQL databases, Eigen, cfitsio, astrometry.net, and Minuit2
Analytics: New analytics needed to analyze simulation results
Data Sources: 1. ~1 PB per year, becoming 7 PB a year, of observational data
Transformation: 1. interpretation of results from detailed simulations requires advanced analysis and visualization techniques and capabilities
Capabilities: 1. MPI, OpenMP, C, C++, F90, FFTW, viz packages, python, numpy, Boost, ScaLAPACK, PSQL and MySQL databases, Eigen, cfitsio, astrometry.net, and Minuit2; 2. supercomputer I/O subsystem limitations must be addressed
Data Consumer: 1. interpretation of results using advanced visualization techniques and capabilities
Security & Privacy: --
Lifecycle Management: --
Others: --
38. M0209 — Large Survey Data for Cosmology
Volume: Dark Energy Survey will take PBs of data
Velocity: 400 images of one gigabyte in size per night
Variety: Images
Software: Linux cluster, Oracle RDBMS server, Postgres PSQL, large memory machines, standard Linux interactive hosts, GPFS; for simulations, HPC resources; standard astrophysics reduction software as well as Perl/Python wrapper scripts
Analytics: Machine learning to find optical transients; Cholesky decomposition for thousands of simulations with matrices of order 1M on a side (see the sketch after this entry), and parallel image storage
Data Sources: 1. 20 TB of data per day
Transformation: 1. analysis on both the simulation and observational data simultaneously; 2. techniques for handling Cholesky decomposition for thousands of simulations with matrices of order 1M on a side
Capabilities: 1. standard astrophysics reduction software as well as Perl/Python wrapper scripts; 2. Oracle RDBMS, Postgres psql, as well as GPFS and Lustre file systems and tape archives; 3. parallel image storage
Data Consumer: --
Security & Privacy: --
Lifecycle Management: 1. links between remote telescopes and central analysis sites
Others: --
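
The Cholesky-based computation above can be shown at small scale: factor a symmetric positive-definite matrix once, then reuse the factor to solve many right-hand sides. The sketch uses n = 500 with synthetic data; the production runs involve matrices of order ~1M on HPC resources.

# Cholesky factor-and-reuse sketch with synthetic data.
import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(1)
n = 500
A = rng.normal(size=(n, n))
cov = A @ A.T + n * np.eye(n)              # symmetric positive-definite toy matrix

L = np.linalg.cholesky(cov)                # cov = L @ L.T, computed once

def solve_with_factor(L, b):
    """Solve cov @ x = b via two triangular solves, reusing the cached factor."""
    y = solve_triangular(L, b, lower=True)         # forward substitution
    return solve_triangular(L.T, y, lower=False)   # back substitution

b = rng.normal(size=n)
x = solve_with_factor(L, b)
print(np.allclose(cov @ x, b))             # True: factor reused per right-hand side
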
39. M0166 — Particle Physics
Volume: 15 PB of data (experiment and Monte Carlo combined) per year
Velocity: Data updated continuously with sophisticated real-time selection and test analysis, but all analyzed "properly" offline
Variety: Each stage in analysis has a different format, but data is uniform within each analysis stage
Software: Grid-based environment with over 350,000 cores running simultaneously
Analytics: Sophisticated specialized data analysis code followed by basic exploratory statistics (histograms) with complex detector efficiency corrections
Data Sources: 1. real-time data from accelerator and analysis instruments; 2. asynchronous data collection; 3. calibration of instruments
Transformation: 1. experimental data from ALICE, ATLAS, CMS, LHCb; 2. histograms, scatter plots with model fits; 3. Monte Carlo computations
Capabilities: 1. legacy computing infrastructure (computing nodes); 2. distributed cached files (storage); 3. object databases (sw pkg)
Data Consumer: 1. histograms and model fits (visual)
Security & Privacy: 1. data protection
Lifecycle Management: 1. data quality on complex apparatus
Others: --
40. M0210 — Belle II High Energy Physics Experiment
Volume: Eventually 120 PB of Monte Carlo and observational data
Velocity: Data updated continuously with sophisticated real-time selection and test analysis, but all analyzed "properly" offline
Variety: Each stage in analysis has a different format, but data is uniform within each analysis stage
Software: Will use DIRAC Grid software
Analytics: Sophisticated specialized data analysis code followed by basic exploratory statistics (histograms) with complex detector efficiency corrections
Data Sources: 1. 120 PB raw data
Transformation: --
Capabilities: 1. 120 PB raw data; 2. international distributed computing model to augment that at the accelerator (Japan); 3. data transfer of ~20 Gbps at design luminosity between Japan and the US; 4. software from Open Science Grid, Geant4, DIRAC, FTS, Belle II framework
Data Consumer: --
Security & Privacy: 1. standard Grid authentication
Lifecycle Management: --
Others: --
41. M0155 — EISCAT 3D Incoherent Scatter Radar System
Volume: Terabytes per year today, but 40 PB/year starting ~2022
Velocity: Data updated continuously with real-time test analysis and batch full analysis
Variety: Big data, uniform
Software: Custom analysis based on flat-file data storage
Analytics: Pattern recognition, demanding correlation routines, high-level parameter extraction
Data Sources: 1. remote sites generating 40 PB of data per year by 2022; 2. HDF5 data format; 3. visualization of high (>=5) dimension data
Transformation: 1. Queen Bea architecture with a mix of distributed on-sensor and central processing for 5 distributed sites; 2. real-time monitoring of equipment by partial streaming analysis; 3. needs to host a rich set of radar image processing services using machine learning, statistical modelling, and graph algorithms
Capabilities: 1. architecture compatible with the ENVRI Environmental Research Infrastructure collaboration
Data Consumer: 1. needs to support visualization of high (>=5) dimension data
Security & Privacy: --
Lifecycle Management: 1. preservation of data and avoidance of data loss due to instrument malfunction
Others: 1. needs to support real-time monitoring of equipment by partial streaming analysis
42. M0157 — ENVRI
Volume: Apart from EISCAT 3D given above, these are low volume; one system, EPOS, is ~15 TB/year
Velocity: Mainly real-time data streams
Variety: This is 6 separate projects with a common architecture for infrastructure, so data is very diverse across projects
Software: R and Python (Matplotlib) for visualization; custom software for processing
Analytics: Data assimilation, (statistical) analysis, data mining, data extraction, scientific modeling and simulation, scientific workflow
Data Sources: 1. huge-volume real-time distributed data sources; 2. variety of instrumentation datasets and metadata
Transformation: 1. diversified analytics tools
Capabilities: 1. variety of computing infrastructures and architectures (infrastructure); 2. scattered repositories (storage)
Data Consumer: 1. graph plotting tools; 2. time-series interactive tools; 3. browser-based flash playback; 4. earth high-resolution map display; 5. visual tools for quality comparisons
Security & Privacy: 1. open data policy with minor restrictions
Lifecycle Management: 1. high quality data; 2. mirror archives; 3. various metadata frameworks; 4. scattered repositories and data curation
Others: 1. various kinds of mobile sensor devices for data acquisition
43. M0167 — CReSIS
Volume: Current data around a PB, increasing by 50-100 TB per mission; future expeditions ~PB each
Velocity: Data taken in ~2-month missions including test analysis and then later batch processing
Variety: Raw data, images with final layer data used for science
Software: Matlab for custom raw data processing; custom image processing software; user interface is a Geographical Information System
Analytics: Custom signal processing to produce radar images that are analyzed by image processing to find layers
Data Sources: 1. needs to provide reliable data transmission from aircraft sensors/instruments or removable disks from remote sites; 2. data gathering in real time; 3. varieties of datasets
Transformation: 1. legacy software (Matlab) and language (C/Java) bindings for processing; 2. needs signal processing and advanced image processing to find layers
Capabilities: 1. ~0.5 petabytes/year of raw data; 2. transfer of content from removable disks to a computing cluster for parallel processing; 3. MapReduce or MPI plus language bindings for C/Java
Data Consumer: 1. GIS user interface; 2. rich user interface for simulations
Security & Privacy: 1. security and privacy on politically sensitive issues; 2. dynamic security and privacy policy mechanisms
Lifecycle Management: 1. data quality assurance
Others: 1. monitoring of data collection instruments/sensors
44. M0127 — UAVSAR Data Processing
Volume: Raw data 110 TB and 40 TB processed, plus smaller samples
Velocity: Data comes from aircraft and so is incrementally added; data occasionally gets reprocessed with new processing methods or parameters
Variety: Image and annotation files
Software: ROI_PAC, GeoServer, GDAL, GeoTIFF-supporting tools; moving to clouds
Analytics: Process raw data to get images which are run through image processing tools and accessed from GIS
Data Sources: 1. angular as well as spatial data; 2. compatibility with other NASA radar systems and repositories (Alaska Satellite Facility)
Transformation: 1. geolocated data requires GIS integration of data as custom overlays; 2. significant human intervention in data processing pipeline; 3. host rich set of radar image processing services; 4. ROI_PAC, GeoServer, GDAL, GeoTIFF-supporting tools
Capabilities: 1. interoperable cloud-HPC architecture should be supported; 2. host rich set of radar image processing services; 3. ROI_PAC, GeoServer, GDAL, GeoTIFF-supporting tools; 4. compatibility with other NASA radar systems and repositories (Alaska Satellite Facility)
Data Consumer: 1. needs to support field expedition users with phone/tablet interface and low-resolution downloads
Security & Privacy: --
Lifecycle Management: 1. significant human intervention in data processing pipeline; 2. rich robust provenance defining complex machine/human processing
Others: 1. needs to support field expedition users with phone/tablet interface and low-resolution downloads
45. M0182 — NASA LARC/GSFC iRODS
Volume: The MERRA collection (below) represents most of the total data; there are other smaller collections
Velocity: Periodic updates every 6 months
Variety: Many applications combine MERRA reanalysis data with other reanalyses and observational data such as CERES
Software: SGE Univa Grid Engine Version 8.1, iRODS version 3.2 and/or 3.3, IBM General Parallel File System (GPFS) version 3.4, Cloudera version 4.5.2-1
Analytics: Federation software
Data Sources: 1. federate distributed heterogeneous datasets
Transformation: 1. Climate Analytics as a Service on clouds
Capabilities: 1. support virtual Climate Data Server (vCDS); 2. GPFS Parallel File System integrated with Hadoop; 3. iRODS
Data Consumer: 1. needs to support visualization of distributed heterogeneous data
Security & Privacy: --
Lifecycle Management: --
Others: --
46. M0129 — MERRA
Volume: MERRA is 480 TB
Velocity: Increases at ~1 TB/month
Variety: Applications combine MERRA reanalysis data with other reanalyses and observational data
Software: Cloudera, iRODS, Amazon AWS
Analytics: Climate Analytics-as-a-Service (CAaaS)
Data Sources: 1. integrate simulation output and observational data, NetCDF files; 2. both real-time and batch mode needed; 3. interoperable use of Amazon AWS and local clusters; 4. iRODS data management
Transformation: 1. Climate Analytics as a Service on clouds
Capabilities: 1. NetCDF-aware software; 2. MapReduce; 3. interoperable use of Amazon AWS and local clusters
Data Consumer: 1. high-end distributed visualization
Security & Privacy: --
Lifecycle Management: --
Others: 1. smartphone and tablet access required; 2. iRODS data management
47. M0090 — Atmospheric Turbulence
Volume: 200 TB (current), 500 TB within 5 years
Velocity: Data analyzed incrementally
Variety: Reanalysis datasets are inconsistent in format, resolution, semantics, and metadata; likely each of these input streams will have to be interpreted/analyzed into a common product
Software: MapReduce or the like; SciDB or other scientific database
Analytics: Data mining customized for specific event types
Data Sources: 1. real-time distributed datasets; 2. various formats, resolutions, semantics, and metadata
Transformation: 1. MapReduce, SciDB, and other scientific databases; 2. continuous computing for updates; 3. event-specification language for data mining and event searching; 4. semantic interpretation and optimal structuring for 4-dimensional data mining and predictive analysis
Capabilities: 1. other legacy computing systems (e.g., supercomputer); 2. high-throughput data transmission over the network
Data Consumer: 1. visualization to interpret results
Security & Privacy: --
Lifecycle Management: 1. validation of output products (correlations)
Others: --
48. M0186 — Climate Studies
Volume: Up to 30 PB/year from 15 end-to-end simulations at NERSC; more at other HPC centers
Velocity: 42 GB/sec from simulations
Variety: Variety across simulation groups and between observation and simulation
Software: NCAR PIO library and utilities NCL and NCO, parallel NetCDF
Analytics: Need analytics next to data storage
Data Sources: 1. ~100 PB of data in 2017, streaming at high data rates from large supercomputers across the world; 2. integrate large-scale distributed data from simulations with diverse observations; 3. link diverse data to novel HPC simulation
Transformation: 1. data analytics close to data storage
Capabilities: 1. extend architecture to several other fields
Data Consumer: 1. share data with worldwide climate; 2. high-end distributed visualization
Security & Privacy: --
Lifecycle Management: --
Others: 1. phone-based input and access
49. M0183 — DOE-BER Subsurface Biogeochemistry
Volume: N/A
Velocity: N/A
Variety: From 'omics of the microbes in the soil to watershed hydro-biogeochemistry; from observation to simulation
Software: PFLOTRAN, postgres, HDF5, Akuna, NEWT, etc.
Analytics: Data mining, data quality assessment, cross-correlation across datasets, reduced model development, statistics, quality assessment, data fusion
Data Sources: 1. heterogeneous diverse data with different domains and scales; translation across diverse datasets that cross domains and scales; 2. synthesis of diverse and disparate field, laboratory, omic, and simulation datasets across different semantic, spatial, and temporal scales; 3. link diverse data to novel HPC simulation
Transformation: --
Capabilities: 1. postgres, HDF5 data technologies and many custom software systems
Data Consumer: 1. phone-based input and access
Security & Privacy: --
Lifecycle Management: --
Others: 1. phone-based input and access
50. M0184 — DOE-BER AmeriFlux and FLUXNET Networks
Volume: N/A
Velocity: Streaming data from ~150 towers in AmeriFlux and over 500 towers distributed globally collecting flux measurements
Variety: Flux data needs to be merged with biological, disturbance, and other ancillary data
Software: EddyPro, custom analysis software, R, python, neural networks, Matlab
Analytics: Data mining, data quality assessment, cross-correlation across datasets, data assimilation, data interpolation, statistics, quality assessment, data fusion
Data Sources: 1. heterogeneous diverse data with different domains and scales; translation across diverse datasets that cross domains and scales; 2. link to many other environment and biology datasets; 3. link to HPC climate and other simulations; 4. link to European data sources and projects; 5. access data from 500 distributed sources
Transformation: 1. custom software: EddyPro, custom analysis software, R, python, neural networks, Matlab
Capabilities: 1. custom software like EddyPro and analysis software like R, python, neural networks, Matlab; 2. analytics includes data mining, data quality assessment, cross-correlation across datasets, data assimilation, data interpolation, statistics, quality assessment, data fusion, etc.
Data Consumer: 1. phone-based input and access
Security & Privacy: --
Lifecycle Management: --
Others: 1. phone-based input and access
51. M0223 — Consumption Forecasting in Smart Grids
Volume: 4 TB a year for a city with 1.4M sensors, like Los Angeles
Velocity: Streaming data from million(s) of sensors
Variety: Tuple-based: time series, database rows; graph-based: network topology, customer connectivity; some semantic data for normalization
Software: R/Matlab, Weka, Hadoop; GIS-based visualization
Analytics: Forecasting models (see the sketch after this entry), machine learning models, time series analysis, clustering, motif detection, complex event processing, visual network analysis
Data Sources: 1. diverse data from smart grid sensors, city planning, weather, utilities; 2. data updated every 15 minutes
Transformation: 1. new machine learning analytics to predict consumption
Capabilities: 1. SQL databases, CSV files, HDFS (platform); 2. R/Matlab, Weka, Hadoop (platform)
Data Consumer: --
Security & Privacy: 1. privacy and anonymization by aggregation
Lifecycle Management: --
Others: 1. mobile access for clients
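
Consumption forecasting can be sketched with a small autoregressive model fit by least squares; the series below is synthetic, and a production smart-grid forecaster would add weather, calendar, and topology features as the entry describes.

# Minimal autoregressive (AR) forecasting sketch on a synthetic consumption series.
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(600)                                   # e.g., 15-minute readings
series = 50 + 10 * np.sin(2 * np.pi * t / 96) + rng.normal(0, 1, t.size)

def fit_ar(y, lags=4):
    """Fit y[t] ~ c + sum_i a_i * y[t-i] by ordinary least squares."""
    rows = [y[i - lags:i][::-1] for i in range(lags, len(y))]
    X = np.hstack([np.ones((len(rows), 1)), np.array(rows)])
    coeffs, *_ = np.linalg.lstsq(X, y[lags:], rcond=None)
    return coeffs

def forecast(y, coeffs, steps=8):
    history = list(y[-(len(coeffs) - 1):])           # last `lags` observations
    out = []
    for _ in range(steps):
        x = np.concatenate(([1.0], history[::-1]))   # [1, y[t-1], ..., y[t-lags]]
        nxt = float(coeffs @ x)
        out.append(nxt)
        history = history[1:] + [nxt]
    return out

coeffs = fit_ar(series, lags=4)
print([round(v, 2) for v in forecast(series, coeffs, steps=8)])
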

 

Upcoming/Past Events

2nd NIST Big Data Workshop, NIST, June 1 & 2, 2017
IEEE NBD-PWG Workshop, October 27, 2014
1st NIST Big Data Workshop, NIST, September 30, 2013

NIST is an agency of the U.S. Commerce Department

Please send feedback to: bigdatainfo@nist.gov
Last updated: January 11, 2017
