NIST Big Data Program

 

Use Cases and Requirements -- General + Detail
Use Cases V1.0 Submission
(see M0180 for the full requirements package and M0203 for the full high-level use case descriptions)


Data Sources
General Requirements
1. needs to support reliable real-time, asynchronous, streaming, and batch processing to collect data from centralized, distributed, and cloud data sources, sensors, or instruments
(28: M0078, M0090, M0103, M0127, M0129, M0140, M0141, M0147, M0148, M0157, M0160, M0160, M0162, M0165, M0166, M0166, M0167, M0172, M0173, M0174, M0176, M0177, M0183, M0184, M0186, M0188, M0191, M0215)
2. needs to support slow, bursty, and high throughput data transmission between data sources and computing clusters.
(22: M0078, M0148, M0155, M0157, M0162, M0165, M0167, M0170, M0171, M0172, M0174, M0176, M0177, M0184, M0185, M0186, M0188, M0191, M0209, M0210, M0219, M0223)
3. needs to support diversified data content, including structured and unstructured text, document, graph, web, geospatial, compressed, timed, spatial, multimedia, simulation, and instrument data.
(28: M0089, M0090, M0140, M0141, M0147, M0148, M0155, M0158, M0160, M0161, M0162, M0165, M0166, M0167, M0171, M0172, M0173, M0177, M0183, M0184, M0186, M0188, M0190, M0191, M0213, M0214, M0215, M0223)
M0147
1. 380 TB of scanned documents in centralized storage [centralized]
Volume: 380TB
Velocity: Static for 75 years
Variety: Scanned Documents
Requirements:
1. needs to support a large document format from centralized storage
M0148
1. Distributed data sources from federal agencies
2. Hundreds of terabytes, and growing
3. bursty data can arrive in batches of size ranging from GB to hundreds of TB
4. Variety data types, unstructured and structured data: textual documents, emails, photos, scanned documents, multimedia, social networks, web sites, databases, etc.
5. Data sources may be distributed in different clouds in future.
Volume: Hundreds of terabytes, and growing.
Velocity: data loaded in batches so bursty
Variety: unstructured and structured data: textual documents, emails, photos, scanned documents, multimedia, social networks, web sites, databases, etc.
Requirements:
1. needs to support distributed data sources
2. needs to support large data storage
3. needs to support bursty data ranging from GB to hundreds of TB
4. needs to support wide variety of data formats including unstructured and structured data
5. needs to support distributed data sources in different clouds
M0219
1. see use case
Volume: Approximately 1PB
Velocity: Variable; field data is streamed continuously, and the Census transmitted ~150 million records.
Variety: Strings and numerical data
Requirements:
1. needs to support data size approximately one petabyte
M0175
1. see use case
Volume: N/A
Velocity: Real Time
Variety: No
Requirements:
1. needs to support real time ingestion of data
M0161
1. ~400M documents, roughly 80M unique documents; receives 5-700k new uploads on a weekday
2. PDF documents and log files of social network and client activities, some image, spreadsheet, and presentation files
Volume: 15TB presently, growing about 1 TB/month
Velocity: Currently Hadoop batch jobs are scheduled daily. Future is real-time recommendation
Variety: PDF documents and log files of social network and client activities
Requirements:
1. needs to support file-based documents with constant new uploads
2. needs to support a variety of file types such as PDFs, social network log files, client activity logs, images, spreadsheets, and presentation files
M0164
1. Data varies from digital media to user rankings, user profiles and media properties for content-based recommendations
Volume: Summer 2012. 25 million subscribers; 4 million ratings per day; 3 million searches per day; 1 billion hours streamed in June 2012. Cloud storage 2 petabytes (June 2013)
Velocity: Media (video and properties) and Rankings continually updated
Variety: Data varies from digital media to user rankings, user profiles and media properties for content-based recommendations
Requirements:
1. needs to support user profiles and ranking info
M0165
1. Distributed websites
2. 45B web pages, 500M photos uploaded each day
3. 100 hours of video uploaded to YouTube each minute
Volume: 45B web pages total, 500M photos uploaded each day, 100 hours of video uploaded to YouTube each minute
Velocity: Real Time updating and real time response to queries
Variety: Multiple media
Requirements:
1. Needs to support distributed data sources
2. Needs to support streaming data
3. Needs to support multimedia content
M0103
1. Centralized today
Volume: N/A
Velocity: Needs to become real-time. Currently updated at events
Variety: Event based
Requirements:
1. needs to support centralized and real time distributed sites/sensors
M0162
1. Extremely distributed with data repositories existing only for a very few fundamental properties, over 500,000 commercial materials
2. Many data sets and virtually no standards for mashups, materials are changing all the time, and new materials data are constantly being generated to describe the new materials
3. Numbers, graphical, images
Volume: 500,000 material types in 1980's. Much growth since then
Velocity: Ongoing increase in new materials
Variety: Many datasets with no standards
Requirements:
1. needs to support distributed data repositories for more than 500,000 commercial materials
2. needs to support many varieties of datasets
3. needs to support text, graphics, and images
M0176
1. Data streams from simulation at centralized peta/exascale systems
2. Widely distributed web of dataflows from central gateway to users
Volume: 100TB (current), 500TB within 5 years. Scalable key-value and object store databases needed.
Velocity: Simulations add data regularly
Variety: Varied data and simulation results
Requirements:
1. needs to support data streams from peta/exascale centralized simulation systems
2. needs to support distributed web dataflows from central gateway to users
M0213
1. see use case
Volume: Imagery -- 100s of Terabytes. Vector Data -- 10s of Gigabytes but billions of points
Velocity: Vectors transmitted in Near Real Time
Variety: Imagery. Vector (various formats shape files, KML, text streams) and many object structures
Requirements:
1. needs to support geospatial data, which requires unique approaches to indexing and distributed analysis.
M0214
1. see use case
Volume: FMV -- 30-60 frames per second at full-color 1080P resolution. WALF -- 1-10 frames per second at 10Kx10K full-color resolution
Velocity: Real Time
Variety: A few standard imagery or video formats.
Requirements:
1. needs to support real-time data: FMV – 30-60 frames per second at full-color 1080P resolution and WALF – 1-10 frames per second at 10Kx10K full-color resolution.
M0215
1. see use case
2. see use case
3. see use case
Volume: 10s of Terabytes to 100s of Petabytes. Individual warfighters (first responders) would have at most 1-100s of Gigabytes
Velocity: Much Real Time. Imagery intelligence device gathering petabyte in a few hours
Variety: Text files, raw media, imagery, video, audio, electronic data, human generated data
Requirements:
1. needs to support mostly real-time data, with processing at worst in near real time
2. needs to support data that currently exists in disparate silos and must be made accessible through a semantically integrated data space
3. needs to support diverse data including text files, raw media, imagery, video, audio, electronic data, and human-generated data.
M0177
1. Clinical data from more than 1,100 discrete logical, operational healthcare sources
2. Volume: Data for more than 12 million patients, more than 4 billion discrete clinical observations. > 20 TB raw data
3. Velocity: Between 500,000 and 1.5 million new real-time clinical transactions per day
4. Variety: Wide variety of clinical data types including numeric, structured numeric, free-text, structured text, discrete nominal, discrete ordinal, discrete structured, binary large blobs (images and video).
5. Variability: the clinical and biological concept space is constantly evolving
6. As patients receive care in a variety of clinical settings, there is a need to integrate and rationalize data across sources.
Volume: 12 million patients, more than 4 billion discrete clinical observations. > 20 TB raw data
Velocity: 0.5-1.5 million new real-time clinical transactions added per day.
Variety: Broad variety of data from doctors, nurses, laboratories and instruments
Requirements:
1. needs to support heterogeneous, high-volume, diverse data sources
2. needs to support volume: > 12 million entities (patients), > 4 billion records or data points (discrete clinical observations), aggregate of > 20 TB raw data
3. needs to support velocity: 500,000 - 1.5 million new transactions per day
4. needs to support variety: Formats include numeric, structured numeric, free-text, structured text, discrete nominal, discrete ordinal, discrete structured, binary large blobs (images and video).
5. needs to support data that evolves over time in a highly variable fashion
6. needs to support a comprehensive and consistent view of data across sources, and over time
M0089
1. High resolution spatial objects in digitized pathology images from human tissues
2. Depend on pre-processing of tissue slides such as chemical staining and quality of image analysis algorithms
3. Raw images are whole slide images (mostly based on BIGTIFF), and analytical results are structured data (spatial boundaries and features)
4. Image analysis, spatial queries and analytics, feature clustering and classification
Volume: 1GB raw image data + 1.5GB analytical results per 2D image; 1TB raw image data + 1TB analytical results per 3D image. 1PB data per moderated hospital per year,
Velocity: Once generated, data will not be changed
Variety: Images
Requirements:
1. needs to support high resolution spatial digitized pathology images
2. needs to support various image quality analysis algorithms
3. needs to support various image data formats especially BIGTIFF with structured data for analytical results
4. needs to support image analysis, spatial queries and analytics, feature clustering and classification
M0191
1. Distributed experimental sources of multi-model higher resolution bioimages (instruments)
2. 50TB here now, but currently over a petabyte overall. A single scan on emerging machines is 32TB.
3. Biological samples are highly variable and their analysis workflows must cope with wide variation.
Volume: medical diagnostic imaging is annually around 70 PB. A single scan on emerging machines is 32TB
Velocity: Volume of data acquisition requires HPC back end
Variety: Multi-modal imaging with disparate channels of data
Requirements:
1. needs to support distributed multi-modal, high-resolution experimental sources of bioimages (instruments)
2. needs to support 50TB of data; data formats include images
M0078
1. DNA sequencers can generate ~300GB compressed data/day and sequencing technologies have evolved very rapidly, and new technologies are on the horizon
2. Sequencers are distributed across many laboratories, though some core facilities exist
3. File formats not well-standardized, though some standards exist. Generally structured data
Volume: >100TB in 1-2 years at NIST; Healthcare community will have many PBs
Velocity: DNA sequencers can generate ~300GB compressed data/day
Variety: File formats not well-standardized, though some standards exist. Generally structured data.
Requirements:
1. needs to support high throughput compressed data (300GB/day) from various DNA sequencers
2. needs to support distributed data source (sequencers)
3. needs to support various file formats, either structured or unstructured data
M0188
1. Data is centralized in multiple sites.
2. proteins and their structural features, core genomic data, new types of “Omics” data such as transcriptomics, methylomics, and proteomics describing gene expression under a variety of conditions must be incorporated into the comparative analysis system.
3. Front end web UI must be real time interactive. Back end data loading processing must keep up with exponential growth of sequence data due to the rapid drop in cost of sequencing technology.
4. Biological data is inherently heterogeneous, complex, structural, and hierarchical.
5. The sizes of metagenomic samples can vary by several orders of magnitude, such as several hundred thousand genes to a billion genes (e.g., latter in a complex soil sample).
Volume: 50TB
Velocity: New sequencers stream in data at growing rate
Variety: Biological data is inherently heterogeneous, complex, structural, and hierarchical. Besides core genomic data, new types of 'Omics' data such as transcriptomics, methylomics, and proteomics
Requirements:
1. needs to support multiple centralized data sources
2. needs to support proteins and their structural features, core genomic data, new types of “Omics” data such as transcriptomics, methylomics, and proteomics describing gene expression
3. needs to support a real-time interactive front-end web UI; back-end data loading and processing must keep up with the exponential growth of sequence data due to the rapid drop in the cost of sequencing technology
4. needs to support heterogeneous, complex, structural, and hierarchical biological data
5. needs to support metagenomic samples that can vary by several orders of magnitude, such as several hundred thousand genes to a billion genes
M0140
1. Distributed EHR data
2. over 5 million patients with thousands of properties each and many more that are derived from primary values.
3. The number of property values could range from less than 100 (new patient) to more than 100,000 (long term patient) with typical patients composed of 100 CV values and 1000 continuous values.
4. Data not real-time but updated periodically. Most values are time based, i.e. a timestamp is recorded with the value at the time of observation.
5. Structured data, a patient has controlled vocabulary (CV) property values (demographics, diagnostic codes, medications, procedures, etc.) and continuous property values (lab tests, medication amounts, vitals, etc.).
6. Data consists of text, and Continuous Numerical values.
7. Data will be updated or added during each patient visit.
Volume: 5 million patients
Velocity: not real-time but updated periodically
Variety: Each patient typically has 100 Controlled Vocabulary values and 1000 continuous values. Most values are time stamped.
Requirements:
1. needs to support distributed EHR data
2. needs to support over 5 million patients with thousands of properties each and many more that are derived from primary values.
3. needs to support, for each record, a range of 100 to 100,000 data property values, with an average of 100 controlled vocabulary values and 1,000 continuous values
4. needs to support non-real-time data updated periodically. Data is timestamped with the time of observation [the time that the value is recorded]
5. needs to support structured data about a patient falling into two main categories: data with controlled vocabulary [CV] property values, and continuous property values [which are recorded / captured more frequently]
6. needs to support data consisting of text and continuous numerical values
M0174
1. All the data about the users reside in a single disk file. Sometimes, resources such as published text need to be pulled from internet.
2. can be in 100s of GBs for a single cohort of a few hundred people. When dealing with millions of patients, this can be in the order of 1 petabyte.
3. In some cases, EHRs are constantly being updated. In other controlled studies, the data often comes in batches in regular intervals.
4. large, multi-modal, longitudinal data
5. Variety is key property in medical data sets… typically multiple tables need to be merged in order to perform the analysis
6. The arrival of data is unpredictable in many cases as they arrive in real-time.
Volume: 100s of GBs for a single cohort of a few hundred people. When dealing with millions of patients, this can be in the order of 1 petabyte.
Velocity: Electronic Health Records can be constantly updated. In other controlled studies, the data often comes in batches in regular intervals.
Variety: Critical Feature. Data is typically in multiple tables and need to be merged in order to perform the analysis.
Requirements:
1. needs to support centralized data, with some data retrieved from internet sources
2. needs to support a range from 100s of GB sample size to 1 petabyte for very large studies
3. needs to support both constant updates / additions [to subset of data] and scheduled batch inputs
4. needs to support large, multi-modal, longitudinal data
5. needs to support rich relational data comprised of multiple tables, different data types such as imaging, EHR, demographic, genetic and natural language data requiring rich representation
6. needs to support unpredictable arrival rates, in many cases data arrive in real-time
M0172
1. Generated from synthetic population generator. Currently centralized. However, could be made distributed as part of post-processing
2. volume size in 100TB
3. velocity: rate of change depends on interactions with experts, which can be real time and large in volume; data fed into the simulation can be small but the output can be massive
4. variety: depends upon the complexity of the model over which the simulation is being performed. Can be very complex if other aspects of the world population such as type of activity and geographical, socio-economic, and cultural variations are taken into account
Volume: 100TB
Velocity: Data feeding into the simulation is small but real time data generated by simulation is massive.
Variety: Can be rich, with various population activities and geographical, socio-economic, and cultural variations
Requirements:
1. needs to support file-based synthetic population either centralized or distributed sites
2. needs to support large volume real time output data
3. needs to support variety of output datasets depends on the complexity of the model
M0173
1. Distributed processing on many dynamic data sources with software running on commodity clusters and newer architectures and systems (e.g., clouds)
2. must have fine-resolution models and datasets to support human-to-human interactions over the Internet (e.g., Twitter)
3. easily 10s of TB per year of new data
Volume: 10s of TB per year
Velocity: During social unrest events, human interactions and mobility leads to rapid changes in data; e.g., who follows whom in Twitter.
Variety: Data fusion a big issue. How to combine data from different sources and how to deal with missing or incomplete data?
Requirements:
1. needs to support traditional and new architectures for dynamic distributed processing on commodity clusters
2. needs to support fine-resolution models and datasets to support Twitter network traffic
3. needs to support huge data storage per year
M0141
1. May require special dedicated or overlay sensor network.
2. Storage Distributed, historical and trends data archiving
3. Ecological information from numerous observation and monitoring facilities and sensor network, satellite images/information, climate and weather, all recorded information.
4. satellite images/information, climate and weather, photos, video, sound recordings, all recorded information….
5. Relational data, key-value, complex semantically rich data
6. May require data streaming processing.
7. Support mobile sensors for data collection, use case lists several types of mobile sources
Volume: N/A
Velocity: Real time processing and analysis in case of the natural or industrial disaster
Variety: Rich variety and number of involved databases and observation data
Requirements:
1. needs to support special dedicated or overlay sensor network
2. needs to support storage distributed, historical and trends data archiving
3. needs to support data sources distributed, and include observation and monitoring facilities, sensor network, and satellites.
4. needs to support wide variety of data, including satellite images/information, climate and weather data, photos, video, sound recordings…
5. needs to support multi-type data combination and linkage with potentially unlimited data variety
6. needs to support data streaming
M0171
1. see use case
Volume: 500+ billion photos on Facebook, 5+ billion photos on Flickr.
Velocity: over 500M images uploaded to Facebook each day
Variety: Images and metadata including EXIF tags (focal distance, camera type, etc),
Requirements:
1. needs to support over 500M images uploaded to social media sites each day
M0160
1. Continuous real-time data-stream incoming from each source, distributed – with replication/redundancy
2. Acquisition and storage of a large volume of continuous streaming data from Twitter (~100 million messages per day, ~500GB data/day, ~30TB/year compressed data, increasing over time)
3. Raw data stored in large compressed flat files
4. Fully structured data (JSON format) enriched with user metadata, geo-locations, etc.
5. Data schema provided by social media data source. Currently using Twitter only. We plan to expand incorporating Google+, Facebook
Volume: 30TB/year compressed data
Velocity: Near real-time data storage, querying & analysis
Variety: Schema provided by social media data source. Currently using Twitter only. We plan to expand incorporating Google+, Facebook
Requirements:
1. needs to support distributed data sources
2. needs to support large volume real time streaming
3. needs to support raw data in compressed formats
4. needs to support fully structured data in JSON with user metadata and geo-location data (a minimal parsing sketch follows this list)
5. needs to support multiple data schemas
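The fully structured JSON records noted in requirement 4 typically need only a light flattening step before storage and querying. The following minimal Python sketch illustrates the idea; the field names (id, created_at, text, user, coordinates) follow the general shape of a tweet-style record but are assumptions here, not part of the submission.

import json

# Minimal sketch: flatten one tweet-like JSON record into the fields the
# use case says it stores (text, user metadata, geo-location).
# Field names are illustrative assumptions, not the official schema.
raw = ('{"id": 1, "created_at": "2013-07-01T12:00:00Z", '
       '"text": "example message", '
       '"user": {"id": 42, "screen_name": "example"}, '
       '"coordinates": {"type": "Point", "coordinates": [-86.5, 39.2]}}')

record = json.loads(raw)
flat = {
    "tweet_id": record["id"],
    "timestamp": record["created_at"],
    "text": record["text"],
    "user_id": record["user"]["id"],
    "user_name": record["user"]["screen_name"],
    # geo-location may be absent; keep None in that case
    "lon_lat": (record.get("coordinates") or {}).get("coordinates"),
}
print(flat)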
M0158
1. A single network remains in a single disk file accessible by multiple processors. However, during the execution of a parallel algorithm, the network can be partitioned and the partitions are loaded in the main memory of multiple processors
2. Challenging due to asynchronous distributed computation. Current systems are designed for real time synchronous response
Volume: Can be hundreds of GB for a single network. 1000-5000 networks and methods
Velocity: Dynamic networks; network collection growing
Variety: Many types of networks
Requirements:
1. needs to support a set of network topologies file to study graph theoretic properties and behaviors of various algorithms
2. needs to support asynchronous and real time synchronous distributed computing
M0190
1. see use case
2. see use case
Volume: >900M Web pages occupying 30 TB of storage, 100M tweets, 100M ground-truthed biometric images, 100,000's partially ground-truthed video clips, and terabytes of smaller fully ground-truthed test collections.
Velocity: Most legacy evaluations are focused on retrospective analytics. Newer evaluations are focusing on simulations of real-time analytic challenges from multiple data streams.
Variety: Wide variety of data types including textual search/extraction, machine translation, speech recognition, image and voice biometrics, object and person recognition and tracking, document analysis, human-computer dialogue, and multimedia search/extraction.
Requirements:
1. needs to support large amounts of semi-annotated web pages, tweets, images, video
2. needs to support scaling ground-truthing to larger data, intrinsic and annotation uncertainty measurement, performance measurement for incompletely annotated data, measuring analytic performance for heterogeneous data and analytic flows involving users
M0130
1. see use case
2. see use case
Volume: Petabytes, hundreds of millions of files
Velocity: Real Time & Batch
Variety: Rich
Requirements:
1. needs to support processing of key format types NetCDF, HDF5, and DICOM (a small inspection sketch follows this list)
2. needs to support real-time and batch data
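Requirement 1 lists NetCDF, HDF5, and DICOM as key format types. As a minimal sketch of what processing the first two involves at the lowest level, the following Python fragment writes and then inspects a tiny NetCDF file and a tiny HDF5 file, assuming the netCDF4 and h5py packages are available; the file names and variable names are placeholders.

import numpy as np
from netCDF4 import Dataset   # assumes the netCDF4 package is installed
import h5py                   # assumes the h5py package is installed

# Write then inspect a tiny NetCDF file.
with Dataset("sample.nc", "w") as nc:
    nc.createDimension("time", 3)
    var = nc.createVariable("temperature", "f4", ("time",))
    var[:] = [288.1, 289.0, 287.5]

with Dataset("sample.nc", "r") as nc:
    print("NetCDF variables:", list(nc.variables.keys()))

# Write then inspect a tiny HDF5 file; visit() walks every group/dataset.
with h5py.File("sample.h5", "w") as f:
    f.create_dataset("observations/temperature", data=np.arange(3.0))

with h5py.File("sample.h5", "r") as f:
    f.visit(lambda name: print("HDF5 object:", name))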
M0163
1. see use case
Volume: Small as metadata to Big Data
Velocity: Real Time
Variety: Can tackle arbitrary Big Data
Requirements:
1. needs to support integration of metadata approaches across disciplines
M0131
1. see use case
Volume: Few Terabytes
Velocity: Evolving in Time
Variety: Rich
Requirements:
1. needs to support all data types, image to text, structures to protein sequence
M0189
1. see use case
2. see use case
Volume: 50-400 GB per day. Total ~400 TB
Velocity: Continuous stream of Data but analysis need not be real time
Variety: Images
Requirements:
1. needs to support multiple streams of real time data to be stored and analyzed later
2. needs to support sample data to be analyzed in real time
M0170
1. see use case
2. see use case
Volume: ~100TB total increasing by 0.1TB a night accessing PB's of base astronomy data. Successor LSST will take 30TB a night in 2020's,
Velocity: Nightly update runs processed in real time
Variety: Images, spectra, time series, catalogs.
Requirements:
1. needs to support ~0.1TB per day at present, which will increase by a factor of 100
M0185
1. see use case
Volume: Several petabytes from Dark Energy Survey and Zwicky Transient Factory. Simulations > 10PB
Velocity: Analysis done in batch mode with data from observations and simulations updated daily
Variety: Image and simulation data
Requirements:
1. needs to support ~1 PB per year, becoming 7PB a year of observational data
M0209
1. see use case
Volume: Dark Energy Survey will take PB's of data
Velocity: 400 images of one gigabyte in size per night
Variety: Images
Requirements:
1. needs to support 20TB data per day
M0166
1. 15 Petabytes per year from Accelerator and Analysis
2. Real time with some long shut downs with no data
3. A few hundred final particles, but all data is a collection of particles after initial analysis. A huge effort is needed to make certain the complex apparatus is well understood and corrections are properly applied to the data. Often requires data to be re-analyzed
Volume: 15 PB's of data (experiment and Monte Carlo combined) per year
Velocity: Data updated continuously with sophisticated real-time selection and test analysis but all analyzed "properly" offline
Variety: Each stage in analysis has different format but data uniform within each analysis stage
Requirements:
1. needs to support real time data from Accelerator and Analysis instruments
2. needs to support asynchronous data collection
3. needs to support calibration of instruments
M0210
1. see use case
Volume: Eventually 120 PB of Monte Carlo and observational data
Velocity: Data updated continuously with sophisticated real-time selection and test analysis but all analyzed "properly" offline
Variety: Each stage in analysis has different format but data uniform within each analysis stage
Requirements:
1. needs to support 120PB Raw data
M0155
1. see use case
2. see use case
3. see use case
Volume: Terabytes per year today but 40PB/year starting ~2022
Velocity: Data updated continuously with real time test analysis and batch full analysis
Variety: Big Data Uniform
Requirements:
1. needs to support remote sites generating 40PB data per year by 2022
2. needs to support HDF5 data format
3. needs to support visualization of high (>=5) dimension data
M0157
1. distributed, long-term, remote-controlled observational networks; continuous raw data coming from more than 1,000 stations recording about 40GB per day, so over 15 TB per year; within EISCAT 3D, raw voltage data will reach 40PB/year in 2023
2. instrumentation measurement datasets, metadata, ontology, annotations
Volume: Apart from EISCAT 3D given above, these are low volume. One system EPOS ~15 TB/year
Velocity: Mainly real time data streams
Variety: This is 6 separate projects with common architecture for infrastructure. So data very diverse across projects
Requirements:
1. needs to support huge volume real time distributed data sources
2. needs to support variety of instrumentation datasets and metadata
M0167
1. Aircraft flying over ice sheets in carefully planned paths with data downloaded to disks
2. all data gathered in real time
3. Lots of different datasets – each needing custom signal processing but all similar in structure
Volume: Current data around a PB increasing by 50-100TB per mission. Future expedition ~PB each
Velocity: Data taken in ~2 month missions including test analysis and then later batch processing
Variety: Raw data, images with final layer data used for science
Requirements:
1. needs to provide reliable data transmission from aircraft sensors/instruments or removable disks from remote sites
2. needs to support data gathering in real time
3. needs to support varieties of datasets
M0127
1. see use case
2. see use case
Volume: Raw data 110TB and 40TB processed, plus smaller samples
Velocity: Data comes from aircraft and so incrementally added. Data occasionally get reprocessed: new processing methods or parameters
Variety: Image and annotation files
Requirements:
1. needs to support angular as well as spatial data
2. needs to support compatibility with other NASA Radar systems and repositories (Alaska Satellite Facility)
M0182
1. see use case
Volume: The MERRA collection (below) represents most of the total data; there are other smaller collections
Velocity: Periodic updates every 6 months
Variety: many applications combine MERRA reanalysis data with other reanalyses and observational data such as CERES
Requirements:
1. needs to support federation of distributed heterogeneous datasets
M0129
1. see use case
2. see use case
3. see use case
4. see use case
Volume: MERRA is 480TB
Velocity: Increases at ~1TB/month
Variety: applications combine MERRA reanalysis data with other re-analyses and observational data.
Requirements:
1. needs to support integration of simulation output and observational data, NetCDF files
2. needs to support both real-time and batch modes
3. needs to support interoperable use of Amazon AWS and local clusters
4. needs to support iRODS data management
M0090
1. Distributed with incremental, 200TB (current), 500TB in 5 years
2. Re-analysis datasets are inconsistent in format, resolution, semantics, and metadata. Likely each of these input streams will have to be interpreted/analyzed into a common product.
Volume: 200TB (current), 500TB within 5 years
Velocity: Data analyzed incrementally
Variety: Re-analysis datasets are inconsistent in format, resolution, semantics, and metadata. Likely each of these input streams will have to be interpreted/analyzed into a common product
Requirements:
1. needs to support real time distributed datasets
2. needs to support various formats, resolution, semantics, and metadata
M0186
1. see use case
2. see use case
3. see use case
Volume: Up to 30 PB/year from 15 end to end simulations at NERSC. More at other HPC centers
Velocity: 42 Gbytes/sec from simulations
Variety: Variety across simulation groups and between observation and simulation
Requirements:
1. needs to support ~100PB of data in 2017, streaming at high data rates from large supercomputers across the world
2. needs to support integration of large-scale distributed data from simulations with diverse observations
3. needs to support linking diverse data to novel HPC simulation
M0183
1. see use case
2. see use case
3. see use case
Volume: N/A
Velocity: N/A
Variety: From 'omics of the microbes in the soil to watershed hydro-biogeochemistry. From observation to simulation
Requirements:
1. needs to support heterogeneous, diverse data with different domains and scales, and translation across diverse datasets that cross domains and scales
2. needs to support synthesis of diverse and disparate field, laboratory, omic, and simulation datasets across different semantic, spatial, and temporal scales
3. needs to support link diverse data to novel HPC simulation
M0184
1. see use case
2. see use case
3. see use case
4. see use case
5. see use case
Volume: N/A
Velocity: Streaming data from ~150 towers in AmeriFlux and over 500 towers distributed globally collecting flux measurements
Variety: Flux data needs to be merged with biological, disturbance, and other ancillary data
Requirements:
1. needs to support heterogeneous, diverse data with different domains and scales, and translation across diverse datasets that cross domains and scales
2. needs to support linking to many other environment and biology datasets
3. needs to support linking to HPC climate and other simulations
4. needs to support linking to European data sources and projects
5. needs to support accessing data from 500 distributed sources
M0223
1. see use case
2. see use case
Volume: 4 TB a year for a city with 1.4M sensors like Los Angeles
Velocity: Streaming data from million(s) of sensors
Variety: Tuple-based: Timeseries, database rows; Graph-based: Network topology, customer connectivity; Some semantic data for normalization
Requirements:
1. needs to support diverse data from Smart Grid sensors, City planning, weather, utilities
2. needs to support data updated every 15 minutes
Transformation
General Requirements
1. needs to support diversified compute-intensive analytic processing and machine learning techniques
(38: M0078, M0089, M0103, M0127, M0129, M0140, M0141, M0148, M0155, M0157, M0158, M0160, M0161, M0164, M0164, M0166, M0166, M0167, M0170, M0171, M0172, M0173, M0174, M0176, M0177, M0182, M0185, M0186, M0190, M0191, M0209, M0211, M0213, M0214, M0215, M0219, M0222, M0223)
2. needs to support batch and real time analytic processing
(7: M0090, M0103, M0141, M0155, M0164, M0165, M0188)
3. needs to support processing large diversified data content and modeling
(15: M0078, M0089, M0127, M0140, M0158, M0162, M0165, M0166, M0166, M0167, M0171, M0172, M0173, M0176, M0213)
4. needs to support processing data in motion (streaming, fetching new content, tracking, etc.)
(6: M0078, M0090, M0103, M0164, M0165, M0166)
M0148
1. Crawl/index
2. ranking, Data categorization (sensitive, confidential, etc.) PII data detection and flagging
3. Perform pre-processing
4. manage for long-term of large and varied data
5. Search huge amount of data to ensure high relevancy and recall.
Software: Custom software, commercial search products, commercial databases.
Analytics: Crawl/index; search; ranking; predictive search. Data categorization (sensitive, confidential, etc.). Personally Identifiable Information (PII) detection and flagging
Requirements:
1. needs to support crawling and indexing from distributed data sources
2. needs to support various analytics processing including ranking, data categorization, and PII data detection and flagging (a minimal detection sketch follows this list)
3. needs to support pre-processing of data
4. needs to support long-term preservation management of large, varied datasets
5. needs to support huge amounts of data with high relevancy and recall
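Requirement 2 calls for PII detection and flagging as part of data categorization. The sketch below shows the simplest possible regex-based flagger in Python; the patterns (email, US SSN, US phone) are illustrative assumptions, and a production system would use far richer detectors.

import re

# Minimal sketch of PII flagging: scan text for a few illustrative
# patterns (email address, US Social Security Number, US phone number).
# Production systems would use much richer detectors and categorization.
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}

def flag_pii(text):
    """Return the PII categories detected in a document."""
    return [label for label, pat in PII_PATTERNS.items() if pat.search(text)]

print(flag_pii("Contact jane.doe@example.gov, SSN 123-45-6789"))
# -> ['email', 'ssn']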
M0219
1. see use case
Software: Hadoop, Spark, Hive, R, SAS, Mahout, Allegrograph, MySQL, Oracle, Storm, BigMemory, Cassandra, Pig
Analytics: recommendation systems, continued monitoring
Requirements:
1. needs to support analytics for recommendation systems, continued monitoring, and general survey improvement
M0222
1. see use case
Software: Hadoop, Spark, Hive, R, SAS, Mahout, Allegrograph, MySQL, Oracle, Storm, BigMemory, Cassandra, Pig
Analytics: New analytics to create reliable information from non traditional disparate sources
Requirements:
1. needs to support analytics to create reliable estimates using data from traditional survey sources, government administrative data sources and non-traditional sources from the digital economy
M0175
1. see use case
Software: Hadoop RDBMS XBRL
Analytics: Fraud Detection
Requirements:
1. needs to support real-time analytics (essential)
M0161
1. Standard libraries for machine learning and analytics, LDA
2. a major challenge is clustering matching documents together in a computationally efficient way (scalable and parallelized) when they’re uploaded from different sources
3. have been slightly modified via third-party annotation tools or publisher watermarks and cover pages
Software: Hadoop, Scribe, Hive, Mahout, Python
Analytics: Standard libraries for machine learning and analytics, LDA, custom built reporting tools for aggregating readership and social activities per document
Requirements:
1. needs to support standard machine learning and analytics libraries
2. needs to support scalable and parallelized efficient way for matching between documents
3. needs to support third-party annotation tools or publisher watermarks and cover pages
M0164
1. Allow streaming of user selected movies to satisfy multiple objectives (for different stakeholders)
2. Find best possible ordering of a set of videos for a user (household) within a given context in real-time, maximize movie consumption
3. Recommender systems are always personalized and use logistic/linear regression, elastic nets, matrix factorization, clustering, latent Dirichlet allocation, association rules, gradient boosted decision trees and others
4. Rankings are intrinsically “rough” data and need robust learning algorithms
5. Analytics needs continued monitoring and improvement.
Software: Hadoop and Pig; Cassandra; Teradata
Analytics: Personalized recommender systems using logistic/linear regression, elastic nets, matrix factorization, clustering, latent Dirichlet allocation, association rules, gradient boosted decision trees, and others. Also streaming video delivery.
Requirements:
1. needs to support streaming video contents to multiple clients
2. needs to support analytic processing for matching clients' interest in movie selection
3. needs to support various analytic processing techniques for consumer personalization (a matrix-factorization sketch follows this list)
4. needs to support robust learning algorithms
5. needs to support continued analytic processing based on the monitoring and performance results
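Matrix factorization is one of the recommender techniques named in the Analytics line above. The following is a minimal numpy sketch that factorizes a tiny user-by-item rating matrix with plain gradient updates; the toy data, rank, and learning rate are assumptions for illustration, not the production pipeline.

import numpy as np

# Minimal matrix-factorization sketch for a recommender: approximate a
# user-by-item rating matrix R as U @ V.T and predict missing entries.
# The toy data and the plain gradient updates are illustrative only.
rng = np.random.default_rng(0)
R = np.array([[5, 3, 0, 1],        # 0 marks an unobserved rating
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [0, 1, 5, 4]], dtype=float)
mask = R > 0
k, lr, reg = 2, 0.01, 0.05
U = rng.normal(scale=0.1, size=(R.shape[0], k))
V = rng.normal(scale=0.1, size=(R.shape[1], k))

for _ in range(2000):
    err = mask * (R - U @ V.T)         # error on observed entries only
    U += lr * (err @ V - reg * U)
    V += lr * (err.T @ U - reg * V)

print(np.round(U @ V.T, 1))            # predicted ratings, incl. the gaps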
M0165
1. Crawling, searching including topic based search
2. ranking, recommending, Link to user profiles and social network data
Software: MapReduce + Bigtable; Dryad + Cosmos. PageRank. Final step essentially a recommender engine
Analytics: Crawling; searching including topic based search; ranking; recommending
Requirements:
1. needs to support dynamic fetching content over the network
2. needs to support linking user profiles and social network data
M0137
1. see use case
2. see use case
Software: Hadoop, MapReduce, Open-source, and/or Vendor Proprietary such as AWS (Amazon Web Services), Google Cloud Services, and Microsoft
Analytics: Robust Backup
Requirements:
1. needs to support robust Backup algorithm
2. needs to support replicate recent changes
M0103
1. The identification of an item begins with the sender and extends to the recipients and all those in between who need to know the location and time of arrival of the items while in transport. A new aspect will be the status condition of the items, which will include sensor information, GPS coordinates, and a unique identification
2. The data is near real-time, being updated when a truck arrives at a depot or upon delivery of the item to the recipient
Software: N/A
Analytics: Distributed Event Analysis identifying problems.
Requirements:
1. needs to support tracking items based on the unique identification with its sensor information, GPS coordinates
2. needs to support real time updates on tracking items
M0162
1. More complex material properties can require many (100s?) of independent variables to describe accurately. Virtually no activity exists that is trying to identify and systematize the collection of these variables to create robust data sets.
Software: National programs (Japan, Korea, and China), application areas (EU Nuclear program), proprietary systems (Granta, etc.)
Analytics: No broadly applicable analytics
Requirements:
1. needs to support (100s) of independent variables by collecting these variables to create robust datasets
M0176
1. High-throughput computing (HTC), fine-grained tasking and queuing. Rapid start/stop for ensembles of tasks. Real-time data analysis for web-like responsiveness
2. Mashup of simulation outputs across codes and levels of theory. Formatting, registration and integration of datasets. Mashups of data across simulation scales
3. The targets for materials design will become more search and crowd-driven. The computational backend must flexibly adapt to new targets
4. MapReduce and search that join simulation and experimental data
Software: MongoDB, GPFS, PyMatGen, FireWorks, VASP, ABINIT, NWChem, BerkeleyGW, varied community codes
Analytics: MapReduce and search that join simulation and experimental data.
Requirements:
1. needs to support high-throughput computing real-time data analysis for web-like responsiveness
2. needs to support mashup of simulation outputs across codes
3. needs to support search- and crowd-driven targets, with a computational backend flexible enough to adapt to new targets
4. needs to support MapReduce and search to join simulation and experimental data (a minimal join sketch follows this list)
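Requirement 4 asks for MapReduce-style joins of simulation and experimental data. The sketch below shows the underlying reduce-side join pattern in plain Python, keyed by a material identifier; the record fields and IDs are hypothetical placeholders, not actual project data.

from collections import defaultdict

# Minimal sketch of a reduce-side join keyed by material ID, in the
# spirit of the MapReduce joins mentioned in requirement 4. The record
# fields (band_gap_eV, measured_gap_eV) are hypothetical placeholders.
simulated = [("mat-001", {"band_gap_eV": 1.1}), ("mat-002", {"band_gap_eV": 3.4})]
measured = [("mat-001", {"measured_gap_eV": 1.2})]

# "Map" phase: tag each record with its source and emit (key, value) pairs.
grouped = defaultdict(list)
for key, value in simulated:
    grouped[key].append(("sim", value))
for key, value in measured:
    grouped[key].append(("exp", value))

# "Reduce" phase: merge all records that share a material ID.
for material_id, records in grouped.items():
    merged = {}
    for _source, value in records:
        merged.update(value)
    print(material_id, merged)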
M0213
1. see use case
2. see use case
3. see use case
Software: Geospatially enabled RDBMS, ESRI ArcServer, Geoserver
Analytics: Closest point of approach, deviation from route, point density over time, Principal Component Analysis (PCA) and Independent Component Analysis (ICA)
Requirements:
1. needs to support analytics including closest point of approach, deviation from route, point density over time, PCA, and ICA (a small PCA sketch follows this list)
2. needs to support geospatial data, which requires unique approaches to indexing and distributed analysis.
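PCA is one of the analytics listed above for dense geospatial point data. A minimal numpy sketch of PCA via singular value decomposition on a toy 2-D point set follows; the synthetic data is purely illustrative, and real inputs would be much larger point clouds.

import numpy as np

# Minimal PCA sketch (one of the analytics listed above) via SVD on a
# toy set of 2-D track points; real inputs would be large point clouds.
rng = np.random.default_rng(1)
points = rng.normal(size=(500, 2)) @ np.array([[3.0, 1.0], [0.0, 0.5]])

centered = points - points.mean(axis=0)
_, singular_values, components = np.linalg.svd(centered, full_matrices=False)

explained_variance = singular_values**2 / (len(points) - 1)
print("principal directions:\n", components)      # rows are components
print("explained variance:", np.round(explained_variance, 2))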
M0214
1. see use case
Software: custom software and tools including traditional RDBMSs and display tools.
Analytics: Visualization as overlays on a GIS. Analytics are basic object detection analytics and integration with sophisticated situation awareness tools with data fusion.
Requirements:
1. needs to support rich analytics with object identification, pattern recognition, crowd behavior, economic activity, and data fusion
M0215
1. see use case
Software: Hadoop, Accumulo (Big Table), Solr, Natural Language Processing, Puppet (for deployment and security) and Storm. GIS
Analytics: Near Real Time Alerts based on patterns and baseline changes, Link Analysis, Geospatial Analysis, Text Analytics (sentiment, entity extraction, etc.)
Requirements:
1. needs to support analytics include NRT Alerts based on patterns and baseline changes
M0177
1. As patients receive care in a variety of clinical settings, there is a need to integrate and rationalize data across sources.
2. Information retrieval and natural language processing techniques to extract clinical features. [Techniques include] feature selection, machine learning decision-models, maximum likelihood estimators and Bayesian networks. Note that techniques used to derive knowledge from this data are nascent. [Healthcare analytics field is growing and continues to evolve.]
Software: Teradata, PostgreSQL, MongoDB, Hadoop, Hive, R
Analytics: Information retrieval methods (TF-IDF), natural language processing, maximum likelihood estimators, and Bayesian networks.
Requirements:
1. needs to support a comprehensive and consistent view of data across sources, and over time
2. needs to support analytic techniques: information retrieval (a minimal TF-IDF sketch follows this list), natural language processing, machine learning decision models, maximum likelihood estimators, and Bayesian networks
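TF-IDF-based information retrieval is listed among the analytic techniques. The sketch below ranks a few invented clinical-note snippets against a query, assuming scikit-learn is available; the notes, the query, and the absence of any NLP pre-processing are simplifying assumptions.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Minimal TF-IDF retrieval sketch over invented clinical-note snippets.
# A production pipeline would add NLP-based feature extraction first.
notes = [
    "patient reports chest pain and shortness of breath",
    "follow-up visit for type 2 diabetes, metformin continued",
    "no chest pain today; blood pressure within normal range",
]
vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(notes)

query_vector = vectorizer.transform(["chest pain"])
scores = cosine_similarity(query_vector, doc_vectors).ravel()
print(scores.argsort()[::-1])   # note indices ranked by relevance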
M0089
1. Develop high-performance image analysis algorithms to extract spatial information from images
2. provide efficient spatial queries and analytics, and feature clustering and classification
3. Extreme large size, multi-dimensional, disease specific analytics, correlation with other data types (clinical data, -omic data)
Software: MPI for image analysis; MapReduce + Hive with spatial extension
Analytics: Image analysis, spatial queries and analytics, feature clustering and classification
Requirements:
1. needs to support high performance image analysis to extract spatial information
2. needs to support spatial queries and analytics, and feature clustering and classification
3. needs to support analytic processing on huge multi-dimensional large dataset and be able to correlate with other data types such as clinical data, -omic data.
M0191
1. High-throughput computing (HTC), responsive analysis
2. Segmentation of regions of interest, crowd-based selection and extraction of features, and object classification, and organization, and search.
3. Advance biosciences discovery through big data techniques / extreme scale computing…. In-database processing and analytics. … Machine learning (SVM and RF) for classification and recommendation services. … advanced algorithms for massive image analysis. High-performance computational solutions.
4. Community-focused science gateways to guide the application of massive data analysis toward massive imaging data sets.
Software: Scalable key-value and object store databases needed. ImageJ, OMERO, VolRover, advanced segmentation and feature detection methods
Analytics: Machine learning (SVM and RF) for classification and recommendation services
Requirements:
1. needs to support high-throughput computing with responsive analysis
2. needs to support segmentation of regions of interest, crowd-based selection and extraction of features, and object classification, and organization, and search.
3. needs to support advance biosciences discovery through big data techniques / extreme scale computing…. In-database processing and analytics. … Machine learning (SVM and RF) for classification and recommendation services. … advanced algorithms for massive image analysis. High-performance computational solutions.
4. needs to support massive data analysis toward massive imaging data sets.
M0078
1. Processing of raw data to produce variant calls. Also, clinical interpretation of variants, which is now very challenging
2. All sequencing technologies have significant systematic errors and biases, which require complex analysis methods and combining multiple technologies to understand, often with machine learning
Software: Open-source sequencing bioinformatics software from academic groups
Analytics: Processing of raw data to produce variant calls. Clinical interpretation of variants
Requirements:
1. needs to support processing of raw data into variant calls
2. needs to support machine learning for complex analysis of systematic errors from sequencing technologies, which are hard to characterize
M0188
1. Comparative analysis for metagenomes and genomes
2. Descriptive statistics, statistical significance in hypothesis testing, discovering new relationships, data clustering and classification is a standard part of the analytics. … Data reduction, removing redundancies through clustering, ...
Software: Standard bioinformatics tools (BLAST, HMMER, multiple alignment and phylogenetic tools, gene callers, sequence feature predictors.), Perl/Python wrapper scripts
Analytics: Descriptive statistics, statistical significance in hypothesis testing, data clustering and classification
Requirements:
1. needs to support sequencing and comparative analysis techniques for highly complex data
2. needs to support descriptive statistics
M0140
1. Data integration, using ontological annotation and taxonomies
2. Stage 2: Needs efficient parallel retrieval algorithms, suitable for cloud or HPC, using open source Hbase with both indexed and custom search to identify patients of possible interest.
3. Stage 3: The EHR, as an RDF graph, provides a very rich environment for graph pattern mining. Needs new distributed graph mining algorithms to perform pattern analysis and graph indexing technique for pattern searching on RDF triple graphs.
4. Stage 4: Given the size and complexity of graphs, mining subgraph patterns could generate numerous false positives and miss numerous false negatives. Needs robust statistical analysis tools to manage false discovery rate and determine true subgraph significance and validate these through several clinical use cases.
5. Integrating data into semantic graph, using graph traverse to replace SQL join. Developing semantic graph mining algorithms to identify graph patterns, index graph, and search graph. Indexed Hbase.
6. Fundamentally, the paradigm changes from relational row-column lookup to semantic graph traversal.
Software: HDFS supplementing Mayo internal data warehouse called Enterprise Data Trust (EDT)
Analytics: Integrating data into semantic graph, using graph traverse to replace SQL join. Developing semantic graph mining algorithms to identify graph patterns, index graph, and search graph. Indexed Hbase. Custom code to develop new patient properties from stored data.
Requirements:
1. needs to support data integration, using ontological annotation and taxonomies
2. needs to support parallel retrieval algorithms for both indexed and custom searches to identify data of interest, e.g., patient cohorts, patients meeting certain criteria, patients sharing similar characteristics
3. needs to support distributed graph mining algorithms, pattern analysis and graph indexing, pattern searching on RDF triple graphs
4. needs to support robust statistical analysis tools to manage false discovery rate, determine true subgraph significance, validate results, eliminate false positive / false negative results
5. needs to support semantic graph mining algorithms to identify graph patterns, and to index and search the graph
6. needs to support semantic graph traversal (a minimal RDF sketch follows this list)
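Requirement 6 describes replacing relational row-column lookup (SQL joins) with semantic graph traversal. The following minimal sketch, assuming the rdflib Python package, expresses a three-way join as a single graph pattern; the EHR namespace and predicates (hasDiagnosis, treatedBy, takesMedication) are invented for illustration.

from rdflib import Graph, Namespace

# Minimal sketch of the "graph traversal instead of SQL join" idea in
# requirement 6, assuming the rdflib package. The EHR namespace and
# predicates (hasDiagnosis, treatedBy, takesMedication) are invented.
EX = Namespace("http://example.org/ehr/")
g = Graph()
g.add((EX.patient42, EX.hasDiagnosis, EX.type2_diabetes))
g.add((EX.patient42, EX.takesMedication, EX.metformin))
g.add((EX.type2_diabetes, EX.treatedBy, EX.metformin))

# One graph pattern expresses what would be a multi-table join in SQL.
query = """
PREFIX ex: <http://example.org/ehr/>
SELECT ?patient ?med WHERE {
    ?patient ex:hasDiagnosis ?dx .
    ?dx ex:treatedBy ?med .
    ?patient ex:takesMedication ?med .
}
"""
for row in g.query(query):
    print(row.patient, row.med)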
M0174
1. relational probabilistic models that have the capability of handling rich relational data and modeling uncertainty using probability theory. The software learns models from multiple data types and can possibly integrate the information and reason about complex queries.
2. sometimes, large amount of data is available about a single subject but the number of subjects themselves is not very high (i.e., data imbalance). This can result in learning algorithms picking up random correlations between the multiple data types as important features in analysis. Hence, robust learning methods that can faithfully model the data are of paramount importance.
3. The incidence of certain diseases may be rare making the ratio of cases to controls extremely skewed making it possible for the learning algorithms to model noise instead of examples.
4. Models learned from one set of populations cannot be easily generalized across other populations with diverse characteristics. This requires that the learned models can be generalized and refined according to the change in the population characteristics.
5. Challenging due to different modalities of the data, human errors in data collection and validation
Software: Mainly Java based, in house tools are used to process the data.
Analytics: Relational probabilistic models (Statistical Relational AI) learnt from multiple data types
Requirements:
1. needs to support relational probabilistic models / probability theory; the software learns models from multiple data types and can possibly integrate the information and reason about complex queries
2. needs to support robust and accurate learning methods to account for 'data imbalance' [where a large amount of data is available for a small number of subjects]
3. needs to support learning algorithms that identify skews in data, so as to not incorrectly model 'noise'
4. needs to support learned models that can be generalized and refined in order to be applied to diverse sets of data
5. needs to support challenging data; must accept data in different modalities [and from disparate sources]
M0172
1. Computation of the simulation is both compute intensive and data intensive, and is therefore also bandwidth intensive. Hence, a supercomputer is more applicable than cloud-type clusters
2. Moreover, due to unstructured and irregular nature of graph processing the problem is not easily decomposable
3. Summary of various runs and replicates of a simulation
Software: Charm++, MPI
Analytics: Simulations on a Synthetic population
Requirements:
1. needs to support compute- and data-intensive computation, like a supercomputer's performance
2. needs to support unstructured and irregular nature of graph processing
3. needs to support summary of various runs of simulation
M0173
1. How to take into account heterogeneous features of 100s of millions or billions of individuals, models of cultural variations across countries that are assigned to individual agents? How to validate these large models? Different types of models (e.g., multiple contagions): disease, emotions, behaviors. Modeling of different urban infrastructure systems in which humans act. With multiple replicates required to assess stochasticity, large amounts of output data are produced, storage requirements.
2. Fusion of different data types. Different datasets must be combined depending on the particular problem. How to quickly develop, verify, and validate new models for new applications.
3. What is appropriate level of granularity to capture phenomena of interest while generating results sufficiently quickly
Software: Specialized simulators, open source software, and proprietary modeling environments. Databases.
Analytics: Models of behavior of humans and hard infrastructures, and their interactions. Visualization of results
Requirements:
1. needs to support large scale modeling for various events (disease, emotions, behaviors, etc.)
2. needs to support scalable fusion between combined datasets
3. needs to support multi-level analysis while generating results sufficiently quickly
M0141
1. Data analysed incrementally and/or in real time; processing dynamics correspond to the dynamics of biological and ecological processes.
2. Provide integrated access to a variety of data, analytical and modeling tools as served by a variety of collaborating initiatives.
3. Parallel data streams and streaming analytics
4. Access and integration of multiple distributed databases
Software: RDBMS
Analytics: Requires advanced and rich visualization
Requirements:
1. needs to support data analysed incrementally and/or in real time at varying rates due to variations in source processes
2. needs to support a variety of data, analytical and modeling tools to support analytics for diverse scientific communities.
3. needs to support parallel data streams and streaming analytics
4. needs to support access and integration of multiple distributed databases
M0171
1. see use case
2. see use case
Software: Hadoop Map-reduce, simple hand-written multithreaded tools (ssh and sockets for communication)
Analytics: Robust non-linear least squares optimization problem. Support Vector Machine
Requirements:
1. needs to support a classifier (e.g., a Support Vector Machine), a process that is often hard to parallelize
2. needs to support features seen in many large scale image processing problems
M0160
1. near real-time analysis of such data, for anomaly detection, stream clustering, signal classification on multi-dimensional time series and online-learning
Software: Hadoop IndexedHBase & HDFS. Hadoop, Hive, Redis for data management. Python: SciPy NumPy and MPI for data analysis.
Analytics: Anomaly detection, stream clustering, signal classification and online-learning; Information diffusion, clustering, and dynamic network visualization
Requirements:
1. needs to support various real-time data analyses for anomaly detection, stream clustering, signal classification on multi-dimensional time series, and online learning (a minimal anomaly-detection sketch follows)
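As a minimal illustration of the anomaly-detection part of this requirement, the sketch below flags points in a synthetic one-dimensional time series whose rolling z-score exceeds a threshold; the window size and threshold are assumptions, and real deployments operate on multi-dimensional streams.

import numpy as np

# Minimal sketch of streaming anomaly detection on a 1-D time series:
# flag points whose rolling z-score exceeds a threshold. The synthetic
# series and the threshold of 4.0 are illustrative assumptions.
rng = np.random.default_rng(2)
series = rng.normal(size=1000)
series[700] += 8.0                      # inject one obvious anomaly

window = 50
threshold = 4.0
for t in range(window, len(series)):
    recent = series[t - window:t]
    z = (series[t] - recent.mean()) / (recent.std() + 1e-9)
    if abs(z) > threshold:
        print(f"anomaly at t={t}, z={z:.1f}")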
M0211
1. see use case
2. see use case
Software: XML technology, traditional relational databases
Analytics: Pattern recognition (e.g., speech recognition, automatic A&V analysis, cultural patterns), identification of structures (lexical units, linguistic rules, etc)
Requirements:
1. needs to support digitize existing audio-video, photo and documents archives
2. needs to support analytics include pattern recognition of all kind (e.g., speech recognition, automatic A&V analysis, cultural patterns), identification of structures (lexical units, linguistic rules, etc.)
M0158
1. provide a common web-based platform for accessing various (i) network and graph analysis tools such as SNAP, NetworkX, Galib, etc. (ii) real-world and synthetic networks, (iii) computing resources and (iv) data management systems to the end-user in a seamless manner
2. Two types of changes: (i) the networks are very dynamic and (ii) as the repository grows, we expect rapid growth to over 1000-5000 networks and methods in about a year; the rate of graph-based data is growing at an increasing rate
3. Challenging due to asynchronous distributed computation. Current systems are designed for real time synchronous response
4. Parallel algorithms are necessary to analyze massive networks. Unlike many structured data, network data is difficult to partition. The main difficulty in partitioning a network is that different algorithms require different partitioning schemes for efficient operation
Software: Graph libraries: Galib, NetworkX. Distributed Workflow Management: Simfrastructure, databases, semantic web tools
Analytics: Network Visualization
Requirements:
1. needs to support environments to run various network and graph analysis tools
2. needs to support dynamic growth of the networks
3. needs to support asynchronous and real time synchronous distributed computing
4. needs to support different parallel algorithms for different partitioning schemes for efficient operation
M0190
1. see use case
Software: PERL, Python, C/C++, Matlab, R development tools. Create ground-up test and measurement applications.
Analytics: Information extraction, filtering, search, and summarization; image and voice biometrics; speech recognition and understanding; machine translation; video person/object detection and tracking; event detection; imagery/document matching; novelty detection; structural semantic temporal analytics
Requirements:
1. needs to support analytic algorithms working with written language, speech, human imagery, etc., which must generally be tested against real or realistic data; it is extremely challenging to engineer artificial data that sufficiently captures the variability of real data involving humans
M0130
1. see use case
Software: Integrated Rule Oriented Data System (iRODS)
Analytics: Supports general analysis workflows
Requirements:
1. needs to provide general analytics workflows
M0131
1. see use case
2. see use case
Software: database
Analytics: Data graph processing
Requirements:
1. needs to support data graph processing
2. needs to support RDBMS
M0189
1. Standard bioinformatics tools (BLAST, HMMER, multiple alignment and phylogenetic tools, gene callers, sequence feature predictors…), Perl/Python wrapper scripts, Linux Cluster scheduling
Software: Octopus for Tomographic Reconstruction, Avizo (http://vsg3d.com) and FIJI (a distribution of ImageJ)
Analytics: Volume reconstruction, feature identification, etc.
Requirements:
1. needs to support standard bioinformatics tools (BLAST, HMMER, multiple alignment and phylogenetic tools, gene callers, sequence feature predictors…), Perl/Python wrapper scripts, Linux Cluster scheduling
M0170
1. see use case
2. see use case
Software: Custom data processing pipeline and data analysis software
Analytics: Detection of rare events and relation to existing diverse data
Requirements:
1. needs to support a wide variety of the existing astronomical data analysis tools, plus a large amount of custom developed tools and software, some of it a research project in itself
2. needs to support automated classification with machine learning tools given the very sparse and heterogeneous data, dynamically evolving in time as more data come in, with follow-up decision making reflecting limited follow up resources
M0185
1. see use case
Software: MPI, FFTW, viz packages, FFTW, numpy, Boost, OpenMP, ScaLAPCK, PSQL & MySQL databases, Eigen, cfitsio, astrometry.net, and Minuit2
Analytics: New analytics needed to analyze simulation results
Requirements:
1. needs to support interpretation of results from detailed simulations requires advanced analysis and visualization techniques and capabilities
M0209
1. see use case
2. see use case
Software: Linux cluster, Oracle RDBMS server, Postgres PSQL, large memory machines, standard Linux interactive hosts, GPFS. For simulations, HPC resources. Standard astrophysics reduction software as well as Perl/Python wrapper scripts
Analytics: Machine Learning to find optical transients. Cholesky decomposition for thousands of simulations with matrices of order 1M on a side and parallel image storage
Requirements:
1. needs to support analysis on both the simulation and observational data simultaneously
2. needs to support techniques for handling Cholesky decomposition for thousands of simulations with matrices of order 1M on a side (see the sketch below)
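A minimal sketch of the Cholesky operation itself, using NumPy on a small matrix; matrices of order ~1M on a side, as in this use case, would require distributed solvers (e.g., ScaLAPACK), which this sketch does not show.

    # Minimal sketch: Cholesky factorization of a (small) covariance matrix with NumPy,
    # then drawing correlated Gaussian samples, a common use in simulation pipelines.
    import numpy as np

    rng = np.random.default_rng(1)
    A = rng.normal(size=(500, 500))
    cov = A @ A.T + 500 * np.eye(500)     # symmetric positive definite matrix

    L = np.linalg.cholesky(cov)           # cov = L @ L.T
    samples = L @ rng.normal(size=(500, 10))
    print(np.allclose(L @ L.T, cov))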
M0166
1. processing experimental data specific to each experiment (ALICE, ATLAS, CMS, LHCb) and producing summary information
2. analysis uses “exploration” (histograms, scatter-plots) with model fits
3. Substantial Monte-Carlo computations to estimate analysis quality
Software: Grid-based environment with over 350,000 cores running simultaneously
Analytics: Sophisticated specialized data analysis code followed by basic exploratory statistics (histogram) with complex detector efficiency corrections
Requirements:
1. needs to support experimental data from ALICE, ATLAS, CMS, LHCb
2. needs to support histograms, scatter-plots with model fits
3. needs to support Monte-Carlo computations
M0155
1. see use case
2. see use case
3. see use case
Software: Custom analysis based on flat file data storage
Analytics: Pattern recognition, demanding correlation routines, high level parameter extraction
Requirements:
1. needs to support Queen Bea architecture with mix of distributed on-sensor and central processing for 5 distributed sites
2. needs to support real-time monitoring of equipment by partial streaming analysis
3. needs to host rich set of Radar image processing services using machine learning, statistical modelling, and graph algorithms
M0157
1. Data assimilation, statistical analysis, data mining, data extraction, scientific modeling and simulation, scientific workflow
Software: R and Python (Matplotlib) for visualization. Custom software for processing
Analytics: Data assimilation, (Statistical) analysis, Data mining, Data extraction, Scientific modeling and simulation, Scientific workflow
Requirements:
1. needs to support diversified analytics tools
M0167
1. Radar signal processing in Matlab. Image analysis is MapReduce or MPI plus C/Java
2. Sophisticated signal processing, novel image processing to find layers (can be hundreds, one per year)
Software: Matlab for custom raw data processing. Custom image processing software. User Interface is a Geographical Information System
Analytics: Custom signal processing to produce Radar images that are analyzed by image processing to find layers
Requirements:
1. needs to support legacy software (Matlab) and language (C/Java) binding for processing
2. needs to support signal processing and advanced image processing to find layers
M0127
1. see use case
2. see use case
3. see use case
4. see use case
Software: ROI_PAC, GeoServer, GDAL, GeoTIFF-supporting tools. Moving to Clouds
Analytics: Process Raw Data to get images which are run through image processing tools and accessed from GIS
Requirements:
1. needs to support geolocated data requires GIS integration of data as custom overlays
2. needs to support significant human intervention in data processing pipeline
3. needs to support host rich set of Radar image processing services
4. needs to support ROI_PAC, GeoServer, GDAL, GeoTIFF-supporting tools
M0182
1. see use case
Software: SGE Univa Grid Engine Version 8.1, iRODS version 3.2 and/or 3.3, IBM Global Parallel File System (GPFS) version 3.4, Cloudera version 4.5.2-1.
Analytics: Federation software
Requirements:
1. needs to support Climate Analytics as a Service on Clouds
M0129
1. see use case
Software: Cloudera, iRODS, Amazon AWS
Analytics: Climate Analytics-as-a-Service (CAaaS).
Requirements:
1. needs to support Climate Analytics as a Service on Clouds
M0090
1. MapReduce or the like, SciDB or other scientific database
2. Turbulence observations would be updated continuously
3. Event-specification language needed to perform data mining / event searches
4. Semantics (interpretation of multiple reanalysis products), data movement, database(s) with optimal structuring for 4-dimensional data mining.
Software: MapReduce or the like; SciDB or other scientific database.
Analytics: Data mining customized for specific event types
Requirements:
1. needs to support MapReduce, SciDB, and other scientific databases
2. needs to support continuous computation for updates
3. needs to support an event-specification language for data mining and event searching (see the sketch below)
4. needs to support semantics interpretation and optimal structuring for 4-dimensional data mining and predictive analysis
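A minimal sketch of an event search expressed in map/reduce style over gridded records, in plain Python; the record layout and the event predicate are hypothetical stand-ins for what an event-specification language would express.

    # Minimal sketch: an "event search" in map/reduce style over records assumed
    # to be (time, lat, lon, value). Predicate and field layout are hypothetical.
    from itertools import groupby
    from operator import itemgetter

    def map_phase(records, threshold=50.0):
        for time, lat, lon, value in records:
            if value > threshold:                  # event predicate, e.g. turbulence spike
                yield (round(lat), round(lon)), 1  # key by 1-degree grid cell

    def reduce_phase(mapped):
        mapped = sorted(mapped, key=itemgetter(0))
        for cell, group in groupby(mapped, key=itemgetter(0)):
            yield cell, sum(count for _, count in group)

    records = [(0, 37.2, -122.1, 60.0), (1, 37.6, -122.4, 10.0), (2, 36.9, -121.8, 75.0)]
    print(dict(reduce_phase(map_phase(records))))  # {(37, -122): 2}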
M0186
1. see use case
Software: NCAR PIO library and utilities NCL and NCO, parallel NetCDF
Analytics: Need analytics next to data storage
Requirements:
1. needs to support data analytics close to data storage
M0184
1. see use case
Software: EddyPro, Custom analysis software, R, python, neural networks, Matlab
Analytics: Data mining, data quality assessment, cross-correlation across datasets, data assimilation, data interpolation, statistics, quality assessment, data fusion
Requirements:
1. needs to support custom software: EddyPro, custom analysis software, R, Python, neural networks, Matlab
M0223
1. see use case
Software: R/Matlab, Weka, Hadoop. GIS based visualization
Analytics: Forecasting models, machine learning models, time series analysis, clustering, motif detection, complex event processing, visual network analysis
Requirements:
1. needs to support new machine learning analytics to predict consumption (see the sketch below)
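A minimal sketch of predicting next-interval consumption from lagged readings with a linear model, assuming scikit-learn; the synthetic data and feature design are illustrative placeholders for the use case's forecasting models.

    # Minimal sketch: forecast the next consumption value from lagged smart-meter
    # readings. Data is synthetic; feature design is illustrative only.
    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(0)
    consumption = 10 + np.sin(np.arange(500) * 2 * np.pi / 48) + rng.normal(0, 0.2, 500)

    lags = 48                                       # one day of half-hourly readings
    X = np.array([consumption[i:i + lags] for i in range(len(consumption) - lags)])
    y = consumption[lags:]

    model = LinearRegression().fit(X[:-48], y[:-48])   # hold out the last day
    print("next-interval prediction:", model.predict(X[-1:])[0])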
Capability
General Requirement
1. needs to support legacy and advanced software packages (subcomponent: SaaS)
(30: M0078, M0089, M0127, M0136, M0140, M0141, M0158, M0160, M0161, M0164, M0164, M0166, M0167, M0172, M0173, M0174, M0176, M0177, M0183, M0188, M0191, M0209, M0210, M0212, M0213, M0214, M0215, M0219, M0219, M0223)
2. needs to support legacy and advanced computing platforms (subcomponent: PaaS)
(17: M0078, M0089, M0127, M0158, M0160, M0161, M0164, M0164, M0171, M0172, M0173, M0177, M0182, M0188, M0191, M0209, M0223)
3. needs to support legacy and advanced distributed computing clusters, co-processors, and I/O processing (subcomponent: IaaS)
(24: M0015, M0078, M0089, M0090, M0129, M0136, M0140, M0141, M0155, M0158, M0161, M0164, M0164, M0166, M0167, M0173, M0174, M0176, M0177, M0185, M0186, M0191, M0214, M0215)
4. needs to support elastic data transmission (subcomponent: networking)
(14: M0089, M0090, M0103, M0136, M0141, M0158, M0160, M0172, M0173, M0176, M0191, M0210, M0214, M0215)
5. needs to support legacy, large, and advanced distributed data storage (subcomponent: storage)
(35: M0078, M0089, M0127, M0140, M0147, M0147, M0148, M0148, M0155, M0157, M0157, M0158, M0160, M0161, M0164, M0164, M0165, M0166, M0167, M0170, M0171, M0172, M0173, M0174, M0176, M0176, M0182, M0185, M0188, M0209, M0209, M0210, M0210, M0215, M0219)
6. needs to support legacy and advanced programming executables, applications, tools, utilities, and libraries
(13: M0078, M0089, M0140, M0164, M0166, M0167, M0174, M0176, M0184, M0185, M0190, M0214, M0215)
M0147
1. 380 TB scanned documents at a centralized storage [380 TB]
Requirements:
1. needs to support large centralized storage (storage)
M0148
1. Hundreds of terabytes, and growing
2. NetApps, Hitachi, Magnetic tapes
Requirements:
1. needs to support large data storage
2. needs to support various storages such as NetApps, Hitachi, Magnetic tapes
M0219
1. see use case
Requirements:
1. needs to support software includes Hadoop, Spark, Hive, R, SAS, Mahout, Allegrograph, MySQL, Oracle, Storm, BigMemory, Cassandra, Pig
M0222
1. see use case
Requirements:
1. needs to support software includes Hadoop, Spark, Hive, R, SAS, Mahout, Allegrograph, MySQL, Oracle, Storm, BigMemory, Cassandra, Pig
M0161
1. Amazon EC2 with HDFS
2. Amazon S3
3. running Hadoop
4. Uses Scribe, Hive, Mahout, Python
5. 15 TB currently, growing about 1TB/month
6. Hadoop batch jobs are scheduled daily, but begun on real-time recommendation
Requirements:
1. needs to support EC2 with HDFS (infrastructure)
2. needs to support S3 (storage)
3. needs to support Hadoop (platform)
4. needs to support Scribe, Hive, Mahout, Python (language)
5. needs to support moderate storage (15 TB, growing 1 TB/month)
6. needs to support batch and real-time processing (see the sketch below)
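A minimal sketch of a Hadoop Streaming-style mapper/reducer pair in Python, of the kind that could run as a daily batch job; the log format and field positions are hypothetical, and the script would be supplied to the Hadoop Streaming jar as the mapper and reducer.

    #!/usr/bin/env python
    # Minimal sketch of a Hadoop Streaming mapper/reducer pair in one file,
    # selected by a command-line argument. Reads tab-separated log lines on
    # stdin; field positions are hypothetical.
    import sys

    def mapper():
        for line in sys.stdin:
            fields = line.rstrip("\n").split("\t")
            if len(fields) >= 2:
                print(f"{fields[1]}\t1")            # emit (doc_id, 1)

    def reducer():
        current, total = None, 0
        for line in sys.stdin:
            key, _, value = line.rstrip("\n").partition("\t")
            if key != current:
                if current is not None:
                    print(f"{current}\t{total}")
                current, total = key, 0
            total += int(value or 0)
        if current is not None:
            print(f"{current}\t{total}")

    if __name__ == "__main__":
        mapper() if sys.argv[1:] == ["map"] else reducer()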
M0164
1. Amazon Web Services AWS with Hadoop and Pig
2. Uses Cassandra NoSQL technology with Hive, Teradata
3. Summer 2012. 25 million subscribers, 4 million ratings per day
4. 3 million searches per day
5. 1 billion hours streamed in June 2012. Cloud storage 2 petabytes (June 2013)
6. Significant I/O intensive processing needed
Requirements:
1. needs to support Hadoop (platform)
2. needs to support Pig (language)
3. needs to support Cassandra and Hive
4. needs to support very large numbers of subscribers, ratings, and searches per day (DB)
5. needs to support huge storage (2 PB)
6. needs to support I/O intensive processing
M0165
1. Inverted Index not huge, crawled documents are petabytes of text – rich media much more
Requirements:
1. needs to support petabytes of text and rich media (storage)
M0137
1. see use case
2. see use case
Requirements:
1. needs to support Hadoop
2. needs to support commercial cloud services
M0103
1. LAN/T1/Internet Web Pages
Requirements:
1. needs to support Internet connectivity
M0176
1. Hopper.nersc.gov (150K cores)
2. GPFS
3. MongoDB
4. 10Gb
5. PyMatGen, FireWorks, VASP, ABINIT, NWChem, BerkeleyGW, varied community codes
6. 100TB (current), 500TB within 5 years
7. Scalable key-value and object store databases needed
8. Data streams from simulation at centralized peta/exascale systems
Requirements:
1. needs to support massive (150K-core) legacy infrastructure (infrastructure)
2. needs to support GPFS (General Parallel File System) (storage)
3. needs to support MongoDB systems (platform) (see the sketch after this list)
4. needs to support 10Gb networking
5. needs to support various analytic tools such as PyMatGen, FireWorks, VASP, ABINIT, NWChem, BerkeleyGW, and varied community codes
6. needs to support large storage (storage)
7. needs to support scalable key-value and object store (platform)
8. needs to support data streams from peta/exascale centralized simulation systems
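A minimal sketch of using MongoDB as a document (key-value/object) store, assuming the pymongo driver and a locally running MongoDB instance; the host, database, and field names are hypothetical.

    # Minimal sketch: storing and querying computation records in MongoDB as a
    # document (key-value/object) store. Host, database, and fields are hypothetical.
    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017/")
    db = client["materials"]

    db.calculations.insert_one({
        "material_id": "mp-0001",          # hypothetical identifier
        "formula": "Fe2O3",
        "band_gap_eV": 2.0,
        "source": "VASP run",
    })

    doc = db.calculations.find_one({"formula": "Fe2O3"})
    print(doc["band_gap_eV"] if doc else "not found")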
M0213
1. see use case
Requirements:
1. needs to support software includes Geospatially enabled RDBMS, Geospatial server/analysis software – ESRI ArcServer, Geoserver
M0214
1. see use case
2. see use case
3. see use case
Requirements:
1. needs to support a wide range of custom software and tools, including traditional RDBMSs and display tools
2. needs to support several network requirements
3. needs to support GPU usage
M0215
1. see use case
2. see use case
3. see use case
Requirements:
1. needs to support tolerance of unreliable networks to warfighters and remote sensors
2. needs to support up to 100s of PBs of data supported by modest to large clusters and clouds
3. needs to support software includes Hadoop, Accumulo (Big Table), Solr, NLP (several variants), Puppet (for deployment and security), Storm, Custom applications and visualization tools
M0177
1. Capabilities / SW: Hadoop, Hive, R. Unix-based
2. Capabilities / Compute: Cray supercomputer
3. Capabilities / Storage & SW: Teradata, PostgreSQL, MongoDB
4. Capabilities / NW: Various, with significant I/O intensive processing
Requirements:
1. needs to support Hadoop, Hive, R. Unix-based
2. needs to support Cray supercomputer
3. needs to support Teradata, PostgreSQL, MongoDB
4. needs to support various, with significant I/O intensive processing
M0089
1. Supercomputers, Cloud
2. SAN or HDFS with 1GB raw image data + 1.5GB analytical results per 2D image, 1TB raw image data + 1TB analytical results per 3D image. 1PB data per moderated hospital per year
3. Need excellent external network link
4. MPI for image analysis, MapReduce + Hive with spatial extension
Requirements:
1. needs to support legacy system and cloud (computing cluster)
2. needs to support huge legacy and new storage such as SAN or HDFS (storage)
3. needs to support high throughput network link (networking)
4. needs to support MPI image analysis, MapReduce, Hive with spatial extension (sw pkgs)
M0191
1. ImageJ, OMERO, VolRover, advanced segmentation and feature detection methods from applied math researchers…. Scalable key-value and object store databases needed.
2. 150K cores, refer to Hopper.nersc.gov
3. Database and image collections
4. 10Gb, could use 100Gb and advanced networking (SDN) later
Requirements:
1. needs to support ImageJ, OMERO, VolRover, advanced segmentation and feature detection methods from applied math researchers.... Scalable key-value and object store databases needed.
2. needs to support NERSC's Hopper infrastructure
3. needs to support database and image collections.
4. needs to support 10 Gb networking now, with 100 Gb and advanced networking (SDN) in the future
M0078
1. 72-core cluster for our NIST group, collaboration with >1000 core clusters at FDA, some groups are using cloud
2. 40TB NFS is full, will need >100TB in 1-2 years at NIST; the healthcare community will need many PBs of storage
3. Open-source sequencing bioinformatics software from academic groups (UNIX-based)
Requirements:
1. needs to support legacy computing cluster and other PaaS and IaaS (computing cluster)
2. needs to support huge data storage in PB range (storage)
3. needs to support Unix-based legacy sequencing bioinformatics software (sw pkg)

M0188
1. 50 TB
2. needs scalable RDBMS for heterogeneous biological data
3. real time rapid and parallel bulk loading
4. Oracle RDBMS, SQLite files, flat text files, Lucy (a version of Lucene) for keyword searches, BLAST databases, USEARCH databases
5. Linux cluster, Oracle RDBMS server, large memory machines, standard Linux interactive hosts
Requirements:
1. needs to support huge data storage
2. needs to support scalable RDBMS for heterogeneous biological data
3. needs to support real-time rapid and parallel bulk loading (see the sketch after this list)
4. needs to support Oracle RDBMS, SQLite files, flat text files, Lucy (a version of Lucene) for keyword searches, BLAST databases, USEARCH databases
5. needs to support Linux cluster, Oracle RDBMS server, large memory machines, standard Linux interactive hosts
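A minimal sketch of bulk loading records into SQLite inside a single transaction with executemany; the table and column names are hypothetical, and the use case's Oracle and parallel-loading setup is not reproduced here.

    # Minimal sketch: bulk loading tabular records into SQLite with executemany
    # inside one transaction, which is usually much faster than row-at-a-time
    # inserts. Table and column names are hypothetical.
    import sqlite3

    rows = [("gene_%d" % i, "scaffold_1", i * 100, (i + 1) * 100) for i in range(10000)]

    con = sqlite3.connect(":memory:")
    con.execute("""CREATE TABLE features
                   (name TEXT, contig TEXT, start INTEGER, stop INTEGER)""")
    with con:                                   # wraps the load in one transaction
        con.executemany("INSERT INTO features VALUES (?, ?, ?, ?)", rows)
    print(con.execute("SELECT COUNT(*) FROM features").fetchone()[0])
    con.close()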
M0140
1. Mayo internal data warehouse called Enterprise Data Trust (EDT), Open source Hbase
2. supercomputers, cloud and parallel computing
3. significant I/O intensive processing needed
4. HDFS storage
5. Custom code to develop new patient properties from stored data.
Requirements:
1. needs to support data warehouse, open source indexed Hbase
2. needs to support supercomputers, cloud and parallel computing
3. needs to support I/O intensive processing
4. needs to support HDFS storage
5. needs to support custom code to develop new properties from stored data.
M0174
1. Backend data in database or NoSQL stores
2. Cloud and parallel computing
3. A high performance computer (48 GB RAM) is needed to run the code for a few hundred patients. Clusters for large datasets
4. Clusters for large datasets
5. A 200 GB – 1 TB hard drive typically stores the test data. The relevant data is retrieved to main memory to run the algorithms. Backend data in database or NoSQL stores
Requirements:
1. needs to support Java, some in house tools, [relational] database and NoSQL stores
2. needs to support cloud and parallel computing
3. needs to support high performance computer, 48 GB RAM [to perform analysis for a moderate sample size]
4. needs to support clusters for large datasets
5. needs to support 200 GB - 1 TB hard drive for test data
M0172
1. Would require very large amount of movement of data to enable visualization
2. Distributed (MPI) based simulation system
3. MPI written in Charm++
4. Network file system using database driven techniques
5. infiniband for high bandwidth 3D Torus
Requirements:
1. needs to support movement of very large amount of data for visualization (networking)
2. needs to support distributed MPI-based simulation system (platform) (see the sketch after this list)
3. needs to support Charm++ on multi-nodes (software)
4. needs to support network file system (storage)
5. needs to support infiniband network (networking)
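A minimal sketch of a distributed MPI simulation step, assuming mpi4py; the "physics" update and ring-style boundary exchange are placeholders and do not reproduce the use case's Charm++ code.

    # Minimal sketch: each MPI rank advances its own partition of the state and
    # exchanges boundary values with its neighbours in a ring. Placeholder physics.
    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    local = np.full(1000, float(rank))            # this rank's slice of the state

    for step in range(10):
        local *= 0.99                             # placeholder "simulation" update
        right = (rank + 1) % size                 # ring topology
        left = (rank - 1) % size
        recv = comm.sendrecv(local[-1], dest=right, source=left)
        local[0] = 0.5 * (local[0] + recv)

    total = comm.reduce(float(local.sum()), op=MPI.SUM, root=0)
    if rank == 0:
        print("global sum:", total)

This would be launched with an MPI runner, e.g. mpiexec -n 4 python sim_sketch.py.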
M0173
1. Provide a computing infrastructure that models social contagion processes: enables capturing different types of human-to-human interactions, such as voicing unhappiness with government leadership, peaceful demonstrations, and violent protests
2. File servers (including archives), databases
3. Ethernet, Infiniband, and similar
4. Specialized simulators, open source software, and proprietary modeling environments
5. account for heterogeneous features of 100s of millions or billions of individuals, and models of cultural variations across countries
Requirements:
1. needs to support computing infrastructure which can capture human-to-human interactions on various social events via the Internet (infrastructure)
2. needs to support file servers and databases (platform)
3. needs to support Ethernet and Infiniband networking (networking)
4. needs to support specialized simulators, open source software, and proprietary modeling (application)
5. needs to support huge user accounts across country boundaries (networking)
M0141
1. see use case
2. see use case
Requirements:
1. needs to support expandable on-demand based storage resource for global users
2. needs to support cloud community resource required
M0136
1. see use case
2. see use case
3. see use case
Requirements:
1. needs to support GPU
2. needs to support high performance MPI and HPC Infiniband cluster
3. needs to support distributed computation of dense BLAS-like or LAPACK-like operations on GPUs; libraries for single-machine or single-GPU computation are available (e.g., BLAS, CuBLAS, MAGMA), but distributed GPU solutions remain poorly developed, and existing solutions (e.g., ScaLAPACK for CPUs) are not well integrated with higher-level languages and require low-level programming, which lengthens experiment and development time
M0171
1. see use case
Requirements:
1. needs to support Hadoop or enhanced MapReduce
M0160
1. Need to move towards Hadoop/HDFS
2. running IndexedHBase
3. Hive, SciPy, NumPy, MPI, and Redis as an in-memory database buffer for real-time analysis
4. 10 Gb/InfiniBand required
Requirements:
1. needs to support Hadoop and HDFS (platform)
2. needs to support IndexedHBase, Hive, SciPy, NumPy (software)
3. needs to support in-memory database, MPI (platform)
4. needs to support high-speed Infiniband network (networking)
M0158
1. 628 TB GPFS; a single network can be hundreds of GB
2. Internet, infiniband. A loose collection of supercomputing resources
3. A high performance computing cluster (DELL C6100), named Shadowfax, of 60 compute nodes and 12 processors (Intel Xeon X5670 2.93GHz) per compute node with a total of 720 processors and 4GB main memory per processor
4. EC2 based clouds are also used
5. Graph libraries: Galib, NetworkX, Distributed Workflow Management: Simfrastructure, databases, semantic web tools
Requirements:
1. needs to support large file system (storage)
2. needs to support various network connectivity (networking)
3. needs to support existing computing cluster
4. needs to support EC2 computing cluster
5. needs to support various graph libraries, management tools, databases, semantic web tools
M0190
1. see use case
Requirements:
1. needs to support PERL, Python, C/C++, Matlab, R development tools. Create ground-up test and measurement applications
M0130
1. see use case
2. see use case
Requirements:
1. needs to support iRODS data management software
2. needs to support interoperability across Storage and Network Protocol Types
M0163
1. see use case
Requirements:
1. needs to support software: Symfony-PHP, Linux, MySQL
M0131
1. see use case
2. see use case
Requirements:
1. needs to support cloud community resource required
M0189
1. see use case
Requirements:
1. needs to support high volume data transfer to remote batch processing resource
M0185
1. see use case
2. see use case
Requirements:
1. needs to support MPI, OpenMP, C, C++, F90, FFTW, viz packages, python, FFTW, numpy, Boost, OpenMP, ScaLAPCK, PSQL & MySQL databases, Eigen, cfitsio, astrometry.net, and Minuit2
2. needs to support supercomputer I/O subsystem limitations must be addressed
M0209
1. see use case
2. see use case
3. see use case
Requirements:
1. needs to support standard astrophysics reduction software as well as Perl/Python wrapper scripts
2. needs to support Oracle RDBMS, Postgres psql, as well as GPFS and Lustre file systems and tape archives
3. needs to support parallel image storage
M0166
1. legacy system running 200,000 cores “continuously”
2. mainly distributed cached files
3. object databases (Objectivity) were explored
Requirements:
1. needs to support legacy computing infrastructure (computing nodes)
2. needs to support distributed cached files (storage)
3. needs to support object databases (sw pkg)
M0210
1. see use case
2. see use case
3. see use case
4. see use case
Requirements:
1. needs to support 120 PB raw data
2. needs to support an international distributed computing model to augment that at the accelerator (Japan)
3. needs to support data transfer of ~20Gbps at designed luminosity between Japan and US
4. needs to support software from Open Science Grid, Geant4, DIRAC, FTS, Belle II framework
M0155
1. see use case
Requirements:
1. needs to support architecture compatible with ENVRI Environmental Research Infrastructure collaboration
M0157
1. Different research infrastructures are designed for different purposes and evolve over time. The designers describe their approaches from different points of view, in different levels of detail and using different typologies
2. architectures, metadata frameworks, data discovery in scattered repositories, visualization and data curation
Requirements:
1. needs to support variety of computing infrastructures and architectures (infrastructure)
2. needs to support scattered repositories (storage)
M0167
1. ~40 TB removable disk array
2. Data accumulated in ~100 TB chunks for each expedition
3. ~0.5 Petabytes/year raw data, Image analysis is MapReduce or MPI plus C/Java
Requirements:
1. needs to support ~0.5 Petabytes/year of raw data
2. needs to support transfer content from removable disk to computing cluster for parallel processing
3. needs to support MapReduce or MPI plus language binding for C/Java
M0127
1. see use case
2. see use case
3. see use case
4. see use case
Requirements:
1. needs to support interoperable Cloud-HPC architecture should be supported
2. needs to support host rich set of Radar image processing services
3. needs to support ROI_PAC, GeoServer, GDAL, GeoTIFF-supporting tools
4. needs to support compatibility with other NASA Radar systems and repositories (Alaska Satellite Facility)
M0182
1. see use case
2. see use case
3. see use case
Requirements:
1. needs to support a virtual Climate Data Server (vCDS)
2. needs to support GPFS Parallel File System integrated with Hadoop
3. needs to support iRODS
M0129
1. see use case
2. see use case
3. see use case
Requirements:
1. needs to support NetCDF aware software
2. needs to support MapReduce
3. needs to support Interoperable Use of Amazon AWS and local clusters
M0090
1. NASA Earth Exchange (NEX) - Pleiades supercomputer
2. Re-analysis datasets are likely to be too large to relocate to the supercomputer of choice (in this case NEX), therefore the fastest networking possible would be needed
Requirements:
1. needs to support other legacy computing systems (e.g. supercomputer)
2. needs to support high throughput data transmission over the network
M0186
1. see use case
Requirements:
1. needs to support extend architecture to several other fields
M0183
1. see use case
Requirements:
1. needs to support postgres, HDF5 data technologies and many custom software systems
M0184
1. see use case
2. see use case
Requirements:
1. needs to support custom software like EddyPro and analysis software like R, Python, neural networks, Matlab
2. needs to support analytics includes data mining, data quality assessment, cross-correlation across datasets, data assimilation, data interpolation, statistics, quality assessment, data fusion, etc.
M0223
1. see use case
2. see use case
Requirements:
1. needs to support SQL databases, CSV files, HDFS (platform)
2. needs to support R/Matlab, Weka, Hadoop (platform)
Data Consumer
General Requirement
1. needs to support fast search (~0.1 seconds) from processed data with high relevancy, accuracy, and high recall
(4: M0148, M0160, M0165, M0176)
2. needs to support diversified output file formats for visualization, rendering, and reporting
(16: M0078, M0089, M0090, M0157, M0161, M0164, M0164, M0165, M0166, M0166, M0167, M0167, M0174, M0177, M0213, M0214)
3. needs to support visual layout for results presentation
(2: M0165, M0167)
4. needs to support rich user interface for access using browser, visualization tools
(11: M0089, M0127, M0157, M0160, M0162, M0167, M0167, M0183, M0184, M0188, M0190)
5. needs to support high resolution multi-dimension layer of data visualization
(21: M0129, M0155, M0155, M0158, M0161, M0162, M0171, M0172, M0173, M0177, M0179, M0182, M0185, M0186, M0188, M0191, M0213, M0214, M0215, M0219, M0222)
6. needs to support streaming results to clients
(1: M0164)
M0148
1. Search results should have high relevancy and high recall
2. Categorization of records should be highly accurate
3. NetApps, Hitachi, Magnetic tapes
Requirements:
1. needs to support high relevancy and high recall from search
2. needs to support high accuracy from categorization of records
3. needs to support various storages such as NetApps, Hitachi, Magnetic tapes
M0219
1. see use case
Requirements:
1. needs to support data visualization for data review, operational activity and general analysis. It continues to evolve.
M0222
1. see use case
Requirements:
1. needs to support data visualization for data review, operational activity and general analysis. It continues to evolve.
M0161
1. custom built reporting tools for aggregating readership and social activities per document
2. Network visualization via Gephi, scatterplots of readership vs. citation rate, etc.
Requirements:
1. needs to support custom built reporting tools
2. needs to support visualization tools such as networking graph, scatterplots, etc.
M0164
1. Streaming media
Requirements:
1. needs to support streaming and rendering media
M0165
1. Return in ~0.1 seconds
2. number of great responses in top 10 ranked results
3. Page layout is critical
Requirements:
1. needs to support search time in ~0.1 seconds
2. needs to support top 10 ranked results
3. needs to support page layout (visual)
M0162
1. Important for materials discovery. Potentially important to understand the dependency of properties on the many independent variables. Virtually unaddressed
2. Multi-variable materials data visualization tools, in which the number of variables can be quite high
Requirements:
1. needs to support visualization for materials discovery from many independent variables
2. needs to support visualization tools for multi-variable materials
M0176
1. Materials browsers as data from search grows
Requirements:
1. needs to support browser-based to search growing material data
M0213
1. see use case
Requirements:
1. needs to support visualization with GIS at high and low network bandwidths and on dedicated facilities and handhelds
M0214
1. see use case
2. see use case
Requirements:
1. needs to support visualization of extracted outputs, typically as overlays on a geospatial display; overlay objects should link back to the originating image/video segment
2. needs to support output in the form of OGC-compliant web features or standard geospatial files (shapefiles, KML)
M0215
1. see use case
Requirements:
1. needs to support primary visualizations as geospatial overlays (GIS) and network diagrams
M0177
1. see use case
Requirements:
1. needs to provide results of analytics, including specific visualization techniques, for use by data consumers / stakeholders, i.e., those who did not actually perform the analysis
M0089
1. Visualization is needed for validation and training
Requirements:
1. needs to support visualization for validation and training
M0191
1. Heavy use of 3D structural models.
Requirements:
1. needs to support 3D structural modeling
M0078
1. Genome browsers have been developed to visualize processed data
Requirements:
1. needs to support data format for Genome browsers
M0188
1. real time interactive parallel bulk loading capable system
2. interactive Web UI with core data, backend precomputations, batch job computation submission from the UI.
3. Ability to download large amounts of data for offline analysis is another requirement of the system.
4. Web UI’s still seem to be the preferred interface for most biologists. It is used for basic querying and browsing of data.
5. The less quantitative part includes the ability to visualize structural details at different levels of resolution. … more abstract representations such as representing a group of highly similar genomes in a pangenome
Requirements:
1. needs to support real time interactive parallel bulk loading capability
2. needs to support interactive Web UI, backend precomputations, batch job computation submission from the UI
3. needs to support download assembled and annotated datasets for offline analysis.
4. needs to support ability to query and browse data via interactive Web UI.
5. needs to support visualize structure [of data] at different levels of resolution. Ability to view abstract representations of highly similar data.
M0174
1. The visualization of the entire input data is nearly impossible. But typically, partially visualizable. The models built can be visualized under some reasonable assumptions.
Requirements:
1. needs to support visualization of subsets of very large data
M0172
1. Would require very large amount of movement of data to enable visualization
Requirements:
1. needs to support visualization
M0173
1. Large datasets, time evolution, multiple contagion processes over multiple network representations. Levels of detail (e.g., individual, neighborhood, city, state, country-level)
2. interactions. Visualization of results
Requirements:
1. needs to support multi-levels detail network representations
2. needs to support visualization with interactions
M0141
1. Requires advanced and rich visualization, high definition visualisation facilities, visualisation data
2. 4D visualization, Visualizing effects of parameter change in (computational) models. Comparing model outcomes with actual observations (multi dimensional)
Requirements:
1. needs to support advanced / rich / high definition visualization
2. needs to support 4D visualization
M0171
1. see use case
Requirements:
1. needs to support visualization of large-scale 3D reconstructions and navigation of large-scale collections of images that have been aligned to maps
M0160
1. data retrieval, big data visualization, information diffusion, clustering, and dynamic network visualization capabilities already exist
2. data-interactive Web interfaces
3. public API for data querying
Requirements:
1. needs to support data retrieval and dynamic visualization
2. needs to support data driven interactive web interfaces
3. needs to support API for data query
M0158
1. As the input graph size grows the visualization system on client side is stressed heavily both in terms of data and compute
Requirements:
1. needs to support client side visualization
M0190
1. see use case
Requirements:
1. needs to support analytic flows involving users
M0130
1. see use case
Requirements:
1. needs to support general visualization workflows
M0131
1. see use case
Requirements:
1. needs to support efficient data-graph based visualization is needed
M0170
1. see use case
Requirements:
1. needs to support visualization mechanisms for highly dimensional data parameter spaces
M0185
1. see use case
Requirements:
1. needs to support interpretation of results using advanced visualization techniques and capabilities
M0166
1. Modest use of visualization outside histograms and model fits
Requirements:
1. needs to support histograms and model fits (visual)
M0155
1. see use case
Requirements:
1. needs to support visualization of high-dimensional (>=5) data
M0157
1. graph plotting tools, Google Chart Tools
2. interactive time series line chart
3. browser using Flash
4. instance maps of European high-resolution
5. Visual tools for comparisons between products for high scientific quality
Requirements:
1. needs to support graph plotting tools
2. needs to support time-series interactive tools
3. needs to support browser-based Flash playback
4. needs to support earth high-resolution map display
5. needs to support visual tools for quality comparisons
M0167
1. User Interface is a Geographical Information System
2. Rich user interface for layers and glacier simulations
Requirements:
1. needs to support GIS user interface
2. needs to support rich user interface for simulations
M0127
1. see use case
Requirements:
1. needs to support field expedition users with phone/tablet interfaces and low-resolution downloads
M0182
1. see use case
Requirements:
1. needs to support visualization of distributed heterogeneous data
M0129
1. see use case
Requirements:
1. needs to support high end distributed visualization
M0090
1. Useful for interpretation of results.
Requirements:
1. needs to support visualization to interpret results
M0186
1. see use case
2. see use case
Requirements:
1. needs to support sharing data with the worldwide climate community
2. needs to support high end distributed visualization
M0183
1. see use case
Requirements:
1. needs to support phone based input and access
M0184
1. see use case
Requirements:
1. needs to support phone based input and access
Security & Privacy
General Requirement
1. needs to support protect and preserve security and privacy on sensitive data
(32: M0078, M0089, M0103, M0140, M0141, M0147, M0148, M0157, M0160, M0162, M0164, M0165, M0166, M0166, M0167, M0167, M0171, M0172, M0173, M0174, M0176, M0177, M0190, M0191, M0210, M0211, M0213, M0214, M0215, M0219, M0222, M0223)
2. needs to support multi-level policy-driven, sandbox, access control, authentication on protected data
(13: M0006, M0078, M0089, M0103, M0140, M0161, M0165, M0167, M0176, M0177, M0188, M0210, M0211)
M0147
1. Title 13 data
Requirements:
1. needs to support Title 13 data
M0148
1. Security needs to be more robust
Requirements:
1. needs to support security policy
M0219
1. see use case
2. see use case
Requirements:
1. needs to support improving recommendation systems that reduce costs and improve quality while providing confidentiality safeguards that are reliable and publicly auditable
2. needs to keep all data both confidential and secure; all processes must be auditable for security and confidentiality as required by various legal statutes
M0222
1. see use case
Requirements:
1. needs to keep all data both confidential and secure; all processes must be auditable for security and confidentiality as required by various legal statutes
M0175
1. see use case
Requirements:
1. needs to support strong security and privacy constraints
M0161
1. Researchers often want to keep what they're reading private, especially industry researchers, so the data about who's reading what has access controls
Requirements:
1. needs to support access controls for who's reading what content
M0164
1. Need to preserve privacy for users and digital rights for media
Requirements:
1. needs to support preservation of users' privacy and digital rights for media
M0165
1. Exact results not essential but important to get main hubs and authorities for search query
2. Need to be sensitive to crawling restrictions. Avoid Spam results
Requirements:
1. needs to support access control
2. needs to protect sensitive content
M0137
1. see use case
Requirements:
1. needs to support strong security for many applications
M0103
1. Security needs to be more robust
Requirements:
1. needs to support security policy
M0162
1. Proprietary nature of many data very sensitive
2. Tools and procedures to help organizations wishing to deposit proprietary materials in data repositories to mask proprietary information, yet to maintain the usability of data
Requirements:
1. needs to support protection of proprietary and sensitive data
2. needs to support tools to mask proprietary information
M0176
1. Ability to “sandbox” or create independent working areas between data stakeholders
2. Policy-driven federation of datasets
Requirements:
1. needs to support sandbox as independent working areas between different data stakeholders
2. needs to support policy-driven federation of datasets
M0213
1. see use case
Requirements:
1. needs to support sensitive data that must be completely secure in transit and at rest (particularly on handhelds)
M0214
1. see use case
Requirements:
1. needs to support significant security and privacy; sources and methods cannot be compromised, and the enemy should not be able to know what we see
M0215
1. see use case
Requirements:
1. needs to support data must be protected against unauthorized access or disclosure and tampering
M0177
1. Clinicians and public health officials leverage info & knowledge gained from integrated & standardized EMR data to support direct patient care and population health.
2. Preserve privacy and confidentiality of individuals' health data, in compliance with Federal regulations [including HIPAA] and state regulations [where applicable].
3. Protect individuals' health data in accordance with providers' documented privacy practices.
4. Developing analytic models using … clinical data requires aggregations and de-identification of individuals' health data. Ensure that analytic models do NOT access or expose personally identifiable health data.
5. Prevent any disclosure of patient health data, beyond disclosures permitted by statute or by providers' documented privacy practices. Prevent security breaches of protected health data.
Requirements:
1. needs to support data consumers who may access data directly and refer to the results of analytics performed by informatics research scientists and health service researchers
2. needs to support protection of all health data in compliance with governmental regulations
3. needs to support protection of data in accordance with data providers' policies
4. needs to support security and privacy policies may be unique to a subset of the data.
5. needs to support robust security to prevent data breaches.
M0089
1. Protected health information has to be protected, public data have to be de-identified
Requirements:
1. needs to support security and privacy protection for protected health information
M0191
1. see use case
Requirements:
1. needs to support significant but optional security & privacy including secure servers and anonymization
M0078
1. Sequencing data in health records or clinical research databases must be kept secure/private, though our Consortium data is public
Requirements:
1. needs to support security and privacy protection on health records and clinical research databases
M0188
1. Data is either public or requires standard login with passwords.
2. Per website: to submit a dataset to the system, a user must create an account and log in with username and password.
3. Per website, users may request a single sign-on account.
Requirements:
1. needs to support login security - username and password (see the sketch after this list)
2. needs to support creation of user account to submit and access dataset to system via web interface
3. needs to support single sign on capability (SSO)
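A minimal sketch of salted password hashing and verification with the Python standard library (PBKDF2), for the username/password requirement; the iteration count and salt size are illustrative, and a production system would sit behind the site's actual authentication and SSO stack.

    # Minimal sketch: store a salted PBKDF2 hash instead of the password itself,
    # and verify with a constant-time comparison. Parameters are illustrative.
    import hashlib, hmac, os

    def hash_password(password, salt=None):
        salt = salt or os.urandom(16)
        digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 200_000)
        return salt, digest

    def verify_password(password, salt, digest):
        return hmac.compare_digest(hash_password(password, salt)[1], digest)

    salt, digest = hash_password("correct horse battery staple")
    print(verify_password("correct horse battery staple", salt, digest))  # True
    print(verify_password("wrong guess", salt, digest))                   # False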
M0140
1. Health records or clinical research databases must be kept secure/private.
2. Data access may differ based upon user role [physician, patient, …]
Requirements:
1. needs to support protection of health data in accordance with legal requirements (e.g., HIPAA) and privacy policies
2. needs to support security policies for different user roles.
M0174
1. Secure handling and processing of data is of crucial importance in medical domains.
Requirements:
1. needs to support secure handling and processing of data, which is of crucial importance in medical domains
M0172
1. Two dimensions. First, privacy and anonymity issues for individuals used in modeling (e.g., Twitter and Facebook users)
2. securing data and computing platforms for computation
Requirements:
1. needs to support protection of PII on individuals used in modeling
2. needs to support data protection and secure platform for computation
M0173
1. Two dimensions. First, privacy and anonymity issues for individuals used in modeling (e.g., Twitter and Facebook users)
2. securing data and computing platforms for computation.
Requirements:
1. needs to support protection of PII on individuals used in modeling
2. needs to support data protection and secure platform for computation
M0141
1. Federated identity management for mobile researchers and mobile sensors
2. access control and accounting for information on protected species, ecological information, space images, climate information.
Requirements:
1. needs to support Federated identity management for mobile researchers and mobile sensors
2. needs to support access control and accounting
M0171
1. see use case
Requirements:
1. needs to support preservation of privacy for users and digital rights for media
M0160
1. some policy for data storage security and privacy protection must be implemented
Requirements:
1. needs to support security and privacy policy
M0211
1. see use case
Requirements:
1. needs to support privacy issues in preserving anonymity of responses in spite of computer recording of access ID and reverse engineering of unusual user responses
M0190
1. see use case
Requirements:
1. needs to support security requirements for protecting sensitive data while enabling meaningful developmental performance evaluation. Shared evaluation testbeds must protect the intellectual property of analytic algorithm developers
M0130
1. see use case
2. see use case
Requirements:
1. needs to support Federate across existing authentication environments through Generic Security Service API and Pluggable Authentication Modules (GSI, Kerberos, InCommon, Shibboleth).
2. needs to support access controls on files independently of the storage location.
M0163
1. see use case
Requirements:
1. needs to support significant but optional security & privacy including secure servers and anonymization
M0189
1. see use case
Requirements:
1. needs to support multiple security & privacy requirements to be satisfied
M0166
1. keep experiment results confidential until verified and presented.
Requirements:
1. needs to support data protection
M0210
1. see use case
Requirements:
1. needs to support standard Grid authentication
M0157
1. Most of the projects follow the open data sharing policy with minor restrictions
Requirements:
1. needs to support open data policy with minor restrictions
M0167
1. Himalaya studies fraught with political issues and require UAV
2. Data itself open after initial study but could be sensitive later
Requirements:
1. needs to support security and privacy on political sensitive issues
2. needs to support dynamic security and privacy policy mechanisms
M0223
1. see use case
Requirements:
1. needs to support privacy and anonymization by aggregation
Lifecycle
General Requirement
1. needs to support data quality curation including pre-processing, data clustering, classification, reduction, format transformation
(20: M0141, M0147, M0148, M0157, M0160, M0161, M0162, M0165, M0166, M0167, M0172, M0173, M0174, M0177, M0188, M0191, M0214, M0215, M0219, M0222)
2. needs to support dynamic updates on data, user profiles, and links
(2: M0164, M0209)
3. needs to support data lifecycle and long-term preservation policy including data provenance
(6: M0141, M0147, M0155, M0163, M0164, M0165)
4. needs to support data validation
(4: M0090, M0161, M0174, M0175)
5. needs to support human annotation for data validation
(4: M0089, M0127, M0140, M0188)
6. needs to support prevention of data loss or corruption
(3: M0147, M0155, M0173)
7. needs to support multi-sites archival
(1: M0157)
8. needs to support persistent identifier and data traceability
(2: M0140, M0161)
9. needs to support standardize, aggregate, and normalize data from disparate sources
(1: M0177)
M0147
1. Maintain data “as-is”. No access and no data analytics for 75 years
2. Preserve the data at the bit-level
3. Perform curation, which includes format transformation if necessary
4. Provide access and analytics after nearly 75 years
5. cannot tolerate data loss
Requirements:
1. needs to support long-term preservation of data as-is for 75 years
2. needs to support long-term preservation at the bit-level
3. needs to support curation process including format transformation
4. needs to support access and analytics processing after 75 years
5. needs to make sure no data loss
M0148
1. Pre-process data for virus scan
2. File format identification
3. Index
4. Categorize records (sensitive, unsensitive, privacy data, etc.)
5. Transform old file formats to modern formats (e.g. WordPerfect to PDF)
Requirements:
1. needs to support pre-process for virus scan
2. needs to support file format identification
3. needs to support indexing
4. needs to support categorization of records
M0219
1. see use case
Requirements:
1. needs to support high veracity of data; systems must be very robust, and the semantic integrity of conceptual metadata concerning what exactly is measured and the resulting limits of inference remains a challenge
M0222
1. see use case
Requirements:
1. needs to support high veracity of data; systems must be very robust, and the semantic integrity of conceptual metadata concerning what exactly is measured and the resulting limits of inference remains a challenge
M0161
1. Metadata extraction from PDFs is variable
2. it’s challenging to identify duplicates
3. there’s no universal identifier system for documents or authors (though ORCID proposes to be this)
4. 90% correct metadata extraction according to comparison with CrossRef, PubMed, and Arxiv
Requirements:
1. needs to support metadata management from PDF extraction
2. needs to support identification of document duplicates (see the sketch after this list)
3. needs to support persistent identifier
4. needs to support metadata correlation between data repositories such as CrossRef, PubMed and Arxiv
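A minimal sketch of exact-duplicate identification by hashing normalized extracted text; de-duplication at this scale typically also needs fuzzy techniques (e.g., shingling/MinHash), which this sketch does not show.

    # Minimal sketch: group documents whose normalized extracted text hashes to
    # the same fingerprint. Only exact matches after normalization are caught.
    import hashlib
    from collections import defaultdict

    def fingerprint(text):
        normalized = " ".join(text.lower().split())      # collapse whitespace and case
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def find_duplicates(docs):                           # docs: {doc_id: extracted text}
        groups = defaultdict(list)
        for doc_id, text in docs.items():
            groups[fingerprint(text)].append(doc_id)
        return [ids for ids in groups.values() if len(ids) > 1]

    docs = {"a": "Big   Data use case", "b": "big data use case", "c": "something else"}
    print(find_duplicates(docs))                         # [['a', 'b']]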
M0164
1. Media and Rankings continually updated
Requirements:
1. needs to support continued ranking and updating based on user profile and analytic results
M0165
1. Average page has life of a few months
2. A lot of duplication and spam
Requirements:
1. needs to support purging data after a certain time interval (a few months)
2. needs to support data cleaning
M0162
1. Except for fundamental data on the structural and thermal properties, data quality is poor or unknown. See Munro’s NIST Standard Practice Guide.
Requirements:
1. needs to support handling of data whose quality is poor or unknown
M0176
1. Validation and UQ of simulation with experimental data of varied quality. Error checking and bounds estimation from simulation inter-comparison
2. UQ in results based on multiple datasets
Requirements:
1. needs to support validation and UQ of simulation with experimental data
2. needs to support UQ in results from multiple datasets
M0214
1. see use case
Requirements:
1. needs to support veracity of extracted objects
M0215
1. see use case
Requirements:
1. needs to support data provenance; e.g., all transfers and transformations must be tracked over the life of the data
M0177
1. Need advanced methods for normalizing patient, provider, facility and clinical concept identification within and among health care organizations
2. Veracity: Data commonly gathered using different methods and representations results in heterogeneity and systematic errors and bias.
3. Quality (syntax): EMR data subject to highly variable names and codes for the same clinical tests or measurement. When integrating many data sources, mapping local terms to a common standardized concept using a combination of probabilistic and heuristic classification methods is necessary.
Requirements:
1. needs to support standardize, aggregate, and normalize data from disparate sources
2. needs to support reduction of errors and bias
3. needs to support common nomenclature and classification of content across disparate sources. This is particularly challenging in the health IT space, as the taxonomies continue to evolve - SNOMED, ICD 9 and future ICD 10, etc.
M0089
1. High quality results validated with human annotations are essential
Requirements:
1. needs to support human annotations for validation
M0191
1. Workflow components include data acquisition, storage, enhancement, minimizing noise
Requirements:
1. needs to support workflow components include data acquisition, storage, enhancement, minimizing noise
M0188
1. Improving quality of metagenomic assembly is still a fundamental challenge. Improving the quality of reference isolate genomes, both in terms of the coverage in the phylogenetic tree, improved gene calling and functional annotation is a more mature process, but an ongoing project.
2. Data clustering and classification … Data reduction, removing redundancies through clustering, …
3. Through regular content updates, IMG aims at providing high levels of genome diversity (in terms of the number of genomes integrated into the system), annotation coverage (in terms of the breadth of the functional annotations), and data quality (in terms of the coherence of annotations)
Requirements:
1. needs to support methods to improve data quality required.
2. needs to support data clustering, classification, reduction.
3. needs to support integration of new data / content into the system's data store and annotation of data
M0140
1. Data are annotated based on domain ontologies or taxonomies. Semantics of data can vary from labs to labs
2. Provenance is important to trace the origins of the data and data quality. Semantics vary based on source.
3. Stage 1: Use the Semantic Linking for Property Values method to convert an existing data warehouse at Mayo Clinic, called the Enterprise Data Trust (EDT), into RDF triples
Requirements:
1. needs to support data annotated based on domain ontologies or taxonomies.
2. needs to support ensuring traceability of data, from origin [initial point of collection] through to use
3. needs to support conversion of data from the existing data warehouse into RDF triples (see the sketch below)
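A minimal sketch of converting a relational-style record into RDF triples, assuming rdflib; the namespace, predicates, and record values are hypothetical and do not reproduce the Semantic Linking for Property Values method named in the use case.

    # Minimal sketch: turn one warehouse-style record into RDF triples with rdflib.
    # Namespace, predicates, and values are hypothetical placeholders.
    from rdflib import Graph, Literal, Namespace, URIRef

    EX = Namespace("http://example.org/edt/")
    g = Graph()

    record = {"patient_id": "P001", "diagnosis_code": "ICD9:250.00", "age": 54}

    subject = URIRef(EX + "patient/" + record["patient_id"])
    g.add((subject, EX.diagnosisCode, Literal(record["diagnosis_code"])))
    g.add((subject, EX.age, Literal(record["age"])))

    print(g.serialize(format="turtle"))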
M0174
1. typically multiple tables need to be merged in order to perform the analysis
2. Challenging due to … human errors in data collection and validation
Requirements:
1. needs to support merging multiple tables before analysis
2. needs to support methods to validate data to minimize errors
M0172
1. Robustness of the simulation is dependent upon the quality of the model. However, robustness of the computation itself, although non-trivial, is tractable
Requirements:
1. needs to support data quality and the ability to capture traceability of quality from computation
M0173
1. Data fusion a big issue. How to combine data from different sources and how to deal with missing or incomplete data? Multiple simultaneous contagion processes
2. Checks for ensuring data consistency, corruption
3. Preprocessing of raw data for use in models
Requirements:
1. needs to support data fusion from a variety of data sources
2. needs to support data consistency checks and prevention of corruption
3. needs to support preprocessing of raw data
M0141
1. Data storage and archiving, data exchange and integration
2. data linkage: from the initial observation data to processed data and reported/visualised data…. Data lifecycle management: data provenance, referral integrity and identification
3. Processed (secondary) data serving as input for other researchers
4. Provenance (and persistent identification (PID)) control of data, algorithms, and workflows
5. Curated (authorized) reference data (i.e. species names lists), algorithms, software code, workflows
6. Some biodiversity research critically depends on data veracity (reliability/trustworthiness); in the case of natural and technogenic disasters, data veracity is critical.
Requirements:
1. needs to support data storage and archiving, data exchange and integration
2. needs to support data lifecycle management: data provenance, referral integrity and identification, and traceability back to initial observational data
3. needs to support [In addition to original source data,] processed (secondary) data may be stored for future uses
4. needs to support provenance (and persistent identification (PID)) control of data, algorithms, and workflows
5. needs to support curated (authorized) reference data (i.e. species names lists), algorithms, software code, workflows
M0160
1. Data structured in standardized formats, the overall quality is extremely high. We generate aggregated statistics, expand the features set, etc., generating high-quality derived data
Requirements:
1. needs to support standardized data structured/formats with extremely high data quality
M0163
1. see use case
Requirements:
1. needs to support integration of metadata approaches across disciplines
M0209
1. see use case
Requirements:
1. needs to support links between remote telescopes and central analysis sites
M0166
1. Huge effort to make certain complex apparatus well understood and corrections properly applied to data. Often requires data to be re-analyzed
Requirements:
1. needs to support data quality on complex apparatus
M0155
1. see use case
Requirements:
1. needs to support preservation of data and avoidance of data loss due to instrument malfunction
M0157
1. data quality is highly important
2. Data staging to mirror archives
3. metadata frameworks
4. data discovery in scattered repositories, visualization and data curation
Requirements:
1. needs to support high data quality
2. needs to support mirror archives
3. needs to support various metadata frameworks
4. needs to support scattered repositories and data curation
M0167
1. The main engineering issue is to ensure the instrument produces quality data
Requirements:
1. needs to support data quality assurance
M0127
1. see use case
2. see use case
Requirements:
1. needs to support significant human intervention in data processing pipeline
2. needs to support rich robust provenance defining complex machine/human processing
M0090
1. Validation would be necessary for the output product (correlations)
Requirements:
1. needs to support validation for output products (correlations)
Others
General Requirement
1. needs to support rich user interface from mobile platforms to access processed results
(6: M0078, M0127, M0129, M0148, M0160, M0164)
2. needs to support performance monitoring on analytic processing from mobile platforms
(2: M0155, M0167)
3. needs to support rich visual content search and rendering from mobile platforms
(13: M0078, M0089, M0161, M0164, M0165, M0166, M0176, M0177, M0183, M0184, M0186, M0219, M0223)
4. needs to support mobile device data acquisition
(1: M0157)
5. needs to support security across mobile devices
(1: M0177)
M0148
1. Mobile search must have similar interfaces/results
Requirements:
1. needs to support mobile search with interfaces/results similar to the desktop
M0219
1. see use case
Requirements:
1. needs to support mobile access
M0175
1. see use case
Requirements:
1. needs to support mobile access
M0161
1. Delivering content and services to various computing platforms from Windows desktops to Android and iOS mobile devices
Requirements:
1. needs to support content and service delivery to platforms ranging from Windows desktops to Android and iOS mobile devices
M0164
1. user experience on mobile phones
Requirements:
1. needs to support a smart interface for accessing movie content on mobile platforms
M0165
1. Mobile search must have similar interfaces/results
Requirements:
1. needs to support mobile search and rendering
M0176
1. Potential exists for widespread delivery of actionable knowledge in materials science. Many materials genomics “apps” are amenable to a mobile platform.
Requirements:
1. needs to support mobile apps to access materials genomics information
M0177
1. Mobile access is a requirement
Requirements:
1. needs to support security across mobile devices.
M0089
1. 3D visualization of 3D pathology images is not likely on mobile platforms
Requirements:
1. needs to support 3D visualization and rendering on mobile platforms
M0078
1. Physicians may need access to genomic data on mobile platforms
Requirements:
1. needs to support mobile platforms for physicians accessing genomic data
M0140
1. Physicians and patient may need access to this data on mobile platforms
Requirements:
1. needs to support mobile access
M0173
1. How and where to perform these computations? Combinations of cloud computing and clusters. How to realize the most efficient computation, and whether to move data to compute resources?
Requirements:
1. needs to support an efficient method of moving data
M0141
1. support … mobile researchers (both for information feed and catalogue search)
Requirements:
1. needs to support access by mobile users
M0160
1. Implementing low-level data storage infrastructure features to guarantee efficient, mobile access to data
Requirements:
1. needs to support low-level data storage infrastructure for efficient mobile access to data
M0155
1. see use case
Requirements:
1. needs to support real-time monitoring of equipment by partial streaming analysis (see the sketch below)
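A minimal sketch of partial streaming analysis for equipment monitoring, using a sliding window over a synthetic reading stream; the window size, threshold, and data are illustrative assumptions:

from collections import deque
from statistics import mean, stdev

def monitor(readings, window=50, threshold=4.0):
    # Yield (index, value) for readings that deviate strongly from the recent window
    recent = deque(maxlen=window)
    for i, value in enumerate(readings):
        if len(recent) >= 10:
            mu = mean(recent)
            sigma = max(stdev(recent), 1e-9)
            if abs(value - mu) > threshold * sigma:
                yield i, value
        recent.append(value)

# Synthetic stream: steady sensor values with one spike at index 100
stream = [1.0] * 100 + [25.0] + [1.0] * 100
print(list(monitor(stream)))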
M0157
1. need for efficient and high-performance mobile detectors, submersible robots, and mobile instruments
Requirements:
1. needs to support various kinds of mobile sensor devices for data acquisition
M0167
1. It is essential to monitor field data and correct instrumental problems, which implies that a portion of the data must be fully analyzed in the field
Requirements:
1. needs to support monitoring data collection instruments/sensors
M0127
1. see use case
Requirements:
1. needs to support field expedition users with a phone/tablet interface and low-resolution downloads
M0129
1. see use case
2. see use case
Requirements:
1. needs to support smartphone and tablet access
2. needs to support iRODS data management
M0186
1. see use case
Requirements:
1. needs to support phone based input and access
M0183
1. see use case
Requirements:
1. needs to support phone based input and access
M0184
1. see use case
Requirements:
1. needs to support phone based input and access
M0223
1. see use case
Requirements:
1. needs to support mobile access for clients

 
