3rd IEEE Big Data Governance and Metadata Management Workshop (BDMM 2018)

 


IEEE Big Data Governance and Metadata Management 

(BDMM 2018)

Seattle, Washington, USA

December 10 - 11, 2018

In conjunction with
IEEE Big Data 2018

Sponsored by
IEEE Brain Initiative
IEEE Big Data Technical Community
IEEE Standards Association (IEEE-SA)

Registration: Coming Soon!

 

Motivations

This workshop is aligned with the standardization effort of the IEEE Big Data Technical Community (BDTC) (see http://bigdata.ieee.org/). The BDTC standards research group is studying where there is a need and opportunity for developing IEEE Standards for Big Data metadata, its management, and governance. BDMM 2018 follows the previous successful BDMM workshops and is held in conjunction with IEEE BigData 2018 (http://cci.drexel.edu/bigdata/bigdata2018).

Big Data is a collection of data so large, so complex, so distributed, and growing so fast (the 5Vs: volume, variety, velocity, veracity, and value). It is known for unlocking new sources of economic value, providing fresh insights into the sciences, and assisting in policy making. However, Big Data is not practically consumable until it can be aggregated and integrated in a manner that a computer system can process. For instance, in the Internet of Things (IoT) environment, there is a great deal of variation in the hardware, software, coding methods, terminologies, and nomenclatures used among the data generation systems. Given the variety of data locations, formats, structures, and access policies, data aggregation is extremely complex and difficult. As a concrete example, suppose a health researcher is interested in answering a series of questions, such as: “How is the gene ‘myosin light chain 2’ associated with the chamber type hypertrophic cardiomyopathy? What is the similarity to a subset of the genes’ features? What are the potential connections among pairs of genes?” To answer these questions, one may retrieve information from familiar databases, such as the NCBI Gene database or PubMed. In the Big Data era, it is highly likely that other repositories also store relevant data. This raises two questions:

  • Is there an approach to managing such Big Data so that a single search engine can obtain all relevant information drawn from a variety of data sources and act on it as a whole?
  • How do we know if the data provided is related to the information contained in our study?
To achieve this objective, we need a mechanism that describes a digital source well enough to be understood by both humans and machines. Metadata is "data about data": descriptive information about a particular dataset, object, or resource, including how it is formatted and when and by whom it was collected. With such information, finding and working with particular instances of Big Data becomes easier. In addition, Big Data must be managed effectively; this need has partially manifested itself in new data models, commonly known as "NoSQL". The goal of this multidisciplinary workshop is to gather researchers and practitioners to discuss methodological, technical, and standards aspects of Big Data management. Papers describing original research on both theoretical and practical aspects of metadata for Big Data management are solicited.
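To make the notion of "data about data" concrete, the following minimal sketch (in Python) shows an illustrative metadata record for one of the datasets used later in this workshop's hackathon; the field names are hypothetical, loosely inspired by Dublin Core, and are not part of any proposed standard.

# A minimal, hypothetical metadata record ("data about data") for one of the
# datasets used in the hackathon below. Field names are illustrative only,
# loosely inspired by Dublin Core; they are not part of any proposed standard.
dataset_metadata = {
    "title": "Gene expression RNAseq - IlluminaHiSeq (TCGA PRAD)",
    "creator": "University of North Carolina TCGA genome characterization center",
    "identifier": "https://tcga.xenahubs.net/download/TCGA.PRAD.sampleMap/HiSeqV2.gz",
    "format": "tab-separated values, gzip-compressed",
    "description": "Gene-level transcription estimates as log2(x+1)-transformed "
                   "RSEM normalized counts",
    "subject": ["prostate cancer", "RNA-Seq", "gene expression"],
    "collected_by": None,   # when and by whom the data were collected would go here
    "rights": "publicly available",
}

# With such a record, a machine can discover and filter datasets without
# opening the data files themselves.
if "gene expression" in dataset_metadata["subject"]:
    print("Candidate dataset:", dataset_metadata["identifier"])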

 

Topics

Topics include, but are not limited to:
  • Metadata standard(s) development for Big Data management
  • Methodologies, architecture and tools for metadata annotation, discovery, and interpretation
  • Case study on metadata standard development and application
  • Metadata interoperability (crosswalk)
  • Metadata and Data Privacy
  • Metadata for Semantic Webs
  • Human Factors on Metadata
  • Innovations in Big Data management
  • Opportunities in standardizing Big Data management
  • Digital object architectures and infrastructures for Big Data management
  • Best practices and standard based persistent identifiers, data types registry structures and representations for Big Data management
  • Query languages and ontology in Big Data
  • NoSQL databases and Schema-less data modeling
  • Multimodal resource and workload management
  • Availability, reliability, and fault tolerance
  • Frameworks for parallel and distributed information retrieval
  • Domain standardization for Big Data management
  • Big Data governance for data integrity, quality, provenance, retention, asset management, and business intelligence
In addition to the accepted papers, the workshop intends to have an industry focus through a keynote speaker and hackathon challenges. The hackathon session will explore interoperable data infrastructure for Big Data governance and metadata management that is scalable and enables the findability, accessibility, interoperability, and reusability (FAIR) of heterogeneous datasets from various domains, regardless of data source and structure.

 

Paper submission instructions 

This workshop will only accept for review original papers that have not been previously published. Papers should be formatted following the IEEE Transactions journals and conferences style; the maximum allowed camera-ready paper length is ten (10) pages. Submissions must use the following formatting instructions:
8.5" x 11" x 2 (DOC, PDF, LaTex Formatting Macros)

Please use this submission site to submit your paper(s).

Accepted papers will be published in the IEEE BigData2018 proceedings (EI indexed). For further information please see IEEE BigData2018 @ http://cci.drexel.edu/bigdata/bigdata2018.

Review procedure 

All submitted papers will be reviewed by at least three members of the international program committee.

 

Hackathon: 24 Hours on Data Mashup (the Variety Problem) for Big Data Analytics  

Governance and metadata management pose unique challenges with regard to the Big Data paradigm shift. It is critical to develop interoperable data infrastructure for Big Data governance and metadata management that is scalable and enables the findability, accessibility, interoperability, and reusability (FAIR) of heterogeneous datasets from various domains, regardless of data source and structure.

Hackathon Track#1: Personalized Medicine for Drug Targeting in Prostate Cancer Patients
Submitted by Subject Matter Expert Dr. Elizabeth Chang
Research Fellow, Department of Radiation Oncology, University of Maryland Marlene and Stewart Greenebaum Comprehensive Cancer Center, USA

Problem Statement

Personalized medicine is the act of tailoring chemotherapy or drugs based on a patient’s specific set of DNA or genes. When a person is diagnosed with cancer, a variety of tests are performed (blood, DNA, urine, or tissue analysis), giving physicians a snapshot into that patient's unique set of DNA. This information allows for "smart" prescribing of medications that complement a patient's signature genetic background and achieve therapeutic response.

      

Problem: how do we find these biomarkers?

One approach: NCI’s Genomic Data Commons data portal is a huge data repository of over 32,000 patient cases, including clinical data, treatment data, and biopsy results, covering over 22,000 genes, as well as a whole host of other information. This gives researchers access to the data they need to uncover new biomarkers, find correlations between genes and survival, or investigate whatever topic interests them.

The President's Council of Advisors on Science and Technology (PCAST) believes that the convergence of scientific and clinical opportunity and public health need represented by personalized medicine warrants significant public and private sector action to facilitate the development and introduction into clinical practice of this promising class of new medical products… Based on these deliberations, PCAST determined that specific policy actions in the realm of genomics-based molecular diagnostics had the greatest potential to accelerate progress in personalized medicine. PCAST on Priorities for Personalized Medicine, September 2008

Tutorial and Hands-on (No bioscience background is needed, but willingness to work within a team is preferred)

Subject Matter Expert: Provides a Genomic Data Commons data portal overview and a hands-on exercise on the given datasets

  • Dr. Elizabeth Chang, Research Fellow, Department of Radiation Oncology, University of Maryland Marlene and Stewart Greenebaum Comprehensive Cancer Center, USA

Challenges
Note: Not all data values may be present, and some patients may have multiple records
(Click here for step-by-step instructions using Xena Browser to check questions 2 and 3)

Develop a data mashup scheme based on use cases to cross-reference different clinical and genomic datasets, and apply statistical analysis, visualization, and machine learning tools to develop predictive models for survival data, uncover new molecular biomarkers, and find correlations between genes and cancer risk. Think outside the box and come up with innovative ideas that bring more value out of the data, or choose one or more of the following (a minimal survival-analysis sketch in Python follows this list):

  1. Find how many patient cases are “Primary Tumor” samples from Dataset 2 using Column A “sample_type”.
  2. Graph overall survival (OS) by Gleason score from Dataset 2 (using Kaplan Meier plot, see Computing Environment for reference)
  3. Graph overall survival by target gene TP53 when Gleason score is 6
  4. Graph overall survival by target gene TP53 when Gleason score is categorized into 3 groups (6-7, 8, and 9-10)
  5. Repeat #4 using your own Gleason score categories that produce the best P-value (example: P-value < 0.05)
[Ultimate goal: Repeat #5, this time using all the genes in Dataset 1, and find the ones which produce a P-value <0.05]
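The following minimal Python sketch, referenced above, shows one way challenges 2–5 might be approached with the pandas and lifelines libraries; the column names OS, OS.time, and gleason_score are assumptions about the clinical matrix layout and may need to be adjusted to the actual file.

# Sketch only: the column names OS, OS.time, and gleason_score are assumptions
# about the clinical matrix layout and may need adjusting to the actual file.
import pandas as pd
import matplotlib.pyplot as plt
from lifelines import KaplanMeierFitter
from lifelines.statistics import multivariate_logrank_test

clinical = pd.read_csv("PRAD_clinicalMatrix", sep="\t", index_col=0)
clinical["gleason_score"] = pd.to_numeric(clinical["gleason_score"], errors="coerce")
clinical = clinical.dropna(subset=["OS", "OS.time", "gleason_score"])

# Challenge 4: overall survival by Gleason score in three groups (6-7, 8, 9-10).
groups = pd.cut(clinical["gleason_score"], bins=[5, 7, 8, 10],
                labels=["6-7", "8", "9-10"])

ax = plt.subplot(111)
kmf = KaplanMeierFitter()
for label in groups.cat.categories:
    mask = groups == label
    kmf.fit(clinical.loc[mask, "OS.time"], clinical.loc[mask, "OS"],
            label=f"Gleason {label}")
    kmf.plot_survival_function(ax=ax)

# Challenge 5: the log-rank test across groups gives the P-value to optimize.
result = multivariate_logrank_test(clinical["OS.time"], groups, clinical["OS"])
print("log-rank P-value:", result.p_value)
plt.show()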

Computing Environment 

Datasets 

Two publicly available datasets shall be used (a loading sketch in Python follows the list):
  1. Dataset 1: gene expression RNAseq – IlluminaHiSeq
    https://tcga.xenahubs.net/download/TCGA.PRAD.sampleMap/HiSeqV2.gz (570KB)

    The gene expression profile was measured experimentally using the Illumina HiSeq 2000 RNA Sequencing platform by the University of North Carolina TCGA genome characterization center. Level 3 data was downloaded from the TCGA data coordination center. This dataset shows the gene-level transcription estimates as log2(x+1)-transformed RSEM normalized counts. Genes are mapped onto the human genome coordinates using the UCSC Xena HUGO probeMap.

  2. Dataset 2: phenotype – Phenotypes
    https://tcga.xenahubs.net/download/TCGA.PRAD.sampleMap/PRAD_clinicalMatrix.gz (71MB)
    This dataset provides clinical data including overall survival (OS), treatment regimens, cancer staging (Gleason scores), diagnostic results, histology, pathologic staging, tumor characteristics, and much more.
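
A minimal loading sketch for the two datasets above, assuming pandas is available; Xena matrices are tab-separated, with the expression matrix assumed to be oriented genes-by-samples and the clinical matrix samples-by-attributes, which should be verified after loading.

# Sketch: download and load the two TCGA PRAD matrices listed above.
# Xena matrices are tab-separated; pandas can read the gzipped files directly.
import pandas as pd

EXPR_URL = "https://tcga.xenahubs.net/download/TCGA.PRAD.sampleMap/HiSeqV2.gz"
CLIN_URL = "https://tcga.xenahubs.net/download/TCGA.PRAD.sampleMap/PRAD_clinicalMatrix.gz"

# Dataset 1: gene-level log2(x+1) RSEM normalized counts (assumed genes x samples).
expression = pd.read_csv(EXPR_URL, sep="\t", index_col=0, compression="gzip")

# Dataset 2: clinical/phenotype matrix (assumed samples x attributes).
clinical = pd.read_csv(CLIN_URL, sep="\t", index_col=0, compression="gzip")

# Challenge 1: count "Primary Tumor" samples via the sample_type column.
print("Primary Tumor samples:", (clinical["sample_type"] == "Primary Tumor").sum())

# Cross-reference the two datasets on shared sample identifiers for the mashup.
shared = expression.columns.intersection(clinical.index)
print("Samples present in both datasets:", len(shared))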

Hackathon Track#2: Brain Data Bank on Video Gaming Enhances Cognitive Skills
Submitted by Subject Matter Expert Dr. David Ziegler
Director of Technology Program, Multimodal Biosensing, Neuroscape, University of California San Francisco

[Reference: Nature. 2013 Sep 5; 501(7465): 97–101, doi: 10.1038/nature12486]

Problem Statement

Cognitive control is defined by a set of neural processes that allow us to interact with our complex environment in a goal-directed manner. Humans regularly challenge these control processes when attempting to simultaneously accomplish multiple goals (multitasking), generating interference as the result of fundamental information processing limitations. It is clear that multitasking behaviour has become ubiquitous in today’s technologically dense world, and substantial evidence has accrued regarding multitasking difficulties and cognitive control deficits in our ageing population.

Here we show that multitasking performance, as assessed with a custom-designed three-dimensional video game (NeuroRacer), exhibits a linear age-related decline from 20 to 79 years of age. By playing an adaptive version of NeuroRacer in multitasking training mode, older adults (60 to 85 years old) reduced multitasking costs compared to both an active control group and a no-contact control group, attaining levels beyond those achieved by untrained 20-year-old participants, with gains persisting for 6 months.

These findings highlight the robust plasticity of the prefrontal cognitive control system in the ageing brain, and provide the first evidence, to our knowledge, of how a custom-designed videogame can be used to assess cognitive abilities across the lifespan, evaluate underlying neural mechanisms, and serve as a powerful tool for cognitive enhancement.

      

Tutorial and Hands-on (No neuroscience background is needed, but willingness to work within a team is preferred)

Subject Matter Experts: Provide a neuroscience overview and a hands-on exercise on the given datasets

  • Dr. David Ziegler (Tutorial), Director of Technology Program, Multimodal Biosensing, UCSF, USA
  • Dr. Seth Elkin-Frankston (Hands-on), Scientist, Cognitive Systems, Charles River Analytics Inc., USA

Challenges

Develop a data mashup scheme based on use cases to cross-reference different datasets, and apply statistical analysis, visualization, and machine learning tools to develop predictive models of what changed between “shoot only” (single-tasking) and “drive & shoot” (multi-tasking) from the EEG (electroencephalography) signals. Think outside the box and come up with innovative ideas that bring more value out of the data, or choose one or more of the following:

Beginner Challenge Questions

  1. What are the strengths vs. limitations of the EEG technology? How do consumer-level EEG headsets compare to laboratory-grade equipment?
  2. What are the realistic EEG applications in daily life (automatic driving, interactive games, Internet Marketing, etc.)? Provide convincing prototypes (virtual or real).
  3. Try to conduct an event-related potential (ERP) analysis of the data in one or more conditions. How does this approach compare to that used in the Nature paper (i.e., ERSP, Event-Related Spectral Perturbation, or time-frequency analysis)? Hint: check out the EEGLab and FieldTrip tutorials (a minimal MNE-Python sketch follows this list)
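For teams without prior EEG experience, here is a minimal MNE-Python sketch of the ERP-style workflow mentioned in question 3; the event extraction, filter settings, and epoch window are illustrative placeholders and are not taken from the NeuroRacer task definition.

# Sketch: a basic ERP-style analysis of one BioSemi .bdf recording using
# MNE-Python. Event codes, filter settings, and the epoch window are
# illustrative placeholders, not values from the NeuroRacer task.
import mne

raw = mne.io.read_raw_bdf("xxxx_SO_1.bdf", preload=True)   # one "shoot only" run
raw.filter(l_freq=0.1, h_freq=40.0)                        # band-pass for ERPs

# Read trigger events from the recording's status channel and epoch around them.
events = mne.find_events(raw)
epochs = mne.Epochs(raw, events, tmin=-0.2, tmax=0.8,
                    baseline=(None, 0), preload=True)

# Averaging the epochs yields the event-related potential (ERP).
evoked = epochs.average()
evoked.plot()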
Advanced Challenge Questions

  1. Try conducting an ICA decomposition analysis of the data (Hint: this is best done in EEGLab; see also the sketch after this list). How does this approach compare to that used in the Nature paper or the ERP analysis suggested above? What new information can we learn using this approach?
  2. Would a micro-state analysis be appropriate for the data? What new knowledge might we learn from such an approach?
  3. What advanced methods (e.g., deep learning, but also others) are available that would help predict post-game performance? Specifically by what mechanisms and by how much?
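As referenced in question 1, a corresponding ICA sketch in MNE-Python (the equivalent of EEGLab's runica workflow); the number of components and the high-pass filter setting are illustrative only.

# Sketch: ICA decomposition of one recording in MNE-Python (the equivalent of
# EEGLab's runica workflow). The number of components and the high-pass filter
# are illustrative placeholders.
import mne
from mne.preprocessing import ICA

raw = mne.io.read_raw_bdf("xxxx_DS_1.bdf", preload=True)   # one "drive and shoot" run
raw.filter(l_freq=1.0, h_freq=None)       # high-pass filtering helps ICA converge

ica = ICA(n_components=20, random_state=0)
ica.fit(raw)

print(ica)                                # summary of the fitted decomposition
# ica.plot_components()                   # scalp maps (requires channel positions)
# ica.exclude = [...]                     # mark artifactual components, then
# raw_clean = ica.apply(raw.copy())       # reconstruct the data without them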
Computing Environment

  • Languages: MATLAB, Python, C++.
  • EEG Analysis & Visualization tools (all three are free and have excellent tutorials):
    • FieldTrip
      • MATLAB toolbox for M/EEG analysis
      • Largely command-line functionality
      • Particular emphasis on analyses in the time-frequency domain
    • EEGLAB
      • Interactive MATLAB toolbox for processing M/EEG data
      • GUI and command line options
      • Particular emphasis on ICA methods for decomposing data and extracting meaningful components
    • MNE
      • Open-source Python software for visualizing and analyzing M/EEG data
      • Particularly good for source-localization analysis
    • Cartool
Datasets

A sample subset dataset, brain_sample.zip (under Documentation: one subject, 330MB, freely available), and the full dataset (49 subjects, 17GB; simple registration is required via 'Login') can be downloaded from the IEEE DataPort. The datasets contain paired “single-tasking” and “multi-tasking” recordings with the following sets of files:

  1. Dataset 1 (group of): xxxx_DS_n.bdf where DS = “drive and shoot” or multi-tasking and n = 1,2,3, etc.
  2. Dataset 2 (group of): xxxx_SO_n.bdf where SO = “shoot only” or single-tasking and n = 1,2,3, etc.
  3. Dataset 3 (group of): xxxxB_DS_n.bdf where DS = “drive and shoot” or multi-tasking and B = POST training and n = 1,2,3, etc.
  4. Dataset 4 (group of): xxxxB_SO_n.bdf where SO = “shoot only” or single-tasking and B = POST training and n = 1,2,3, etc.

All data were recorded with a BioSemi 64-channel system (.bdf extension), and each .bdf file is about 40MB. Files with the same “xxxx” (subject name) that have the ending ‘B’ are the POST-training EEG files for a given subject. Note that not all participants have both PRE and POST recordings (but most do).
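
A minimal Python sketch, under the naming convention described above, for indexing the downloaded .bdf files by subject, PRE/POST session, condition, and run number; the directory name is a placeholder for wherever the files are unpacked.

# Sketch: build an index of the .bdf files from the naming convention above
# (xxxx[B]_DS|SO_n.bdf). The directory path is a placeholder.
import glob
import os
import re

PATTERN = re.compile(r"^(?P<subject>.+?)(?P<post>B)?_(?P<cond>DS|SO)_(?P<run>\d+)\.bdf$")

index = []
for path in glob.glob("brain_sample/*.bdf"):
    m = PATTERN.match(os.path.basename(path))
    if not m:
        continue
    index.append({
        "subject": m.group("subject"),
        "session": "POST" if m.group("post") else "PRE",
        "condition": "multi-tasking" if m.group("cond") == "DS" else "single-tasking",
        "run": int(m.group("run")),
        "path": path,
    })

for rec in index:
    print(rec)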

Hackathon Team, Computing Environment, and Implementation White Paper

All participants must be registered via the IEEE Big Data 2018 registration and attend in person. You may register as a team (up to four per team) or as an individual (we will place you on a team). Each participant brings his or her own laptop with all the necessary computing tools. No remote computing resources are allowed. All implementations must be original work. Participating teams are encouraged to submit their implementation approach as a white paper, which will be published as part of the IEEE Big Data Governance and Metadata Management publication three months after the hackathon event.

Evaluation Team 

  • David Belanger, Chair of IEEE Big Data Technical Community, Stevens Institute of Technology
  • Mahmoud Daneshmand, Vice-Chair of BDGMM, Stevens Institute of Technology
  • Kathy Grise, Senior Program Director, Future Directions, IEEE
  • Joan Woolery, Senior Project Manager, Industry Connections, IEEE Standards Association, IEEE
  • Cherry Tom, Emerging Technologies Initiatives Manager, IEEE Standards Association, IEEE
  • Elizabeth Chang, Research Fellow, Department of Radiation Oncology, University of Maryland Marlene and Stewart Greenebaum Comprehensive Cancer Center, USA
  • David Ziegler, Director of Technology Program, Multimodal Biosensing, Neuroscape, University of California San Francisco, USA
  • Seth Elkin-Frankston, Scientist, Cognitive Systems, Charles River Analytics Inc, USA

Evaluation Criteria 

Technical Approach (40 pts)
- Data mashup (20)
- Big Data analytics (20)

Novelty (40 pts)
- Creativity (20)
- Efficiency (20)

Results (20 pts)
- Output content (10)
- Output format (10)

Winners
- IEEE Certificates for 1st, 2nd, 3rd winners
- All team members win a t-shirt

 

Important Dates

Oct 1, 2018: Paper submission deadline

Nov 1, 2018: Paper acceptance notification

Nov 15, 2018: Camera ready version due

Dec 3, 2018: Deadline for hackathon sign-up

Dec 10, 2018: Hackathon

Mar 11, 2019: Hackathon White Paper due

 

Program Schedules 

Day-1: Dec 10, 2018

Time            Topic
08:00 – 08:10   Welcome, Wo Chang, Chair of IEEE BDGMM, NIST, USA
08:10 – 08:20   Opening Remarks, David Belanger, Chair of IEEE Big Data Technical Community, Stevens Institute of Technology
08:20 – 10:00   Hackathon briefing on use cases, datasets, challenges, and Q&A (tracks run in parallel)
                Hackathon Track#1: Personalized Medicine for Drug Targeting in Prostate Cancer Patients
                Dr. Elizabeth Chang (Tutorial & Hands-on), Research Fellow, Department of Radiation Oncology, University of Maryland Marlene and Stewart Greenebaum Comprehensive Cancer Center, USA
                Hackathon Track#2: Brain Data Bank on Video Gaming Enhances Cognitive Skills
                Dr. David Ziegler (Tutorial), Director of Technology Program, Multimodal Biosensing, UCSF, USA
                Dr. Seth Elkin-Frankston (Hands-on), Scientist, Cognitive Systems, Charles River Analytics Inc., USA
10:00 – 08:00   Solving hackathon challenges (until 08:00 the next day)
09:00 – 12:00   Hackathon presentation and evaluation (next day; see Evaluation Team & Criteria)

Day-2: Dec 11, 2018

Time            Topic
09:00 – 12:00   Hackathon Evaluation, Evaluation Team
13:00 – 13:10   Welcome, Wo Chang, Chair of IEEE BDGMM, NIST
13:10 – 13:30   Opening Remarks, David Belanger, Chair of IEEE Big Data Technical Community, Stevens Institute of Technology
13:30 – 14:00   Keynote Speaker: Topic TBD; Speaker TBD
14:00 – 14:20   Invited Speaker on Big Data Governance Management; Speaker TBD
14:20 – 14:40   Invited Speaker on Big Data Metadata Management; Speaker TBD
14:40 – 15:00   Invited Speaker: Special Topic; Speaker TBD
15:00 – 15:20   Coffee Break
15:20 – 15:30   Paper Presentation #1, Authors...
15:30 – 15:40   Paper Presentation #2, Authors...
15:40 – 15:50   Paper Presentation #3, Authors...
15:50 – 16:00   Paper Presentation #4, Authors...
16:00 – 16:10   Paper Presentation #5, Authors...
16:10 – 16:20   Paper Presentation #6, Authors...
16:20 – 16:30   Paper Presentation #7, Authors...
16:30 – 16:40   Paper Presentation #8, Authors...
16:40 – 16:50   Paper Presentation #9, Authors...
16:50 – 17:20   Hackathon Ceremony, David Belanger and Kathy Grise
17:20 – 17:30   Announcement of the next BDGMM event, Wo Chang

 

Hackathon Organizers 

General Co-Chairs

Wo Chang
Digital Data Advisor
National Institute of Standards and Technology, USA
Convenor, ISO/IEC JTC 1/WG 9 Working Group on Big Data
Chair, IEEE Big Data Governance and Metadata Management
Email: chang@nist.gov

David Belanger (PhD)
Chair of IEEE Big Data Technical Community
Stevens Institute of Technology
Email: dgb@ieee.org

Mahmoud Daneshmand (PhD)
Professor, Stevens Institute of Technology, USA
Co-Chair, IEEE Big Data Governance and Metadata Management
Co-founder, IEEE BDIs
Email: mahmoud.daneshmand@gmail.com

Program Co-Chairs

Kathy Grise
Senior Program Director, Future Directions, IEEE Technical Activities, USA
Email: k.l.grise@ieee.org

Yinglong Xia (PhD)
Huawei Research America, USA
Co-chair, IEEE BDI - Big Data Management Standardization
Email: yinglong.xia.2010@ieee.org

Publicity Chairs

Cherry Tom
Emerging Technologies Intelligence Manager
IEEE Standards Association
445 Hoes Lane, Piscataway, NJ 08854-4141
Email: c.tom@ieee.org

 

Technical Program Committee

Name, Organization, Country
Paventhan Arumugam, ERNET, India
Claire Austin, S&T Strategies, Environment & Climate Change Canada, Canada
Ismael Caballero, UCLM, Spain
Yue-Shan Chang, National Central University, Taiwan
Periklis Chatzimisios, Department of Informatics, Alexander TEI of Thessaloniki, Greece
Hung-Ming Chen, National Taichung University of Science and Technology, Taiwan
Miyuru Dayarathna, WSO2 Inc., Sri Lanka
Jacob Dilles, Acuant Corp., USA
Robert Hsu, Chung Hua University, Taiwan
Wei Hu, Nanjing University, China
Carson Leung, University of Manitoba, Canada
Sian Lun Lau, Sunway University, Malaysia
Christian Camilo Urcuqui López, Icesi University, Colombia
Neil Miller, Bioinformatics, Children's Mercy Hospital, USA
Jinghua Min, China Electronic Cyberspace Great Wall Co., Ltd., China
Carlos Monroy, Rice University, USA
Huansheng Ning, USTB, China
Arindam Pal, TCS Research, India
Lijun Qian, Prairie View A&M University, USA
Weining Qian, East China Normal University, China
Yufei Ren, IBM, USA
Robby Robson, Eduworks Corporation, USA
Angelo Simone Scotto, European Food Safety Authority, Italy
Priyaa Thavasimani, Newcastle University, UK
Alex Thomo, University of Victoria, Canada
Chonggang Wang, InterDigital Communications, USA
Jianwu Wang, University of Maryland, Baltimore County, USA
Shu-Lin Wang, National Taichung University of Science and Technology, Taiwan
Jens Weber, University of Victoria, Canada
Lingfei Wu, IBM Research, USA
Hao Xu, University of North Carolina at Chapel Hill, USA
Godwin Yeboah, University of Warwick, UK
Tim Zimmerlin, Automation Technologies, USA