IEEE Workshop on Big Data Metadata and Management (BDMM ’2017)


Boston, MA, USA

Dec 11-12, 2017

Dec 11 -- Hackathon

Dec 12 -- Workshop

In conjunction with IEEE Big Data 2017

Sponsored by IEEE Big Data Initiative (BDI)

Hackathon Registration: CLOSED



This workshop is aligned with the standardization effort of the IEEE Big Data Initiative (BDI). The BDI standards research group is studying where there is a need and opportunity to develop IEEE Standards for Big Data metadata, its management, and governance. BDMM 2017 follows the successful BDMM workshop held at IEEE BigData 2016.

Big Data refers to data collections so large, complex, distributed, and fast-growing that they are commonly characterized by the 5Vs: volume, variety, velocity, veracity, and value. It is known for unlocking new sources of economic value, providing fresh insights into the sciences, and assisting in policy making. However, Big Data is not practically consumable until it can be aggregated and integrated in a manner that a computer system can process. For instance, in the Internet of Things (IoT) environment, there is a great deal of variation in the hardware, software, coding methods, terminologies, and nomenclatures used among data generation systems. Given the variety of data locations, formats, structures, and access policies, data aggregation has been extremely complex and difficult. As a concrete example, suppose a health researcher is interested in finding answers to a series of questions, such as “How is the gene ‘myosin light chain 2’ associated with the chamber type hypertrophic cardiomyopathy? What is the similarity to a subset of the genes’ features? What are the potential connections among pairs of genes?” To answer these questions, the researcher may retrieve information from databases they already know, such as the NCBI Gene database or the PubMed database. In the Big Data era, however, it is highly likely that other repositories also store relevant data. This raises two questions:

  • Is there an approach to managing such Big Data so that a single search engine can retrieve all relevant information from a variety of data sources and present it as a whole?
  • How do we know whether the data provided is related to the information in our study?
To achieve this objective, we need a mechanism that describes a digital source well enough for it to be understood by both humans and machines. Metadata is "data about data": descriptive information about a particular dataset, object, or resource, including how it is formatted and when and by whom it was collected. With this information, finding and working with particular instances of Big Data becomes easier. In addition, Big Data must be managed effectively, a need that has partly manifested in data models known as “NoSQL”. The goal of this multidisciplinary workshop is to gather researchers and practitioners to discuss methodological, technical, and standards aspects of Big Data management. Papers describing original research on both theoretical and practical aspects of metadata for Big Data management are solicited.
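As a minimal sketch of what "data about data" could look like in practice, the Python snippet below models a metadata record covering the attributes mentioned above (format, collection date, collector) and serializes it to JSON so that both people and programs can read it. The class name, field names, and sample values are illustrative assumptions and are not taken from any IEEE or BDI standard.

```python
import json
from dataclasses import dataclass, asdict, field


@dataclass
class DatasetMetadata:
    """Hypothetical metadata record; all field names are illustrative."""
    identifier: str      # hypothetical local identifier, not a real PID
    title: str
    data_format: str     # how the data is formatted
    collected_on: str    # when it was collected (ISO 8601 date)
    collected_by: str    # by whom it was collected
    keywords: list = field(default_factory=list)


def matches(record, term):
    """Return True if the term occurs in the title or keywords (case-insensitive)."""
    term = term.lower()
    return term in record.title.lower() or any(
        term in keyword.lower() for keyword in record.keywords
    )


record = DatasetMetadata(
    identifier="example-0001",
    title="Gene expression measurements (sample)",
    data_format="text/csv",
    collected_on="2017-06-01",
    collected_by="Example Lab",
    keywords=["gene", "cardiomyopathy"],
)

# Machine-readable form of the same description.
serialized = json.dumps(asdict(record), indent=2)
print(serialized)
print(matches(record, "Cardiomyopathy"))
```

A search service could apply a check like `matches` across records from many repositories without opening the underlying data, which is the kind of discovery the two questions above ask for.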



Topics include, but are not limited to:
  • Metadata standard(s) development for Big Data management
  • Methodologies, architecture and tools for metadata annotation, discovery, and interpretation
  • Case study on metadata standard development and application
  • Metadata interoperability (crosswalk)
  • Metadata and Data Privacy
  • Metadata for Semantic Webs
  • Human Factors on Metadata
  • Innovations in Big Data management
  • Opportunities in standardizing Big Data management
  • Digital object architectures and infrastructures for Big Data management
  • Best practices and standard based persistent identifiers, data types registry structures and representations for Big Data management
  • Query languages and ontology in Big Data
  • NoSQL databases and Schema-less data modeling
  • Multimodal resource and workload management
  • Availability, reliability and Fault tolerance
  • Frameworks for parallel and distributed information retrieval
  • Domain standardization for Big Data management
  • Big Data governance for data integrity, quality, provenance, retention, asset management, and business intelligence
In addition to the accepted papers, the workshop intends to have an industry focus through a keynote speaker and hackathon challenges. The hackathon session will explore interoperable data infrastructure for Big Data Governance and Metadata Management that is scalable and can enable the Findability, Accessibility, Interoperability, and Reusability between heterogeneous datasets from various domains without worrying about data source and structure.


Paper submission instructions 

This workshop will only accept for review original papers that have not been previously published. Papers should be formatted based on the IEEE Transactions journals and conferences style; the maximum allowed camera-ready paper length is ten (10) pages. Submissions must use the following formatting instructions:
8.5" x 11" x 2 (DOC, PDF, LaTeX Formatting Macros)

Please use this submission site to submit your paper(s).

Accepted papers will be published in the IEEE BigData2017 proceedings (EI indexed). For further information please see IEEE BigData2017 @

Review procedure 

All submitted papers will be reviewed by three members of the international program committee.


Hackathon: 24 hours on Data Mashup (Varieties Problem) Big Data Analytics  
Hackathon Registration:

Governance and metadata management pose unique challenges with regard to the Big Data paradigm shift. It is critical to develop an interoperable data infrastructure for Big Data Governance and Metadata Management that is scalable and enables Findability, Accessibility, Interoperability, and Reusability across heterogeneous datasets from various domains, regardless of data source and structure.

Problem – Healthcare Fraud Detection

Large amounts of healthcare data are produced continually and stored in different databases. The wide adoption of electronic health records has increased the amount of available data exponentially. Nevertheless, healthcare providers have been slow to leverage this data to improve the healthcare system, increase efficiency, or reduce the overall cost of healthcare.

Health care data has the potential to innovate the procedure of healthcare delivery in the US and inform healthcare providers about the most efficient and effective treatments. Value-based healthcare programs will provide incentives to both healthcare providers and insurers to explore new ways to leverage healthcare data to measure the quality and efficiency of care.

It is estimated that approximately $75B to $265B of US healthcare spending is lost each year to healthcare fraud¹. Given the scale of this fraud, the importance of identifying fraud and abuse in healthcare cannot be ignored; healthcare providers must develop automated systems to identify fraud, waste, and abuse and to reduce their harmful impact on their business.

¹ White SE. Predictive modelling 101. How CMS’s newest fraud prevention tool works and what it means for providers. J AHIMA. 2011;82(9):46–47.

Hackathon Challenges

Develop a data mashup scheme to cross-reference different healthcare datasets [1] [2] [3], then apply statistical analysis, visualization, and machine learning tools to build predictive models for healthcare payment data, detect irregularities, and help prevent healthcare payment fraud. Think outside the box and come up with innovative ideas that bring more value out of the data, or choose one or more of the following to solve:

  1. How many physicians from each state?
  2. How many specializations out of how many physicians?
  3. Map anomalies or missing data across the country or within states or counties or electoral districts
  4. Correlate anomalies with research funding of the respective conditions
  5. Identify counties/ hospitals/ suppliers etc. with most or least anomalies
  6. List top 5 anomalies with probability ranking and wrong charges statistics
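One possible starting point for the anomaly-ranking challenge is sketched below in Python. It scores each payment by how far it lies from the mean payment for the same procedure, measured in standard deviations (a z-score), and returns the highest-scoring records. The z-score is only a simple stand-in chosen for illustration; it is not a probability, and it is not a method prescribed by the hackathon. The record layout `(provider_id, procedure_code, payment)` and all sample values are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def rank_anomalies(records, top_n=5):
    """Rank payments by absolute z-score within their procedure code.

    records: list of (provider_id, procedure_code, payment) tuples
    (a hypothetical layout, not the CMS file format).
    """
    payments_by_code = defaultdict(list)
    for _, code, payment in records:
        payments_by_code[code].append(payment)

    # Mean and population standard deviation per procedure code.
    stats = {
        code: (mean(values), pstdev(values))
        for code, values in payments_by_code.items()
    }

    scored = []
    for provider, code, payment in records:
        mu, sigma = stats[code]
        if sigma == 0:
            continue  # every payment for this code is identical
        scored.append((abs(payment - mu) / sigma, provider, code, payment))

    scored.sort(reverse=True)  # largest deviation first
    return scored[:top_n]


sample = [
    ("P1", "A", 100.0), ("P2", "A", 105.0), ("P3", "A", 95.0),
    ("P4", "A", 100.0), ("P5", "A", 400.0),
    ("P6", "B", 50.0), ("P7", "B", 52.0), ("P8", "B", 48.0),
]

top = rank_anomalies(sample)
for score, provider, code, payment in top:
    print(f"{provider} {code} {payment:.2f} z={score:.2f}")
```

In this toy sample the payment of 400.00 for procedure "A" ranks first. On real data, a more robust statistic (for example, one based on the median) or a trained model would likely be needed, since a single extreme value inflates the standard deviation it is measured against.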
Hackathon Team, Computing Environment, and Implementation White Paper

All participants must be registered via the IEEE Big Data main conference website and attend in person. You may register as a team (up to four members per team) or as an individual (we will place you on a team). Each participant must bring their own laptop with all the necessary computing tools. No remote computing resources are allowed. All implementations must be original work. Participating teams are encouraged to submit their implementation approach as a white paper, which will be published as part of the IEEE Big Data Governance and Metadata Management publication three months after the hackathon event.


  1. Medicare Provider Utilization and Payment Data - Physician and Other Supplier from Centers for Medicare & Medicaid Services (CMS) - Physician’s Billing: (9 million records; ~500MB compressed; ~1.7 GB uncompressed)
  2. National Physician Identifiers (NPI) from CMS - Physician Identifiers: (~600 MB)
  3. Health Care Provider Taxonomy Code Set CSV from National Uniform Claim Committee, American Medical Association - Physician Specialization: (~400 KB)
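The cross-referencing of the three datasets above can be sketched as a pair of key lookups: billing records link to provider identifiers through the NPI, and provider identifiers link to specializations through the taxonomy code. The Python example below uses tiny in-memory CSV stand-ins; the column names (`npi`, `state`, `payment`, `taxonomy_code`, `code`, `specialization`) and all values are simplified assumptions and do not reproduce the real file layouts.

```python
import csv
import io
from collections import Counter

# Hypothetical stand-ins for datasets [1], [2], and [3].
BILLING_CSV = """npi,state,payment
1001,MA,120.50
1001,MA,80.00
1002,NY,300.00
1003,MA,45.25
"""
NPI_CSV = """npi,taxonomy_code
1001,T01
1002,T02
1003,T01
"""
TAXONOMY_CSV = """code,specialization
T01,Cardiology
T02,Dermatology
"""


def read_rows(text):
    """Parse CSV text into a list of dictionaries keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def mash_up(billing, npi_rows, taxonomy_rows):
    """Attach a specialization to each billing row via NPI and taxonomy code."""
    code_by_npi = {row["npi"]: row["taxonomy_code"] for row in npi_rows}
    spec_by_code = {row["code"]: row["specialization"] for row in taxonomy_rows}
    merged = []
    for row in billing:
        code = code_by_npi.get(row["npi"])  # None if the NPI is unknown
        merged.append({
            **row,
            "payment": float(row["payment"]),
            "specialization": spec_by_code.get(code),
        })
    return merged


def physicians_per_state(merged):
    """Count distinct NPIs per state (challenge 1)."""
    unique = {(row["npi"], row["state"]) for row in merged}
    return Counter(state for _, state in unique)


merged = mash_up(read_rows(BILLING_CSV), read_rows(NPI_CSV), read_rows(TAXONOMY_CSV))
per_state = physicians_per_state(merged)
specializations = {row["specialization"] for row in merged if row["specialization"]}
print(dict(per_state), sorted(specializations))
```

Rows whose NPI or taxonomy code has no match keep a specialization of `None`, so missing data stays visible for later analysis. For the full-size files, a streaming or database-backed join would likely be more practical than loading everything into memory.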

Evaluation Team 

  • David Belanger, Chair of IEEE Big Data Technical Community, Stevens Institute of Technology
  • Mahmoud Daneshmand, Vice-Chair of BDGMM, Stevens Institute of Technology
  • Kathy Grise, Senior Program Director, Future Directions, IEEE
  • Joan Woolery, Senior Project Manager, Industry Connections, IEEE Standards Association, IEEE
  • Cherry Tom, Emerging Technologies Initiatives Manager, IEEE Standards Association, IEEE
  • Robby Robson, Member of IEEE Standards Association Standards Board, CEO, Eduworks Corporation

Evaluation Criteria 

Technical Approach (40 pts)
- Data mashup (20)
- Big Data analytics (20)

Novelty (40 pts)
- Creativity (20)
- Efficiency (20)

Results (20 pts)
- Output content (10)
- Output format (10)

- 1st Place: $2,000*
- 2nd Place: $1,000*
- 3rd Place: $500*

- All team members win a t-shirt
* At the discretion of the Evaluation Team


Important Dates

Nov 10, 2017: Due date for full workshop paper submission 

Nov 15, 2017: Notification of paper acceptance to authors 

Nov 20, 2017: Camera-ready of accepted papers 

Dec 1, 2017: Deadline for hackathon sign-up

Dec 11, 2017: Hackathon

Dec 12, 2017: Workshop

Mar 12, 2018: Due date for Hackathon White Paper


Program Schedules 

Day-1: December 11, 2017

08:00 – 08:10 Welcome, Wo Chang, Chair of IEEE BDGMM, NIST
08:10 – 08:20 Opening Remark, David Belanger, Chair of IEEE Big Data Technical Community, Stevens Institute of Technology
08:20 – 10:00 Briefing about the use case, datasets, challenges, Q/As, Wo Chang
10:00 – till next day 08:00 Solving hackathon challenges
Next day 08:00 – 09:00 Evaluation, See Team & Criteria
Day-2 Late Afternoon Award Ceremony

Day-2: December 12, 2017

14:00 – 14:10 Welcome, Wo Chang, Chair of IEEE BDGMM, NIST
14:10 – 14:30 Opening Remark, David Belanger, Chair of IEEE Big Data Technical Community, Stevens Institute of Technology
14:30 – 15:00 Keynote Speaker: Digital Object Architecture

Larry Lannom, Vice President of Corporation for National Research Initiatives (CNRI)

15:00 – 15:30 Invited Speaker: Managing Big Time Series & Text Data for Unsupervised Feature Representation Learning

Lingfei Wu, Research Staff Member of IBM AI Foundations Lab, IBM T. J. Watson Research Center, US

15:30 – 15:50 Invited Speaker: Towards FAIR Open Science with PID Kernel Information: The RPID Testbed

Yu Luo, Research Assistant of Data To Insight Center, Indiana University Bloomington. Bloomington, Indiana, US

15:50 – 16:05 Why-Diff: Explaining Differences amongst Similar Workflow Runs by exploiting Scientific Metadata

Priyaa Thavasimani, Jacek Cala, and Paolo Missier

16:05 – 16:25 Coffee Break
16:25 – 16:40 Case: Big Geosciences Data Validation Challenges and Achievements

Hussain Alajmi

16:40 – 16:55 Deep Learning for Big Data Analytics: A Review from Fog and Edge Computing Perspective

Swarnava Dey and Arijit Mukherjee

16:55 – 17:20 Hackathon Ceremony

David Belanger and Kathy Grise

17:20 – 17:30 Announcement for next BDGMM Event, Wo Chang

Keynote Speaker 

Larry Lannom, Director of Information Services and Vice President of Corporation for National Research Initiatives (CNRI), US
Mr. Larry Lannom is Director of Information Services and Vice President at the Corporation for National Research Initiatives (CNRI), where he works with organizations in both the public and private sectors to develop experimental and pilot applications of advanced networking and information management technologies.

In addition to his activities at CNRI, Mr. Lannom serves as Co-chair of the U.S. Branch of the Research Data Alliance, as a member of the RDA Technical Advisory Board, as a member of the National Data Service Technical Advisory Council, and as a member of the U. S. Treasury Office of Financial Research (OFR) Financial Research Advisory Committee.

Mr. Lannom joined CNRI in September of 1996. Prior to that, he was a Technical Director at DynCorp, Inc., where he served as an advisor on digital library research for the ISTO, CSTO, and ITO offices of the U.S. Defense Advanced Research Projects Agency (DARPA), including initiating the Computer Science Technical Reports (CS-TR) project, DARPA's first effort in the digital library area. In addition, he managed the development of internal information systems for DARPA.

Invited Speaker 

Lingfei Wu, Research Staff Member of IBM AI Foundations Lab, IBM T. J. Watson Research Center, US
Abstract: Learning effective representations is a key foundation for numerous machine learning and data mining techniques in time-series and NLP applications. Although a number of feature representation methods, including kernel methods and deep learning approaches, have been proposed in each domain, the effectiveness and efficiency of most methods are still challenged by either a limited amount of labeled data or high computational complexity. In this talk, I will introduce a generic framework to generate vector representations of time-series and text. To this end, we first construct a family of positive definite (p.d.) alignment-aware time-series or text kernels, guided by a new methodology for transforming a distance metric into a positive-definite kernel. Then we present novel time-series and text embeddings (RWS and WME), random features methods for the proposed p.d. kernels that learn an unsupervised representation for time-series and text data. Extensive experiments on real-world time-series and text classification tasks demonstrate that RWS and WME can outperform or match current state-of-the-art methods in terms of both testing accuracy and runtime in each domain.

Dr. Lingfei Wu is a Research Staff Member of the IBM AI Foundations Lab at the IBM T. J. Watson Research Center. He earned his Ph.D. degree in computer science from the College of William and Mary in August 2016, under the supervision of Prof. Andreas Stathopoulos. His research interests mainly span large-scale machine learning, scalable data mining, big data analytics, numerical linear algebra, and high-performance mathematical software.

Yu Luo, Research Assistant of Data To Insight Center, Indiana University Bloomington. Bloomington, Indiana, US
Abstract: Using persistent identifiers (PIDs) to identify digital data products, whether a product is a collection, a file, or an object of some type, is a good practice in open science. A persistent identifier ensures that a 1:1 relationship between identifier and data product persists into the future. Naming solutions for digital data products eventually resolve a PID down to the digital object it identifies, but the current landscape is limited by multiple solutions with weak interoperability and inconsistent protocols for getting from PID to data object. In a world of increasing PID use, we will soon be awash with billions of PIDs that all resolve to digital objects using various inconsistent and unpredictable approaches, making it difficult to build higher-level services that cross the various approaches. In this talk, I will introduce the RPID Testbed, which uses the Handle System and the Data Type Registry Service to generate PIDs for digital objects. The PIDs will be assigned as Handles of specified data types in the Data Type Registry (DTR) and relative values. Data types in the DTR provide a way of easily registering detailed and structured descriptions of data and reusing them in different PIDs. PIDs contain limited information (PID Kernel Information) connected to the FAIR Principles: findable, accessible, interoperable, and reusable. The RPID Testbed aims to identify and evaluate the minimal information that can go into Kernel Information to help make Data Objects FAIR and less dependent on the repository system to enforce FAIRness.

Yu Luo is a Research Assistant at the Data To Insight Center at Indiana University Bloomington. He is working on his Ph.D. degree in computer science at Indiana University, under the supervision of Prof. Beth Plale. His research interests mainly span persistent identifiers, provenance, and data management.


Workshop Organizers 

General Co-Chairs

Alex Mu-Hsing Kuo (PhD)
University of Victoria, Canada
Leader, IEEE Big Data Education Tracks
Co-chair, IEEE BDI - Big Data Management Standardization

Mahmoud Daneshmand (PhD)
Professor, Stevens Institute of Technology, USA
Co-Chair, IEEE Big Data Governance and Metadata Management
Co-founder, IEEE BDIs

Wo Chang
Digital Data Advisor
National Institute of Standards and Technology, USA
Convenor, ISO/IEC JTC 1/WG 9 Working Group on Big Data
Chair, IEEE Big Data Governance and Metadata Management

Program Co-Chairs

Kathy Grise
Senior Program Director, Future Directions, IEEE Technical Activities, USA

Yinglong Xia (PhD)
Huawei Research America, USA
Co-chair, IEEE BDI - Big Data Management Standardization

Publicity Chairs

Cherry Tom
Emerging Technologies Intelligence Manager
IEEE Standards Association
445 Hoes Lane, Piscataway, NJ 08854-4141


Technical Program Committee

Name | Organization | Country
Paventhan Arumugam | ERNET | India
Claire Austin | S&T Strategies, Environment & Climate Change Canada | Canada
Ismael Caballero | UCLM | Spain
Yue-Shan Chang | National Central University | Taiwan
Periklis Chatzimisios | Department of Informatics, Alexander TEI of Thessaloniki | Greece
Hung-Ming Chen | National Taichung University of Science and Technology | Taiwan
Miyuru Dayarathna | WSO2 Inc. | Sri Lanka
Jacob Dilles | Acuant Corp. | US
Robert Hsu | Chung Hua University | Taiwan
Wei Hu | Nanjing University | China
Carson Leung | University of Manitoba | Canada
Sian Lun Lau | Sunway University | Malaysia
Christian Camilo Urcuqui López | Icesi University | Colombia
Neil Miller | The bioinformatics for Children's Mercy Hospital | USA
Jinghua Min | China Electronic Cyberspace Great Wall Co., Ltd. | China
Carlos Monroy | Rice University | US
Huansheng Ning | USTB | China
Arindam Pal | TCS Research | India
Lijun Qian | Prairie View A&M University | USA
Weining Qian | East China Normal University | China
Robby Robson | Eduworks Corporation | US
Angelo Simone Scotto | European Food Safety Authority | Italy
Priyaa Thavasimani | Newcastle University | UK
Alex Thomo | University of Victoria | Canada
Chongang Wang | InterDigital Communications | USA
Jianwu Wang | University of Maryland, Baltimore County | US
Shu-Lin Wang | National Taichung University of Science and Technology | Taiwan
Jens Weber | University of Victoria | Canada
Lingfei Wu | IBM Research | USA
Hao Xu | University of North Carolina at Chapel Hill | US
Godwin Yeboah | University of Warwick | UK
Tim Zimmerlin | Automation Technologies | US