March 2016 – INSPIRE-HEP Blog

Traditionally DOIs (Digital Object Identifiers) have been associated with published papers in the digital era, but papers are not the only research objects that physicists may want to search, use, and cite. We talked with Jim Simone of Fermilab about his efforts to get DOIs assigned to MILC collaboration datasets and to get records of them uploaded to INSPIRE.

How is Jim involved with the MILC collaboration?

Jim is a member of the FERMILAB-LATTICE collaboration, which works closely with MILC on scientific projects involving matrix elements and flavor physics. MILC generates datasets consisting of lattice gauge configuration files, which the collaboration has made openly available for others to use, as is increasingly required for federally funded research in the U.S.

What is the MILC collaboration’s connection to the International Lattice Data Grid (ILDG)?

Jim was an early organizer of the ILDG, a data grid intended to let collaborations share gauge configurations. The ILDG metadata catalog had its limitations: it held only a few kinds of metadata, which sometimes made it difficult for people to find what they were looking for. People involved with the project have been trying to fill in the gaps, including the biggest one: connecting the scientific papers produced from the data to the datasets themselves.

Rather than reinventing the wheel, ILDG is considering using INSPIRE as a catalog to connect papers with datasets. ILDG is currently used only by lattice scientists, so this would make the data findable and usable by all physicists, including HEP and nuclear phenomenologists. In INSPIRE today, a search starts from the papers and shows which configurations were used to obtain the results. In the upcoming version of INSPIRE, the Data collection will be made public, so a search can also start from an individual dataset and find the papers produced from those configurations.

_{INSPIRE record of MILC dataset that has been cited. http://dx.doi.org/10.15484/milc.asqtad.en05b/1178157}

_{References in INSPIRE record of a paper that cited MILC datasets.}

Why and how did Jim go about getting DOIs assigned to the datasets? What challenges did he face?

Jim believes DOIs, as public, persistent identifiers, are a natural mechanism for identifying the datasets, which are public, permanent, first-class data objects. With DOIs, the configurations will be better integrated into the ILDG and INSPIRE.

In the case of published papers, DOIs are assigned by publishers, but this route would not work for datasets. While INSPIRE is equipped to issue DOIs directly, MILC’s direct connection to the U.S. Department of Energy (DOE) made it practical for the DOIs to be issued by the DOE Office of Scientific and Technical Information (OSTI). In either case, DOIs are registered with the central agency DataCite.

ILDG has started a discussion on how other groups can get DOIs for their datasets. Outside the DOE, CERN also issues DOIs, and regional ILDG groups can help members get DOIs and serve as gatekeepers to keep the metadata clean and clear. DataCite can also help researchers find registration organizations.

For Jim, working with OSTI and interacting with their web services was a learning experience. Because one of his main concerns was findability, Jim wanted to include plenty of searchable metadata in the dataset records to help physicists find the particular configurations they wanted. This was more metadata than OSTI was used to receiving when minting DOIs, but they were able to accommodate Jim’s requests, and he considered them a great help throughout the entire process.

Beyond getting the DOIs assigned, another challenge was figuring out how citations should be marked up in papers, both in print and digitally. With the goal of making the datasets findable and identifiable, Jim and the ILDG wanted people to be able to see the DOI in a print version of a reference list as well as click it in a digital version. To make the process as transparent as possible for people citing the datasets, Jim worked with us to include citation instructions in the metadata of the INSPIRE records and OSTI records.

For researchers unsure of how to cite datasets that do not include specific citation guidelines in their metadata, DataCite and CrossRef have developed a DOI citation formatter that can take a DOI registered by either of these services and format its citation in a variety of styles.
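One way to reach this kind of formatting programmatically is DOI content negotiation, where the DOI resolver is asked for a formatted citation instead of a landing page. The following is a minimal Python sketch, not a definitive client. It assumes the `text/x-bibliography` media type with `style` and `locale` parameters as described in the Crossref and DataCite content negotiation documentation; the function name `build_citation_request` is hypothetical, and the request is only built, not sent.

```python
from urllib.parse import quote
from urllib.request import Request


def build_citation_request(doi: str, style: str = "apa", locale: str = "en-US") -> Request:
    """Build (but do not send) a request for a formatted citation of a DOI.

    Assumption: the resolver honours the text/x-bibliography media type
    with a citation style name and a locale, as documented for DOI
    content negotiation. The function name is illustrative only.
    """
    # Keep the slashes that separate the DOI prefix and suffix.
    url = "https://doi.org/" + quote(doi, safe="/")
    accept = f"text/x-bibliography; style={style}; locale={locale}"
    return Request(url, headers={"Accept": accept})


# DOI of the MILC dataset record mentioned in this post.
request = build_citation_request("10.15484/milc.asqtad.en05b/1178157")
print(request.full_url)
print(request.get_header("Accept"))

# Sending the request needs network access, for example:
#   from urllib.request import urlopen
#   with urlopen(request) as response:
#       print(response.read().decode("utf-8"))
```

Changing the `style` argument requests a different citation style; which style names are accepted depends on the service, so treat the default here as an example rather than a recommendation.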

When going through the publication process with a paper that used MILC configurations, Jim found the referees and copy editors weren’t familiar with how the citations should appear. Most objects with a DOI are published papers that can be cited in written format using a journal reference, volume, page range, etc., so the DOI is often left out of the text of a reference list. However, following this standard would not make the datasets adequately identifiable to the human eye.

The community known as FORCE11 (Future of Research Communication and e-Scholarship) has developed eight principles of data citation practice that place equal emphasis on human readability and machine-actionability. As these recommendations become more widely endorsed in research communities and researchers become accustomed to citing datasets in their papers, the issue of human-identifiable data citations will most likely be resolved.

What advice does Jim have for others looking to make their datasets more findable and citable?

Jim has two pieces of advice: get DOIs and mark up the metadata in a way that’s sensible for the community who will use the datasets. DataCite makes this simple by being explicit about its mandatory metadata requirements, while also allowing for additional recommended and optional metadata.
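As a rough illustration of checking a record against mandatory metadata before requesting a DOI, here is a small Python sketch. The field list is an assumption based on the properties DataCite has long listed as mandatory (identifier, creator, title, publisher, publication year); the exact set depends on the schema version, so check the current DataCite Metadata Schema. The function name and the values in the example record other than the DOI are placeholders, not the actual MILC metadata.

```python
from typing import Dict, List

# Assumed mandatory properties; verify against the DataCite schema
# version you are targeting before relying on this list.
MANDATORY_FIELDS = ("identifier", "creators", "titles", "publisher", "publicationYear")


def missing_mandatory(record: Dict[str, object]) -> List[str]:
    """Return the assumed-mandatory fields that are absent or empty."""
    return [field for field in MANDATORY_FIELDS if not record.get(field)]


# Placeholder record: only the DOI is taken from this post.
example_record = {
    "identifier": "10.15484/milc.asqtad.en05b/1178157",
    "creators": ["MILC Collaboration"],
    "titles": ["PLACEHOLDER TITLE"],
    "publisher": "PLACEHOLDER PUBLISHER",
    "publicationYear": "PLACEHOLDER YEAR",
    # Optional, community-specific metadata can be added alongside.
    "subjects": ["lattice gauge configurations"],
}

print(missing_mandatory(example_record))
print(missing_mandatory({"identifier": "10.15484/milc.asqtad.en05b/1178157"}))
```

A check like this only confirms that fields are present; deciding which additional, community-specific metadata to include is the part Jim's advice is really about.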

At INSPIRE we look forward to integrating more dataset DOIs into our records. Send your questions and comments about dataset DOIs in INSPIRE to feedback@inspirehep.net.

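The annual INSPIRE list of highly cited papers (INSPIRE Topcites) gives an overview of the previous year's hot topics. To keep the focus on high-energy physics, the list we publish counts only citations from core papers. To ensure breadth of coverage, we also provide a list of highly cited papers for each arXiv category.

Apart from a few papers in the middle of the list and the last few classic papers on quantum fluctuations, the 2015 list of the 40 most highly cited papers continues the recent trend and is largely the same as the previous year's. The top five papers of 2014 are still on this year's list. This year's number five [5], Maldacena's 1997 paper on AdS/CFT, received nearly 150 more citations than number six [6], the 2002 paper describing GEANT4, which ranked seventh last year.

The first new paper on the list is at number seven [7]: the Planck paper on cosmological parameters, an update of the 2013 Planck paper [3]. Since it appeared in February 2015 it has been cited more than 700 times, which brings the number of Planck papers on the list to four. February 2015 also saw a paper on inflation [27], likewise an update of a 2013 paper [30].

Number 15 is the second new paper on our list [15]. Published in 2014, it is the successor to the 2011 MadGraph5 paper [16] and describes a software package for automated computations at next-to-leading order in perturbation theory.

Of the papers on the list that were submitted to arXiv.org, 11 are from hep-ph and 4 are from hep-th. The two Higgs discovery papers are, of course, from hep-ex, and another 10 are from astro-ph: 8 from astro-ph.CO plus two 1998 supernova papers [10, 13], which would have been classified as astro-ph.CO had that subcategory existed when they were written. The astro-ph papers are all observational, so the numbers of theory and "experimental" papers are roughly equal. Classic papers on particle physics and cosmology published before the digital era appear on the list because of recent research and discoveries. Interestingly, the plots of their citations per year all show a clear upward trend.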

_{Citations per year for the classic papers at positions [39], [37], [33], [32], [31], [26], [20], and [23] on the list.}



DOIs and Lattice QCD gauge ensemble datasets

Overview of the highly cited papers of 2015 (Topcites 2015 edition)