Scientific publication beyond the text: Sharing research objects (summary with videos)
A few weeks ago the Center for Reproducible Neuroscience, in collaboration with the Meta-Research Innovation Center at Stanford (METRICS), the Data Science Initiative (DSI), and the Center for Population Health Sciences (CPHS), presented an all-day conference titled “Scientific publication beyond the text: Sharing research objects”. It was a fantastic day filled with interdisciplinary lectures on the emerging landscape of platforms for sharing code, data, and other research objects. Our excellent lineup of speakers, drawn from a wide range of backgrounds and expertise, allowed us to explore the rapidly growing ecosystem of available tools and platforms. Below we share the videos, with a summary of each presentation.
We started the conference with introductory remarks from Russell Poldrack (Professor, Stanford Psychology). He set out the goal of the conference: to start a conversation about how scientific publication might look moving forward. He began by establishing the foundation of scientific publishing: being able to replicate each other’s claims. Publishing has changed rapidly in recent years with advances in technology. One extension of publishing is sharing data, for which neuroimaging offers a successful model. Another is sharing code, through platforms such as GitHub. This can be extended further to sharing operating-system configurations so that analysis pipelines can be rerun by others, a problem addressed by software such as Docker containers. All of this matters because of growing concern across many fields about the inability to reproduce scientific results. Russell Poldrack and his Center for Reproducible Neuroscience develop tools and platforms to facilitate data sharing and reproducible pipelines.
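As a minimal sketch of the container approach mentioned above (the script name and package versions are hypothetical, not from any talk), a Dockerfile can pin an entire analysis environment so that a pipeline runs identically on any machine:

```dockerfile
# Hypothetical example: pin the OS image and exact dependency
# versions so the analysis environment is reproducible elsewhere.
FROM python:3.8-slim

# Exact package versions avoid "works on my machine" drift.
RUN pip install numpy==1.18.1 pandas==1.0.1 nibabel==3.0.2

# Copy the analysis code into the image and run it by default.
COPY analysis.py /app/analysis.py
WORKDIR /app
CMD ["python", "analysis.py"]
```

Anyone with Docker can then rebuild and rerun the pipeline without manually recreating the original software stack.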
Michael Frank began his talk by placing us in his research domain, child language, to provide a use case for how he thinks about metascience issues. In his field there is an additional crisis concerning theories and how they should be represented. Theory and data interact, and a reproducible ecosystem can provide the backbone for theoretical insights.
He argued that it may not be a theoretical crisis, using the case of children’s early word learning. The case centered on a computational model he developed as a graduate student. The critical underpinnings of this model and others were grounded in theory; the problem may therefore lie not with theory but with data. With more data we can train more sophisticated computational models, whereas our current models fall short in connecting with outcomes. His lab went further and tried to reproduce the results of 35 articles published in Cognition; only 13 could be reproduced. The others could not be reproduced for several reasons, such as an unclear analytic process or reported statistical values that could not be recomputed. Notably, none of the articles’ conclusions changed based on this work.
He then turned our attention to the data collection and dissemination process. There are many resources available, but none provide rich, structured, standardized data. The data need to be able to validate a computational model. Ultimately, he argued for domain-specific repositories housing theoretically relevant constructs. In addition, such repositories must be open, programmatically accessible, and share metadata.
Dr. Frank spent the remainder of his talk diving deeper into three repositories: childes-db, MetaLab, and Wordbank. CHILDES is an open database of child language with a shared research environment loaded with open tools and resources. childes-db is an interface built on top of CHILDES that facilitates flexibility and reproducibility: an API lets researchers access CHILDES programmatically and visualize the data they want online, facilitating hypothesis generation, and in-platform interactive visualizations support rapid prototyping. We then shifted to a repository focused on capturing how kids do in experiments: MetaLab, a database of meta-analyses for cognitive development research. It takes a data-driven approach to developing and validating expert-driven child language development curves, using visualization tools similar to those in the childes-db project. An advantage of domain-specific repositories is the additional incentive for researchers to deposit their data: they know their own field will reuse it, unlike with more general data-sharing repositories. The final repository discussed was Wordbank, which aims to investigate broad trends in language learning and outcomes. An internationally standardized form, the MacArthur-Bates Communicative Development Inventory (CDI), is used to capture child language and helps produce the normative datasets that lay the foundation for Wordbank. A set of interactives, similar to those of the previous repositories, allows researchers to quickly evaluate language trends across cultures.
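Databases of meta-analyses like MetaLab rest on standard meta-analytic machinery. As a rough sketch (not MetaLab’s actual code, and with made-up effect sizes), a fixed-effect meta-analysis pools per-study effect sizes by weighting each study by the inverse of its variance, so that more precise studies count more:

```python
# Sketch of an inverse-variance (fixed-effect) meta-analysis,
# the kind of pooling that underlies meta-analytic databases.
# The effect sizes and variances below are invented for illustration.

def fixed_effect_meta(effects, variances):
    """Pool per-study effect sizes, weighting each by 1/variance."""
    weights = [1.0 / v for v in variances]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    pooled_variance = 1.0 / sum(weights)
    return pooled, pooled_variance

# Three hypothetical studies estimating the same developmental effect.
effects = [0.40, 0.55, 0.30]    # standardized effect sizes (Cohen's d)
variances = [0.04, 0.09, 0.02]  # squared standard errors

pooled, var = fixed_effect_meta(effects, variances)
print(round(pooled, 3), round(var, 4))
```

The pooled estimate lands closest to the most precise (lowest-variance) study, which is the behavior that makes aggregating many small developmental studies informative.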
Michelle focused her talk on ethical considerations when sharing data, summed up in five big questions: (1) is “unfair use” a thing? (2) how much should competitive harm matter? (3) how can we give incentives to share, and should we try? (4) whose interests are data holders protecting? and (5) how can we ensure accountability? We considered these questions through hypothetical cases: the attendees formed groups and talked among themselves before sharing with the room. These discussions let us dive deeper into the questions and hear differing perspectives from the diverse set of attendees.
Discussion of the first question led us to the idea of weaponizing data sharing: cases were brought up illustrating how data sharing can be used by malicious actors to try to discredit researchers’ work, particularly work with policy or political implications. In evaluating the second question, we explored the potential benefits and harms of data and code sharing across industries. Each industry has its own frameworks and processes, which change who would benefit and who would be harmed if sharing were mandatory; the requesters of the data can also differ drastically across industries. The third question revolved around incentive structures and how they can be made fair. Structuring an incentive to participate in data sharing is very challenging because of issues of fairness, the potential to be scooped, reduced incentives for primary researchers, the time and effort required to share, authorship, and high variance across fields. The fourth question explored whose interests the data holder is protecting: there may be no explicit regulations governing data sharing, so it falls on the data holder to protect their study participants. The fifth question focused on accountability. While we were not able to discuss this topic fully, the takeaway was the need to rethink our regulatory model of data privacy and sharing, with a few points of focus: regulating uses rather than exchanges, shifting from individual consent forms to a representative-group approach, ending data exceptionalism, and soliciting more general input on data sharing from more stakeholders.
Michael’s talk wove together Stanford and non-Stanford initiatives toward a more open publication ecosystem. He opened his presentation by discussing the details of Open Access coming to Stanford, whose highlighted features are no cost to anyone and free availability with unrestricted use and remixing. There are tradeoffs in moving to an Open Access model, with issues related to discoverability, influence, and authenticity; these are remedied using the Stanford Digital Repository, hosted by the Stanford Libraries. Emphasis was put on research articles as functional tools that give way to the next generation of research. We can also look at the peer review process and how (and whether) a paper makes it to the peer review stage of publishing. Peer review is a crucial part of the scientific process, and we heard about the several different ways publishers handle this aspect of science.
Repositories can also play a role in preserving research objects such as publications and code. Publishers can deposit these objects into repositories so that, should a publisher go out of business, the objects are preserved.
We were presented with a case from the digital humanities: ORBIS, a geospatial network model of the Roman world, which illustrates the role of sharing research through an interactive visualization. The Stanford Libraries have also been involved in digitizing historical documents and artworks as an open data-sharing initiative; the challenge is locating and comparing various historical works that have been moved around throughout history.
Lindsey’s presentation further explored tools and platforms for performing reproducible research, primarily the Jupyter project ecosystem. Her domain of expertise is the geosciences, and she detailed how research is performed there: tackling the field’s challenging questions takes a combination of domain knowledge, data engineering, data science, and software engineering. As a research project evolves, it becomes natural to ask who the work is for and which research outputs to share with a diverse audience, since there are differing levels of audience and engagement. Prior to publication, the concerns are reproducibility and accessibility; after the work has been published, they shift to explorability and extensibility.
We then explored the scientific software ecosystem built on Python. The programming language sets the foundation for the core software to sit on; specialized software calls the core functions; and the outer layer consists of domain-specific software packages. This led into learning more about Jupyter, an ecosystem of open tools and standards for running interactive computations, supported by a strong open community. JupyterLab is an extension of Jupyter notebooks that offers a full suite of tools and interfaces. Jupyter can also be extended to run on the cloud or at high-performance computing centers through the JupyterHub project, which creates a shared environment where researchers can collaborate and preserve consistency. We learned about two projects implementing this ecosystem, Syzygy and Callysto, both based in Canada and both aimed at integrating Jupyter into education: Syzygy serves university students and Callysto serves grades 5-12, taking advantage of Canada’s high-performance computing resources. Another project, Pangeo, seeks to study the Earth utilizing cloud computing resources, using Jupyter for its interactive features.
Once the research has been performed, we need a platform for sharing our research objects, which is what the Binder project seeks to provide. Binder can take a public git repository and generate a shareable, interactive, and reproducible environment for your research code. It is built on a combination of repo2docker, which generates the environment, and JupyterHub, which serves it. The case presented was the collision of two black holes detected by the LIGO collaboration. Notebooks can also be bound together and presented as a textbook through the Jupyter Book project. Lastly, these notebooks and textbooks can serve as educational materials thanks to Jupyter’s interactive controls. This advancement gives everyone, regardless of means, the ability to run analyses and visualize the outputs to gain an intuition for the results.
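As a sketch of this workflow (the repository URL below is hypothetical, and a local Docker installation is assumed), the same repo2docker tool that powers Binder can be run on your own machine to build and serve a repository’s environment:

```shell
# Install the repo2docker command-line tool (requires Docker).
pip install jupyter-repo2docker

# Build an image from a public git repository's configuration files
# (e.g. requirements.txt) and launch a Jupyter server inside it.
jupyter-repo2docker https://github.com/some-user/my-analysis
```

Because the environment is rebuilt from the repository’s own configuration files, the notebooks run for a reader the same way they ran for the authors.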
Alex began by putting into context his prior work at Microsoft Research exploring data-intensive science and how to communicate such work along with its additional research objects (e.g., code). He provided a few examples of projects he worked on: an ontology add-in for Word, chemistry drawing in Word, and GenePattern reproducible research. Now working at a philanthropic organization, his team’s near-term (10-year) goal is to fund and support new tools and platforms, based on open and collaborative research, that help accelerate science. The open-science investments aim to help drive platforms, people, processes, knowledge discovery, open standards, and computational capacity.
Alex then laid out some of the attributes he thinks the paper of the future should satisfy: among others, it should be open, accessible, persistent, verifiable, machine readable, reusable, referenceable, and discoverable. Preprints are a good environment in which to test and experiment with different models that move science closer to the paper of the future. Preprints are a relatively new way to share research, and newer still in the biosciences (~2013), where adoption is now rapid. They enable use cases that don’t typically get published (e.g., null results), and the next step is using preprints to feed back into the research process. Tools are being built around preprints to improve accessibility. Like preprints, preregistration and registered reports are challenging typical publishing processes. There are potentially missing layers needed to promote integration and interoperability across all of our research objects. Alex wrapped up his talk with the coronavirus outbreak as an illustration of the power of preprints for rapidly contributing knowledge about a global health concern. It also revealed the need for methods that can weed out bad preprints; currently, there is no incentive structure to do this effectively.
We would like to thank all of our speakers for their presentations and everyone who attended and tuned in!