Gobbledygook

Altmetrics: first we need the for what? and only then the how? OK?

Martin Fenner — Tue, 09 Jul 2013 12:27:50 +0000

Altmetrics track the impact of scholarly works in the social web. Article-Level Metrics focuses on articles, but also looks at traditional citations and usage statistics. The PLOS Article-Level Metrics project was started in 2008. The altmetrics manifesto was published in October 2010 and described the fundamental ideas. By October 2011 we had a number of altmetrics tools, fueled by the Mendeley/PLOS API programming contest. In 2012 the focus shifted from the fact that we can provide these numbers to a discussion of the many open questions. We could see this at the altmetrics12 conference in June, and even more so at the altmetrics workshop hosted by PLOS last week in San Francisco.

Altmetrics can provide a large amount of information about the post-publication activity around an article (and other scholarly content), and this is exciting, but at the same time also somewhat overwhelming and scary. Some of the things that we as a community have to figure out include standards for collecting, aggregating and displaying altmetrics data, strategies to combat attempts to game these metrics, and finding appropriate ways for the different organizations providing altmetrics to work together as a community. These and other topics were discussed in great detail at the PLOS altmetrics workshop, and we made excellent progress not least thanks to the excellent moderation by Cameron Neylon. The third day of the workshop was a hackathon, and we were able to translate some of the ideas into prototypes of new tools.

The most important conclusion from the workshop for me personally was that we really should focus on use cases. Altmetrics should help answer questions that we can’t answer today, and despite the promise, the various altmetrics tools still have a long way to go. A case in point is the promise that altmetrics can make it easier to find relevant scholarly content. We all use social media to help us find papers and other stuff, but integration of altmetrics into the traditional scholarly search tools is still missing. ReRank is a cool prototype developed during the hackathon last Saturday, but we are still a long way from having altmetrics feeding directly into the relevance sorting of search results.

With these thoughts in the back of my mind, I look forward to the altmetrics session at the SpotOn London conference this Sunday afternoon. Sarah Venis from Médecins sans Frontières (MSF) will talk about the questions that she hopes altmetrics can answer for her organization. MSF is very interested to look beyond citations for the impact of their publications, as their primary target audience is not really the scholarly community, but rather people in need in various parts of the world. Marie Boran from the Digital Enterprise Research Institute (DERI) is interested in using altmetrics as a recommendation tool to find researchers with similar interests. Euan Adie from altmetric.com and I (technical lead for the PLOS Article-Level Metrics project) will use our respective tools to try to answer some of these questions. For me altmetrics are primarily tools to tell a good story, and that is one reason why we picked the title Altmetrics beyond the Numbers for this session. The focus of the session will then shift to an open discussion, and I hope we can get some good answers to this and other questions.

A clear focus on use cases should go a long way to reduce that feeling of being overwhelmed by all the numbers that altmetrics can provide. If we have specific goals for which we need altmetrics, it becomes much easier to decide what numbers work best for us, what standards we need and whom to ask to collect this information. AJ Cann and Brian Kelly have written two excellent blog posts about the confusion that too many altmetrics numbers can create, and the workshop Assessing Social Media Impact during SpotOn London addresses some of these questions. Hackathons have played an important role in the history of altmetrics. I invite you to come to the SpotOn London hackathon this Saturday if you have some cool ideas and want to get started with the help of others.

Goodbye PLOS Blogs, Welcome Github Pages

Martin Fenner — Sat, 15 Jun 2013 06:12:51 +0000

This is the last Gobbledygook post on PLOS Blogs, and at the same time the first post at the new Github blog location. I have been blogging at PLOS Blogs since the PLOS Blogs Network was launched in September 2010, so this step wasn’t easy. But I have two good reasons.

In May 2012 I started to work as technical lead for the PLOS Article-Level Metrics project. Although this is contract work, and I also do other things – including spending 5% of my time as clinical researcher at Hannover Medical School – this created the awkward situation that I was never quite sure whether I was blogging as Martin Fenner or as someone working for PLOS. This was all in my head, as I never had any restrictions in my blogging from PLOS. With the recent launch of the PLOS Tech Blog there is now a good venue for the kind of topics I like to write about, and I have started to work on two posts for this new blog.

There will always be topics for which the PLOS Tech Blog is not a good fit, and for these posts I have launched the new personal blog at Github. But the main reason for this new blog is a technical one: I’m moving away from blogging on WordPress to writing my posts in markdown (a lightweight markup language), that are then transformed into static HTML pages using Jekyll and Pandoc. Last weekend I co-organized the workshop Scholarly Markdown together with Stian Haklev. A full workshop report will follow in another post, but the discussions before, at and after the workshop convinced me that Scholarly Markdown has a bright future and that it is time to move more of my writing to markdown. At the end of the workshop each participant suggested a todo item that he/she would be working on, and my todo item was “Think about document type where MD shines”. Markdown might be good for writing scientific papers, but I think it really shines in shorter scientific documents that can easily be shared with others. And blog posts are a perfect fit.

The new site is work in progress. Over time I will copy over all old blog posts from PLOS Blogs, and will work on the layout as well as additional features. Special thanks to Carl Boettiger for helping me to get started with Jekyll and Github pages.

re3data.org: registry of research data repositories launched

Martin Fenner — Sat, 01 Jun 2013 13:27:43 +0000

Earlier this week re3data.org – the Registry of Research Data Repositories – officially launched. The registry is nicely described in a preprint also published this week.

re3data.org offers researchers, funding organizations, libraries and publishers an overview of the heterogeneous research data repository landscape. Information icons help researchers to identify an adequate repository for the storage and reuse of their data.

I really like re3data.org, and that is not because I personally know several of the people involved in this project, or because they cited this blog in their preprint. I think that we are just at the beginning of building the infrastructure needed for research data management, and re3data.org fills an important need. In my opinion it is not enough to provide lists of research data repositories; we need additional information that can help guide researchers in selecting an appropriate research data repository. re3data.org has addressed this nicely by providing a vocabulary for the registration and description of research data repositories, and by creating a simple icon system:

Possible values for each icon. From http://dx.doi.org/10.7287/peerj.preprints.21v1

Future directions I would like re3data.org to take include:

Training and education. Researchers probably pick research data repositories mainly based on the familiarity of the repository within their community rather than the criteria developed by re3data.org. A lot more training and education is needed before researchers understand the importance of persistent identifiers, licenses and other criteria.
Integration. re3data.org can make it easier to integrate into existing scientific infrastructure, e.g. by using persistent identifiers such as DOIs for research data repositories, or by providing an API that makes it easier for other services to integrate re3data.org.
Governance. Whether or not scientific infrastructure such as re3data.org is accepted and used by the community depends on many factors, and governance is one of the most important ones. re3data.org should seek the support of other organizations, in particular from outside Germany. A governing board, re3data.org as an independent organization, and strategies to coordinate with similar efforts such as Databib are possible strategies.

Metrics and attribution: my thoughts for the panel at the ORCID-Dryad symposium on research attribution

Martin Fenner — Tue, 21 May 2013 14:09:56 +0000

This Thursday I take part in a panel discussion at the Joint ORCID – Dryad Symposium on Research Attribution. Together with Trish Groves (BMJ) and Christine Borgman (UCLA) I will discuss several aspects of attribution. Trish will speak about ethics, Christine will highlight problems, and I will add my perspective on metrics. This blog post summarizes the main points I want to make.

Oxford. Source: Wikimedia Commons

Scholarly metrics can be used in discovery tools, as business intelligence for funders, research organizations or publishers, and for research assessment. For all these scenarios – and in particular for research assessment – it is important to not only collect metrics for a particular journal publication, dataset or other research output, but to also link these metrics to the creators of that research output. That is why unique identifiers for researchers, and ORCID in particular, are so important for scholarly metrics, and this is also reflected in the ORCID membership of organizations such as Thomson Reuters, Elsevier/Scopus, Altmetric or F1000Prime who provide metrics in a variety of ways.

DORA

A good starting point for any discussion on metrics for research assessment is the San Francisco Declaration on Research Assessment (DORA) that was published last week, together with a set of editorials in several journals, including the Journal of Cell Biology, Molecular Biology of the Cell, EMBO Journal, Science, Journal of Cell Science, and eLife. The first three recommendations are a good starting point for the panel discussion:

Do not use journal-based metrics, such as Journal Impact Factors, as a surrogate measure of the quality of individual research articles, to assess an individual scientist’s contributions, or in hiring, promotion, or funding decisions.
Be explicit about the criteria used in evaluating the scientific productivity of grant applicants and clearly highlight, especially for early-stage investigators, that the scientific content of a paper is much more important than publication metrics or the identity of the journal in which it was published.
For the purposes of research assessment, consider the value and impact of all research outputs (including datasets and software) in addition to research publications, and consider a broad range of impact measures including qualitative indicators of research impact, such as influence on policy and practice.

Persistent Identifiers

Before we can collect any metrics, we need persistent identifiers for research outputs. Most journal articles now come with a DOI, but we should make it easier for smaller publishers to use DOIs, as cost unfortunately is still an issue.

Persistent identifiers for data are a much more complex issue, as there are a number of persistent identifiers out there (including DOIs, handles, ARKs and purls), in addition to all the domain-specific identifiers, e.g. for nucleotide sequences or protein structures. DataCite DOIs are probably the first choice for attribution, as this is their main use case and they have features that make attribution easier (e.g. familiar to researchers, funders and publishers, global resolver). There are many other use cases for identifiers for data (e.g. to identify temporary datasets in an ongoing experiment), and it is of course possible to use several identifiers for the same dataset. CrossRef is also issuing DOIs for datasets on behalf of their members, and the publisher PLOS is for example using CrossRef component DOIs for figures and supplementary information associated with a journal article, and is making them available via figshare.

Particular challenges with persistent identifiers for research data include different versions of a dataset, and aggregation of datasets (e.g. whether we want to cite the aggregate dataset, or a particular subset). Persistent identifiers for other research outputs are an even bigger challenge, e.g. how to uniquely identify scientific software.

In addition to persistent identifiers for research outputs, we also need persistent identifiers for researchers. ORCID is obviously a good candidate, as it focusses on attribution (by allowing researchers to claim their research outputs and by integration in many researcher workflows). But it is clear that ORCID is not the only persistent identifier for researchers, and that we need to link these identifiers, e.g. ORCID and ISNI.

Depending on how we want to aggregate the metrics we are interested in, we might also need persistent identifiers for institutions, for funding agencies and their grant IDs, and for resources such as particle accelerators or research vessels. Unfortunately much more work is needed in these areas.

Attribution

Attribution is then the next step, linking persistent identifiers for research outputs to their creators. Attribution is therefore essential for research assessment. The Amsterdam Manifesto on Data Citation Principles that came out of the Beyond the PDF 2 workshop in March is an excellent document, but is unfortunately missing the important step of linking persistent identifiers for data to the persistent identifiers of their creators.

One important issue related to attribution is the provenance of the claims. Has a researcher claimed authorship for a particular paper, is a data center linking creators to research data, or is a funder doing this? The ORCID registry is built around the concept of self-claims by authors, but will allow the other stakeholders to confirm these claims.

Metrics

Metrics for scholarly content fall into one of three categories:

Citations
Usage stats
Altmetrics

Altmetrics is a mixed bag of many different things, from sharing on social media such as Twitter or Facebook to more scholarly activities such as Mendeley bookmarks or F1000Prime reviews. I therefore expect the altmetrics category to over time further evolve into 2-3 sub-categories.

We are all familiar with citation-based metrics for journal articles. We currently see the long-overdue shift from journal-based citation metrics to article-level metrics (see #1 from the DORA statement above for the reasoning), and as the technical lead for the PLOS Article-Level Metrics project I of course welcome this shift in focus. We also see a trend towards opening up reference lists that will make citation-based metrics much more accessible, and the JISC Open Citations project by David Shotton and others is an important driver in this, as is the Open Bibliographic Data project by OKFN. Until open bibliographic data become the norm, we have to deal with different citation counts from different sources. PLOS is collecting citations from Web of Science, Scopus, CrossRef and PubMed Central, and the citation counts are highly correlated overall (e.g. R² = 0.87 for CrossRef and Scopus citations for 2009 PLOS Biology papers), but for some papers differ substantially. Similar to persistent identifiers, reference lists of publications should become part of the open e-infrastructure for science and not depend on proprietary systems. This makes citation metrics more transparent and easier to compare, and fosters research and innovation, in particular by smaller organizations.
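To make the comparison of citation sources concrete, here is a minimal sketch of how an R² value between two sets of citation counts could be computed. The counts are made up for illustration and are not PLOS data; the function name r_squared is my own.

```python
def r_squared(xs, ys):
    """Square of the Pearson correlation between two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least two values")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance and variances, all left unnormalized (the 1/n factors cancel).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov ** 2 / (var_x * var_y)


# Made-up citation counts for the same five articles from two sources.
crossref_counts = [12, 3, 40, 7, 19]
scopus_counts = [15, 4, 38, 9, 25]

print(round(r_squared(crossref_counts, scopus_counts), 2))
```

A high overall value can still hide individual articles where the two sources disagree strongly, so it is worth looking at per-article differences as well.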

The data citation community has adopted the journal article citation model, and we are starting to see more citations to datasets. Even though data citations look similar to citations of journal articles, many essential tools and services still don’t properly handle datasets. The Web of Knowledge Data Citation Index is an important step in the right direction, as is the new DataCite import tool for ORCID. Something that we should pay closer attention to is the citation counts of the paper(s) associated with a dataset. Maybe the major scientific impact is in the data, but scientific practice still dictates citing the corresponding paper and not the dataset itself (one of the reasons we see data journals being launched). The DataCite metadata can contain the persistent identifier of the corresponding journal article, thus making it possible to associate the citation count of the corresponding paper with the dataset. This approach is particularly important for datasets that are always part of a paper, as is the case for Dryad. One important consideration is that contributor lists may differ between journal article and dataset, or between related datasets.
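The idea of associating a paper's citation count with its dataset can be sketched roughly as follows. The record layout, field names, DOIs and citation counts are all hypothetical and heavily simplified; real DataCite metadata is richer and structured differently.

```python
# Hypothetical, simplified dataset records (not the real DataCite format).
datasets = [
    {
        "doi": "10.5555/example-dataset-1",
        "related_identifiers": [
            {"type": "DOI", "relation": "IsSupplementTo",
             "identifier": "10.5555/example-article-1"},
        ],
    },
    {
        "doi": "10.5555/example-dataset-2",
        "related_identifiers": [],  # no linked article recorded
    },
]

# Made-up citation counts keyed by article DOI.
article_citations = {"10.5555/example-article-1": 42}


def linked_article_citations(dataset, citations_by_doi):
    """Sum the citation counts of all articles a dataset links to by DOI."""
    return sum(
        citations_by_doi.get(rel["identifier"], 0)
        for rel in dataset.get("related_identifiers", [])
        if rel["type"] == "DOI"
    )


for dataset in datasets:
    print(dataset["doi"], linked_article_citations(dataset, article_citations))
```

The sketch only works where the link between dataset and article is actually recorded in the metadata, which is why this matters most for repositories where every dataset belongs to a paper.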

Another problem with data citation is that citation counts might not be the best way to reflect the scientific impact of a dataset. We are increasingly seeing usage stats for datasets, and DataCite for example has started in January to publish monthly stats for the most popular datasets by number of DOI resolutions. The #1 dataset in March was the raw data to a figure in a F1000Research article, hosted on figshare.

Similar to citations we see a strong trend for usage stats to move from aggregate numbers for journals to article-level metrics. COUNTER has released a draft code of practice for their PIRUS (Publisher and Institutional Repository Usage Statistics) standard in February, and increasing numbers of publishers and repository infrastructure providers such as IRUS-UK and OA-Statistics are providing usage stats for individual articles.

One challenge with usage stats, in particular with Open Access content, is that an article or other research output might be available in more than one place, e.g. publisher (or data center), disciplinary repository and institutional repository. For PLOS articles we don’t know the aggregated usage stats from institutional repositories, but we know that 17% of HTML pageviews and 33% of PDF downloads happen not at the PLOS website, but at PubMed Central.

Altmetrics provide new challenges, but they are also a more recent development compared to usage stats and citations. Similar to usage stats they are easier to game than citations, and for some altmetrics sources (e.g. Twitter) standardization is still difficult. Altmetrics do not necessarily measure impact, but sometimes rather reflect attention or self-promotion. We have just started to look into altmetrics beyond the numbers, e.g. who is tweeting, bookmarking or discussing a paper or dataset. Altmetrics provide the opportunity to show the broader social impact (as Mike Taylor from Elsevier explains it) of research, e.g. changing clinical practice or policies.

Contributions

One important aspect of attribution is contribution, i.e. what is the specific contribution by a researcher to a paper or other research output. An International Workshop on Contributorship and Scholarly Attribution was held together with the May 2012 ORCID Outreach Meeting to discuss this topic. Authorship position (e.g. first author, last author) is used in some metrics, but overall the contributor role is still poorly appreciated in most metrics. David Shotton has proposed a Scholarly Contributions and Roles Ontology (ScoRO), and suggests splitting authorship credit in percentage points based on relative contributions, but I haven’t seen these numbers used in the context of metrics.
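As a thought experiment, this is how percentage-based authorship credit could feed into a metric. This is only my own sketch with made-up names and numbers; it is not part of ScoRO or of any existing metrics service.

```python
def split_credit(citations, contributions):
    """Split a citation count by contribution percentages.

    contributions maps a contributor name to percentage points,
    and the percentages are assumed to add up to 100.
    """
    if sum(contributions.values()) != 100:
        raise ValueError("contribution percentages must add up to 100")
    return {name: citations * pct / 100 for name, pct in contributions.items()}


# Hypothetical paper with 40 citations and three contributors.
credit = split_credit(40, {"Author A": 50, "Author B": 30, "Author C": 20})
for name, share in credit.items():
    print(name, share)
```

Whether fractional credit like this is desirable is a separate question; the sketch only shows that the arithmetic is trivial once the percentages are recorded.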

Conclusions

Persistent identifiers for people, attribution and metrics are closely interrelated and we have seen a lot of exciting developments in this area in the last two years. The widespread adoption of ORCID identifiers by the research community will have a huge impact on scholarly metrics. But with all the excitement we should never forget that a) there will never be a single metric that can be used for research assessment, and b) that scientific content will always be more important than any metric. I look forward to a great panel discussion on Thursday, and welcome any feedback via comments, Twitter or email.

May 23, 2013: Post updated with minor corrections and additions.

New DataCite / ORCID Integration Tool

Martin Fenner — Sat, 18 May 2013 12:05:34 +0000

A new service allows researchers to add research datasets – and other content with DataCite DOIs, including all figshare content – to their ORCID profile by integrating with the DataCite Metadata Store. The tool is an adaption (or fork) of the CrossRef Metadata Search developed by Karl Ward, and was developed by Gudmundur Thorisson and myself as part of work in the EU-funded ODIN project. More details can be found here.

There are many things I like about this new DataCite/ORCID integration tool:

it makes it easier for researchers to get credit for their research outputs.
it shows the value of persistent identifiers for data, publications and people, and linking them together
it shows the Creative Commons licenses for DataCite content where this info is available, facilitating reuse of content
it demonstrates the power of open source (thanks CrossRef!), open collaboration, standard REST APIs, and lightweight programming (Sinatra/Ruby) and deployment (Vagrant, Amazon EC2, Rackspace) tools
it shows that we don’t need a single – often closed – system, but open services that build on top of each other using accepted community standards. Tools using the ORCID API can immediately reuse the new DataCite content, and altmetrics provided by ImpactStory are a good example

I want to explore some of these ideas in the panel Attribution: Managing Provenance, Ethics, and Metrics at the combined ORCID/Dryad Meeting in Oxford next Thursday.

Announcing Markdown for Science Workshop on June 8th

Martin Fenner — Wed, 08 May 2013 22:53:00 +0000

On Saturday June 8th – exactly a month from today – the PLOS San Francisco offices will host a workshop/hackathon about using markdown for science. A lot of people are experimenting with markdown for authoring scientific articles – see blog posts here, here or my post here, and the scientific manuscript here.

Markdown is a simple markup language for text, and is primarily used for HTML content on the web, but can also be converted to PDF, LaTeX and others. One challenge with markdown is that there are a number of slightly different “flavors” out there, from the original markdown to multimarkdown, github-flavored markdown and pandoc. Some of the advanced formatting of scientific documents – tables, citations, math – is still a challenge for markdown.

Will markdown become our next authoring format for scientific content? Will there be yet another flavor, scholarly markdown? How will markdown writing tools be different from LaTeX tools or Microsoft Word? If you care about any of these questions and are in or near San Francisco, join us for a full day on June 8th. Free registration is open at http://mdsci13.eventbrite.com. We are collecting workshop ideas at https://github.com/karthikram/markdown_science/wiki/workshop, the Twitter hashtag is #mdsci13.

This event is organized by Stian Haklev and myself, with generous support by a 1K Challenge prize from Force11, and hosting provided by PLOS.

Baby steps toward better metrics

Martin Fenner — Tue, 16 Apr 2013 08:22:18 +0000

Article-Level Metrics provide new ways to look at the impact of scholarly research. Two important concepts are a) to track metrics for individual scholarly articles instead of using numbers aggregated by journal, and b) to go beyond citations and also include usage stats and altmetrics.

Article-Level Metrics is also doing something else: instead of tracking impact by year, it looks at usage, altmetrics and citations in real-time. There might have been technical reasons to do so 20 years ago, but there really is no longer any reason why scholarly impact should be tracked on a yearly basis in 2013. Unfortunately there is one big stumbling block:

The publication date of a scholarly article is often difficult or impossible to obtain. Publication year may be the only available information.

A good example is CrossRef. They provide a lot of interesting metadata about an article and make this information available in a very nice search interface. But they only require the publisher to provide the publication year; information about the publication month and day is optional. There are many other examples of journals and services that just can’t tell you when exactly an article was published. This might have made sense when periodicals were printed on paper, but doesn’t work for digital content.
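One way for software to deal with this is to record how precise a publication date is, instead of silently filling in January 1st. The following is a minimal sketch under my own assumptions; the function names and the (date, precision) representation are not taken from CrossRef or from the PLOS ALM application.

```python
from datetime import date


def parse_publication_date(year, month=None, day=None):
    """Return (earliest possible date, precision) for a partial date."""
    if month is None:
        return date(year, 1, 1), "year"
    if day is None:
        return date(year, month, 1), "month"
    return date(year, month, day), "day"


def days_since_publication(publication, today):
    """Days since publication, or None if the exact day is unknown."""
    published_on, precision = publication
    if precision != "day":
        return None
    return (today - published_on).days


today = date(2013, 4, 26)
exact = parse_publication_date(2013, 4, 16)
year_only = parse_publication_date(2013)

print(days_since_publication(exact, today))
print(days_since_publication(year_only, today))
```

With year-only dates the second call cannot give a meaningful answer, and that is exactly the problem for metrics that are supposed to be tracked in real-time.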

You should be able to install my software in less than one hour – or why DevOps is important

Martin Fenner — Sun, 14 Apr 2013 10:45:48 +0000

Cameron Neylon yesterday wrote a great blog post about appropriate business models for shared scholarly communications infrastructure. This is an area I have also been thinking about a lot recently, and in this post I want to add a technical perspective (and an announcement) to the discussion.

DevOps is an important trend that brings software development and administration of IT infrastructure closer together. Agile software development, server virtualization, cloud infrastructure and software automation tools such as Chef, Puppet or CFEngine are important parts of DevOps, but it is really the collaborative aspect of IT administrators working much closer with software developers that defines DevOps. The end result is often faster and more stable software releases, and that is what users and customers care about.

This makes DevOps particularly relevant for all areas where innovation is important, and that of course includes tools and services for Open Science. We not only need infrastructure that facilitates software development (with services like Github, among many others), but we also have to streamline IT administration. The question is not whether you do your development in Java, Python, Ruby, PHP or Javascript, but how well you integrate your software development and IT administration. The shift towards web-based tools has centralized software installation and updates, but these web-based services are becoming increasingly complex and difficult to set up and administer. Running an institutional repository, research information system or a journal is a complex task. The software may be freely available as open source (e.g. DSpace, VIVO or Open Journal Systems), but the resources required to run such a service still make this a big investment.

Two solutions to this dilemma are either to pay a vendor for installation and maintenance, or to use the software as a service (SaaS) that is hosted somewhere else. While these two options are popular, they may not always be the best choices because they mean that you are locked in to a particular vendor or service provider, and that you may give expertise and direct access to your data away. I believe that these are helpful approaches for auxiliary services, but that ideally the core services of a library, publisher or other provider of scientific infrastructure should not be outsourced. Developing software for scientific infrastructure that you want organizations to install locally should therefore always include work on integration with IT infrastructure, and just providing manual installation instructions isn’t good enough anymore.

Article-Level Metrics (ALM) and the related altmetrics are becoming increasingly popular. The collection and display of this information is a complex process, as it requires the integration of information from several upstream APIs which may be temporarily unavailable, have changed their data format, or put up restrictions on how you can use the data. In turn this information has to be processed and aggregated, and then reliably be provided to downstream users. This kind of information gathering fits perfectly with a service provider model, and organizations such as Altmetric, ImpactStory and Plum Analytics offer it as a service. PLOS is collecting and displaying this information with its own tool. The simple reason is that PLOS started doing this several years before the services above became available, and none of them currently provide the same comprehensive set of information about citations, usage stats and altmetrics (although there are of course a lot of things they do better than the PLOS ALM application).

But there is also the question of whether Article-Level Metrics are a core service for every publisher and are best collected in-house. This not only makes it easier to collect information from some sources (e.g. usage stats or CrossRef citations), but also gives unrestricted access to the data in real-time. Since I took over as technical lead for the PLOS Article-Level Metrics project last May, I have therefore not only worked on improving the ALM application for PLOS, but we are also working hard on making it easier for other publishers to install and use the application. We want to provide an attractive alternative for organizations for which the service provider model is not the best option.

To that end I want to announce the latest feature, which allows the automated installation of the PLOS ALM application on an Amazon Web Services (AWS) EC2 instance. This option is great not only for setting up an ALM production service: because EC2 is priced by the hour (about $1 a day for a small EC2 instance) with no setup costs, it is also a great way for a publisher to test-drive the application, to analyze a particular set of papers from different publishers for a research project, or to set up a PLOS ALM server for a hackathon or workshop.

There are of course many options to automate software deployment on a production server, including the PaaS (platform as a service) providers Heroku, CloudFoundry and OpenShift, and the recently announced Amazon OpsWorks. I am a big fan of the Vagrant software development tool in combination with Chef for automation, and in March Vagrant added support for Amazon AWS. This makes deployment of the PLOS ALM application to AWS really simple:

Install Vagrant and the vagrant-aws plugin
Set up an Amazon Web Services account
Check out the PLOS ALM source code from Github
Run the command vagrant up --provider=aws

Step #4 took 898 sec or about 15 min on my computer (see screenshot), and at the end I had a PLOS ALM server where I could access the admin dashboard via the web interface. If you are familiar with Amazon Web Services – you have to think about the right size for the EC2 instance, an appropriate AMI, security groups, elastic IPs, and DNS service – then the whole process should be done in well under an hour. I will use this instance to load and analyze some articles from a publisher for a presentation next week. When I’m done, another command (vagrant destroy) will destroy this server and Amazon will stop billing me. During testing I have created and destroyed many servers, and the vagrant-aws video shows you how easy this process is.
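For those who want to script the steps above, here is a rough sketch that prints the commands and only runs them when asked to. The repository URL and checkout directory are placeholders, not the real PLOS ALM location, and the sketch assumes that the AWS account already exists and that the Vagrantfile in the checkout carries the AWS settings.

```python
import subprocess


def deployment_steps(repo_url, checkout_dir="alm"):
    """Return (command, working directory) pairs for the deployment.

    A working directory of None means the current directory.
    """
    return [
        (["vagrant", "plugin", "install", "vagrant-aws"], None),
        (["git", "clone", repo_url, checkout_dir], None),
        (["vagrant", "up", "--provider=aws"], checkout_dir),
    ]


def run_steps(steps, dry_run=True):
    """Print each command, and execute it unless dry_run is set."""
    for command, cwd in steps:
        location = "." if cwd is None else cwd
        print("[" + location + "] " + " ".join(command))
        if not dry_run:
            # check=True stops at the first command that fails.
            subprocess.run(command, cwd=cwd, check=True)


# Placeholder URL: replace with the actual location of the source code.
REPO_URL = "https://example.org/alm.git"

run_steps(deployment_steps(REPO_URL))  # dry run: prints the commands only
```

Calling run_steps with dry_run=False would actually start an EC2 instance and therefore cost money, so remember to run vagrant destroy in the checkout directory when you are done.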

At this stage the installation process is working (and has been working for a local Virtualbox install for many months), but needs testing and documentation. I therefore invite everyone interested in testing this out to contact me so that we can make this well-documented and working reliably.

Mendeley and Elsevier

Martin Fenner — Thu, 11 Apr 2013 05:50:36 +0000

Earlier this week the rumors that started in January became official: Elsevier is buying Mendeley (see also here). A lot has been written about this announcement, in particular about the fear that Mendeley as a product and organization will turn into something not as open and collaborative as before.

I first met Victor and Jan from Mendeley in 2008 and did an interview with Victor in September 2008. We worked together in the organization of two Science Online London conferences (2009 and 2010, together with Nature.com and others), and my current job started with an entry for an API programming contest co-organized by PLOS and Mendeley, with the first lines of code written in the Mendeley offices during the Science Online London 2011 hackathon. I wish Mendeley all the best with their new parent.

What this acquisition signals to me is that commercial publishers are now moving into the business of software tools for scientists at full speed. They have always done this, but with ReadCube by Digital Science (a Nature Publishing Group sister company) in 2011, the acquisition of Papers by Springer last year and now Mendeley, reference management now often means using a tool owned by a publisher. This market used to be dominated by academic software such as Zotero and commercial software vendors such as Thomson Reuters (EndNote) or ProQuest (RefWorks).

For me this trend signals that publishers have realized that we are moving into an Open Access publishing model, which in contrast to subscription publishing is not about owning the content, but about providing valuable services around content that is free to read and reuse.

Comment: the case for open preprints in biology

Martin Fenner — Sat, 30 Mar 2013 09:56:05 +0000

Last week Philippe Desjardins-Proulx et al. published the article The case for open preprints in biology – naturally as a preprint on figshare. The article sees preprint servers as a great opportunity for open science, and discusses the status of preprints in the biological sciences. In this blog post I want to add some comments to the text.

E-BIOMED

What is now PubMed Central started out as E-BIOMED in 1999 and initially was envisioned to include a repository for preprints. It is important to look back at what happened then, and why the preprint repository was dropped from what then became PubMed Central. Harold Varmus talks a bit about this in this interview from 2006.

Nature Precedings

The article talks about why biologists have not developed a culture of sharing preprints. It would be good to mention Nature Precedings, a preprint server for the life sciences started in 2007 that stopped taking new submissions in 2012. This blog post on RetractionWatch cites the announcement by Nature Publishing Group (which doesn’t explain why the service was shut down), and there are a good number of interesting comments.

SSRN

Preprints in other disciplines are mentioned in the text, in particular arXiv, but also RePEc. I would also include SSRN (Social Science Research Network), which uses a different model, but is as important for the working paper and preprint culture in the social sciences as arXiv is in physics/mathematics.

Google Scholar Metrics

In April 2012 Google launched Google Scholar Metrics, listing the top 100 publications (according to their h5-index) in several disciplines. Six out of the top 10 publications in physics/mathematics are arXiv sections (arXiv Astrophysics (astro-ph) is #2), the IZA Discussion Papers are #1 in Social Sciences, and the NBER Working Papers are #1 in Economics. arXiv Astrophysics (astro-ph) is also #12 on the top 100 list for all disciplines (#1-5 are journals in biology and medicine: Nature, New England Journal of Medicine, Science, Lancet, Cell). All these metrics are a strong indicator that preprints can be highly cited.
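For readers not familiar with the metric: the h5-index is the h-index restricted to articles published in the last five complete years, i.e. the largest number h such that h of those articles have at least h citations each. A minimal sketch of that calculation follows; the function name and the citation counts in the usage note are made up for illustration, and the caller is assumed to have already filtered the articles to the five-year window.

```python
from typing import Iterable


def h_index(citations: Iterable[int]) -> int:
    """Largest h such that h articles have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank  # the top `rank` articles all have >= rank citations
        else:
            break  # counts only get smaller from here on
    return h
```

For the made-up citation counts 10, 8, 5, 4 and 3 the function returns 4: four articles have at least four citations each, but there are not five with at least five.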

Citation Advantage of Preprints

Anne Gentil-Beccot et al. have written a nice paper (of course available as a preprint) that shows that publication as a preprint not only increases the citation rate for the corresponding peer-reviewed article published later, but also leads to much faster citations, with a peak immediately after publication.

Average number of citations per article per month as a function of the time of the citation relative to the time of publication. From http://arxiv.org/abs/0906.5418

SCOAP3

The Sponsoring Consortium for Open Access Publications in High Energy Physics (SCOAP3) is working on turning the majority of peer-reviewed publications in high energy physics into gold open access. It is important to understand that the high energy physics community feels that they need peer-reviewed journal articles in addition to arXiv.

Preprint Culture in Clinical Medicine

It is a little-known fact that there is a strong preprint culture in clinical medicine. I wrote about this topic in October 2010. Clinical trials have to be registered before starting the trial, and information about the trial is publicly available in clinicaltrials.gov and other registries. Results are presented at conferences (as poster or oral presentation), at which stage they become public information. The peer-reviewed paper – with a few exceptions – follows much later, sometimes even after drug approval by the FDA (in the blog post I used the TROPIC trial as an example). The problem is of course that information in oral presentations and posters is incomplete and difficult to find. But publication of a clinical trial in a peer-reviewed journal is more about giving credit to the researchers involved (similar to SCOAP3 in high energy physics) than about spreading the knowledge. Peer review is not an appropriate filter for whether or not a new drug or drug combination should be used to treat patients – the approval process by regulatory authorities is much more extensive than any peer review can ever be.

Preprint culture in biology

The paper mentions several reasons why the field of biology has essentially no preprint culture. One argument against preprints is that it would be easier to steal ideas. Although I agree with the authors that preprints are a great way to establish precedence, there is a big difference between research based on years of work using expensive equipment (as is often the case in high energy physics but also some other fields), and research that can be reproduced in a few weeks. In the latter case it is possible that someone else is faster in publishing the peer-reviewed paper. Another difference is the community: “stealing” ideas from someone else is probably more difficult in smaller scientific communities, and some scientific communities are more competitive and less collaborative than others.

Another concern about preprints raised in the paper is the Ingelfinger rule, i.e. the uncertainty that a journal would accept a manuscript if already published as a preprint. This concern is fortunately unfounded regarding most publishers, and the paper includes a table listing the preprint policies of important publishers in biology.

I would like to add two other reasons why the preprint culture is probably not established in biology. First, preprints are competition for the peer-reviewed journal article, and scholarly publishers might not be particularly interested in encouraging a preprint culture. A lot has fortunately changed since E-BIOMED in 1999.

Finally, whereas some disciplines use preprints and working papers to communicate, in biology the preferred way to communicate research findings before publication of a peer-reviewed paper is the oral presentation. What we may need is a service that makes it easy to upload and share scientific presentations. We already have, for example, Slideshare and Speaker Deck as generic tools, and SciVee and figshare aimed at scientists. Speaker Deck is currently my favorite tool and is a GitHub product (GitHub has been mentioned in the manuscript as an option for hosting preprints). Maybe what is missing is a killer combination of features in a new or existing service – persistent identifiers, uploading of background material (text, data, software, video) in addition to the slides, non-textual search, cooperation with conference organizers, etc. – for presentation sharing to take off as a way to establish a preprint culture in biology.

Philippe Desjardins-Proulx, Ethan P. White, Joel Adamson, Karthik Ram, Timothée Poisot, Dominique Gravel. The case for open preprints in biology. Figshare; 2013. Available from: http://dx.doi.org/10.6084/M9.FIGSHARE.655710

Update 4/4/13: Yesterday PeerJ launched a new preprint service for life sciences research. Read this blog post for details, and this post on the Mendeley blog.