richelbilderbeek committed Dec 20, 2022
1 parent c9a3622 commit 2727006
Showing 1 changed file with 145 additions and 0 deletions.
145 changes: 145 additions & 0 deletions feedback/feedback_20221209.md
@@ -0,0 +1,145 @@
# Feedback 2022-12-09

Dear Richel,

Many thanks again for participating in the anthology “Perspectives to Paradata - Research and Practices of Documenting Data Processes”. Refereeing for your chapter has been completed and I’m glad to say that the reviewers thought it was an interesting and relevant contribution to the volume that should be accepted with only minor revisions.

You can find the comments of the two independent reviewers below. Please also make a specific note of the following feedback:

* Think about the line of argument from the beginning of the chapter to the end. It works mostly rather well, but sometimes you have rather abrupt transitions and individual claims and notes that are not discussed in much detail, which makes following the text unnecessarily difficult. You have examples, both in the introduction and in the genetic epidemiology section, but you use them rather little in the text. Using these fields to exemplify why publishing code, taking care of sensitive data etc. matters would make your argument much stronger.
* I am also inclined to agree with the first reviewer that it would improve the text if you nuanced your discussion a little by acknowledging that there are also problems and limitations to treating code as a perfect form of paradata and, e.g., to assuming that keeping it more or less automatically makes research reproducible.
* Be careful with the style. A book chapter like this can be somewhat more conversational than an average journal article, but colloquialisms are still not exactly encouraged. Check e.g. for contractions (I’ll vs. I will etc.).
* When you switch from the Vancouver system to the (Springer) Harvard system of referencing, check at the same time how you refer to earlier work. Preferably spell out researchers' names and studies, e.g. “example is described in [11]” ⇒ example is described by XX and YY.
* Remember to tell what the things you mention are. Right now it might sound unnecessary to tell what GitHub is but in a few years no one might remember anymore what it was.
* You have a tendency to make strong claims (which is fine) but make sure that you don’t contradict yourself and make strong claims that are too easy to challenge. E.g. you insinuate that code is not published that often but write also “Code is commonly published alongside of a paper, pasted as text in the Supplementary Materials” ⇒ that code would be commonly published.
* The abstract reads right now more like a part of the argument than an abstract (summary, description) of the chapter as a whole.
* Introduction
  * You could think about the structure of the introduction to make the line of argument clearer.
  * The vignette in the beginning of the introduction is good but does not connect with what follows.
  * p1 l 43 The sentence starting here would need references.
  * p2 Paradata figure is really informative. Considering that there is really no consensus on what paradata is, it might be good to formulate the text as a proposition i.e. if paradata is understood in this way, it would imply that.
* The usual order of conclusions and discussion is first to have a discussion and then a conclusion. Also, the discussion now reads more like a recap of what has been said already, with material that sounds more like (practical) implications. Maybe the recap could be done in the conclusions and the practical implications could be placed right before them in a separate section?

In addition to this, check the beginning of your chapter and make sure that it clearly states up front:

* What is the domain discussed (e.g. archaeology, survey research) explained in terms that make sense for a non-expert.
* A very brief problem statement: Why paradata (or something related to it) is an issue in that domain i.e. why paradata is (not) needed, why it is (not) difficult etc.

This helps non-specialists in your area of expertise to understand the context of your chapter and the reason why it was written and included in this volume.

Additional things to verify are:
* Check that you have all the references in the list of references and that you cited all references in the list.
* Check that the appropriate form of data in plural is used, i.e. data are.
* When revising your text, please try to be mindful of the indicative max. number of words per chapter (5 000 words + references, in practice an ideal total number of words is around 7000).
* Check that you followed all of the stipulations in the author instructions (September 2022) for the Perspectives to Paradata volume at https://uppsala.box.com/s/r19l07ex4kdxwspz8400ojoz6tsseze0 in the Material for authors folder.

Please consider these comments and make appropriate changes to address them in your manuscript. Submit the following documents directly to me by email:

* Revised chapter manuscript
* Short letter, briefly describing major revisions and/or any decisions not to revise in accordance with reviewer comments.

Send the two documents to me by email by Feb 13, 2023 at the latest. There is also a possibility to join a Q&A on Zoom about paper revisions on Jan 17th, 2023, 5pm-6pm CEST at https://uu-se.zoom.us/j/69966856109 (passcode: PARADATA).

Best wishes,


Isto


-------------------

Review 1

Overall Comments
* The paper seems to purposefully have a provocative title and theme, and I think that's OK. However, we do need to consider the audience and whether they will know that it is controversial and why. A computer science audience would know, and would probably be able to come up with multiple arguments against the theme, possibly pushing back very strongly. This would be an interesting conversation; I don't mean to imply that the theme of this chapter is absolutely wrong.
* The three biggest arguments against the thesis are:
  * Code is generally considered to be difficult, sometimes near impossible, to understand. So, while code might be a clear and complete encoding of the paradata for some research, it is a poor way to communicate that paradata to humans.
  * Programmers these days typically write code that draws on libraries and interfaces written by others, and so is infrequently self-contained. Thus, even if the code written for some research happens to be extraordinarily clear for humans, understanding it fully likely requires great contextual knowledge about the software written by others that is merely reused. Even if code is "self contained", its operation may be altered by tools used to compile or run it (compiler, environments like Matlab, even operating systems and hardware).
  * It is generally considered to be difficult to predict the behavior of complex software, even if one understands the code. This is a fundamental outgrowth of the discrete nature of software. Therefore, while the code may be a complete encoding of the paradata that produced certain data, even if the software is understood it may not be the case that the underlying meaning behind the data is understandable. Thus, the paper becomes not mere optional metadata, but essential interpretation of something that otherwise would be opaque to anyone who didn't write the code (and, often, after a short while, even those who wrote the code).
* So, some might argue that the code is only trivially considerable as paradata, or that it is paradata that is only consumable by computers, and that the paper (or supplements to the paper) is essential to understand the paradata.
* The abstract shades the more provocative title. It is certainly true that the code is the true encoding of the paradata and is primary when there is disagreement with the research paper. Moreover, open/reproducible science would demand that the code be supplied. That doesn't mean that the code is stand-alone.
* Overall, I think this chapter could trigger interesting debates of many kinds, but given the venue it would be good if this chapter embodied the possible debates, rather than merely taking one point of view and leaving it to the reader to develop (unanswered) counter-arguments in their mind. However, I recognize that that opinion may not match the opinion of the editors.
Detailed Comments
* Introduction
* It is often the case that the code changes over time as the research progresses. Guaranteeing that there is a perfect match between code and results can also be difficult; in fact, it can easily be the case that different results in the same paper connect to different versions of the code, often with subtle differences. So, it may not be the case that an experiment can be re-done easily without careful data/software provenance that carefully tracks the precise version of code that produced each result (1,2).
* Having written that, it is an excellent point that computational research that does not supply software is inherently less valuable, and under some circumstances potentially suspect.
* Not only does supplying software better support reproducibility and so increase trust in conclusions, it also enables discovery of bugs that can invalidate results. Producing results that are potentially falsifiable is inherently more valuable than producing results that cannot (easily) be falsified. (Some would argue that the latter is not even science.)
* Figure 1: The text in the figure is too small and cannot be read without significantly magnifying the document.
* Is figure 1 simplistic? Does metadata only pertain to paradata in computational research (left)? In other words, can one only reconstruct metadata given output data and code? Maybe. However, it certainly is the case that a paper is not mere metadata; doesn't a paper also interpret output data, assigning meaning and context within the field of study (right)? Is that really something one can extract from the code?
* The ending of the introduction shades the claims of this chapter even more. Perhaps what is meant is that the code is primary and the methods section of a paper is secondary. I'm not suggesting that you gut the provocation of this paper's title/theme, but maybe that you shade it more clearly, and maybe earlier, as written deliberately to provoke thought and conversation about paradata and eScience.
* But I will say that reading the code rarely allows one to assess the trustworthiness of the experiment, because it is a rare situation where the code can be read and understood!
* OTOH, having the code may allow reuse, assuming that there is a reliable mechanism to collect together the execution environment, including correct versions of all of the third-party dependencies.
* Why is table S1 so far from its internal use? Note to the editors: there may be a need to have a single glossary for the book, or a consistent format/location for glossaries within each chapter.
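
To make the provenance point in the first Introduction comment above concrete, here is a minimal Python sketch of pairing each reported result with a fingerprint of the exact code that produced it. It is an illustration under stated assumptions, not the chapter author's or the cited works' method: the function names `provenance_record` and `same_code`, the record fields, and the example result names are all hypothetical, and a plain SHA-256 of the source text stands in for a real version-control commit identifier.

```python
import hashlib
import json
import platform
import sys
from datetime import datetime, timezone


def provenance_record(code_text: str, result_name: str, result_value: float) -> dict:
    """Pair one result with a fingerprint of the code that produced it.

    Hypothetical helper: a real workflow would more likely store a
    version-control commit identifier instead of hashing the source text.
    """
    return {
        "result_name": result_name,
        "result_value": result_value,
        # SHA-256 of the source text: differs whenever the code differs
        "code_sha256": hashlib.sha256(code_text.encode("utf-8")).hexdigest(),
        # Coarse description of the execution environment
        "python_version": platform.python_version(),
        "platform": sys.platform,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }


def same_code(record_a: dict, record_b: dict) -> bool:
    """Return True if two results were produced by byte-identical code."""
    return record_a["code_sha256"] == record_b["code_sha256"]


if __name__ == "__main__":
    # Two subtly different versions of the same (made-up) analysis code
    version_1 = "def estimate(x):\n    return x * 2\n"
    version_2 = "def estimate(x):\n    return x * 2.0\n"

    figure_1 = provenance_record(version_1, "figure_1_estimate", 4.0)
    table_2 = provenance_record(version_2, "table_2_estimate", 4.0)

    print(json.dumps(figure_1, indent=2))
    print("Same code for both results:", same_code(figure_1, table_2))
```

Storing one such record next to every figure or table makes it possible to check later whether two results in the same paper came from the same version of the code.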

* Genetic epidemiology
* Note to the editors: might need to consider which style to use for citations used as nouns within sentences.
* You mention "table S3", and because there is no table S3 in your chapter, I assume that this is a table in citation #2. This raises the question of why you have a table numbered "S1", which causes the reader to be confused here.
* I think that it's fine to have an informal style, but using phrases like "spoiler alert" is distracting and does not improve readability.
* L. 133: typo: "fictuous" --> "fictitious"

* Code is…
* I don't think that the idea that code is paradata should be controversial, so good to state that clearly.
* It's a very good point that software is commonly not made available. Reproducibility/replicability and open science is still in a sad state in many cases, as indicated.
* How unique is the "code vs. description" aspect of paradata in this case compared with other areas? There is always the case that we deal with descriptions of processes, rather than the processes themselves. The distinction is more that, in the case of software-enabled activities, we have code that (more or less) completely describes the process, namely the algorithms. In other cases, we don't (unless there happens to be something like video of the processes, which would be highly unusual). So, I think there is an important point here.
* L. 225: Indicates work by the author, but the author of this paper is not the author of the reference cited.

* There are multiple…
* This is a good summary
* §4.1, ¶2: repeats previous points.
* Lines 278-286: This is true, but I'm not sure what this contributes to the chapter.
* L. 297: spell out "FAIR"?
* §4.2:
* might also mention Zenodo.org, supported by CERN.
* Sites like GitHub provide metadata: This is true. So, we have metadata about paradata. In fact, features such as continuous integration scripts and other tools that support the process of software development mean that the code (paradata) has its own paradata!
* L. 351: broken figure reference
* L. 378 onward: Besides the "decay" of software over time, there is also the issue that a particular publication needs to be connected to a particular version of software in a repository. If software is changing over time, publications at different points in time will likely connect to different versions of the software. As a result, understanding the results across multiple publications, even if coming from a single lab, could require careful consideration to understand if they are comparable (or if the changes in software render them incomparable). Unfortunately, this may even be true about different pieces of the results in a single paper. (This is an amplification of the good points in the paragraph starting on line 390).
* §4.3
* Not clear what the "access rights" comment means on line 417.
* Containers like Docker and Singularity alleviate some problems, but introduce others. First of all, they require users to learn new tools they may be unfamiliar with. Second, they may require installation of new software to run them. Third, they themselves are third-party software that can introduce unexpected dependencies, weaknesses, errors, and software decay. Containers can be helpful, but they are also software that can have their own execution challenges.
* L. 450-453: very good point.

* Discussion
* Why both conclusions and discussion?
* This section starts with a sentence that appears to undercut the whole theme of this chapter!
* Frankly, I'm not sure what to make of the first paragraph of this section. What is its point?
* The next few paragraphs appear to summarize the chapter, which can be good but is not really discussion.
* Good last paragraph in this section!
* Table 1 is not referenced anywhere in the paper.

Citations
1. L. Pouchard, S. Baldwin, T. Elsethagen, S. Jha, B. Raju, E. Stephan, L. Tang, and K.K. Van Dam, "Computational reproducibility of scientific workflows at extreme scales", Int. J. High Perf. Comp. Appl., vol. 33, pp. 763–777, 2019.
2. J. Conquest and M. Stiber, "Software and Data Provenance as a Basis for eScience Workflow", IEEE eScience, September 2021.

Review 2
Is the chapter comprehensible for a reader not coming from the specific domain the chapter is discussing?
Bilderbeek's chapter does a nice job of introducing domain knowledge from genetic epidemiology in a way that is comprehensible to the lay reader.

Is the chapter clear in how it ties to the theme and topic of the volume?
Yes, the author makes a clear declaration at the outset of the paper about their understanding of the term paradata [suggestions about how to hone this understanding are included below].

Can you see obvious links to other chapters you think the authors could highlight more?
Yes, there are several papers that speak to the idea that algorithms (and parts thereof) can be considered paradata. These threads of 'code as paradata' can be drawn out in the introduction to the volume and within these chapters themselves. Where these chapters differ is in the understanding of whether it is just code, or code and additional process data, that counts as paradata. The difference may relate to the overarching concept of interest. Viewing paradata in terms of something that allows for trust to be established in the primary data of interest (the research results of a computational experiment) homes in on code as the central component, whereas looking at the idea of holding people accountable for computational systems means that code is just one part of a broader ecosystem of paradata that can fulfill that function.

Other comments
I would suggest reworking the paper title. There are some minor disfluencies with the English language to be rectified here and this is one of them. An elaborated definition of trust/trustworthiness should also be included as that notion differs across disciplines.

Introduction
In the opening paragraph the example of the two researchers and their output would carry more weight if the scenario was fleshed out a little more - e.g., discipline, etc.
Figure 1 doesn't quite succeed in representing the understanding of paradata as deployed in the text. E.g., the notion that paradata describes a process that feeds into the collection of data comes across clearly in the text but not in Figure 1. The same goes for the notion that paradata describes the way of collecting primary data. I'm also not sure I understand the difference between paradata as a description of a process and as a way of data collection, so this could be clarified here. How metadata factors in here also needs to be briefly outlined in the introduction [I know that is covered later in the paper, but some foreshadowing would be useful here].

Genetic Epidemiology
There needs to be more of an introduction to the genetic epidemiology study - i.e. a short description of what the author will achieve in this key section of the chapter.

Code is More Important than the Paper
I buy the idea that the scientific article is metadata, and I think the chapter raises an interesting issue about how and where credit is doled out in the research enterprise [is it appropriate, for example, that most credit goes to researchers for metadata about the research?]. Going back to figure 1, I also think that the metadata element is represented in a way that seems too dislocated from the rest of the research process.
The tie to knowledge management issues is a welcome addition to this section but, given the focus of this edited volume, could be nodded to earlier in the chapter [at least alluded to in the abstract?].

There are Multiple Ways to Publish Code and Sensitive Data
These sections did a great job of taking the basic argument and extending it to its logical conclusion. I learned a lot from this section and appreciated the effort involved to make rather complex ideas easier for the layperson to understand. In this section, the idea of linking these publication approaches to FAIR data needs further fleshing out. It makes sense to link the two together but in the narrative the introduction of the concept of FAIR data seems rushed (comes a little out of left field and is not fully explained).
There is a figure number missing on page 15.

Conclusion and Discussion
The conclusion succinctly recaps the main findings of the chapter. The discussion section is key as it brings the reader back to the main argument of code being paradata. Once the author has had the opportunity to read other completed essays from this volume, the conjoined force of what has been learned will help round out the ending, particularly in terms of linking paradata to data processes.
Finally, the nod to knowledge management at the end (that KM should create the infrastructure for the preservation of runnable experiments) comes somewhat out of left field. I think this potential KM contribution should be signaled much earlier in the paper.

Adding the GitHub link to the article and metadata shows the author's investment in data accessibility.
