Laura Ward, João Bauto, Hugo Cachitas (Champalimaud) @ Ready for BioData Management
On July 2nd we got together for a hands-on experience in Data Management Plans (DMP). The event started with some presentations, and then participants had the chance to put their heads together to create a DMP for a project. During lunch we went around asking participants for their opinions about data management and the working session they had just participated in.
We talked to Laura Ward, João Bauto and Hugo Cachitas from the Champalimaud Foundation about the perspective from software platforms, long-term usability and the difference between human and animal data.
Bruno Costa: So what did you think of this experience?
Laura Ward: I think it's useful because you had the research, you had software platform stuff, and Pedro to kind of coordinate everything. It's kind of interesting to see different people's opinions. And it was very confusing. We still have no solutions now but we realized-
João Bauto: You have way too many options.
LW: A much broader overview of -
JB: The kind of detail that you probably have to get into. It's something that we, at least those of us from the software platform, are not aware of. So it's something that, in connection with the people who do the management, might help us do it. But in the long term I think it will be difficult to apply this to most of the grants.
BC: But you can go through a protocol to see which kind of areas you should try to focus on.
LW: It's good to identify the bare minimum, for sure. Because this level was really detailed and I think this will be really difficult to implement. Unless there's somebody whose job that is, right? Because getting a P.I. to go through that level of detail is nigh on impossible.
BC: Technically it should be like a one or two day experience where you go through this with all of the people involved in the project. And we go through something like an exercise like this, we go through all of the steps and try to get the bare minimum for the project and plan ahead what would happen.
LW: If it is for a project then I think it could be part of the annual meetings. It just has to be done at the application stage, before then it's quite a big commitment. And obviously as they get further and you do them more often, it will become more routine.
BC: I guess it would work maybe not as you’re applying for the grant but once the grant is given. I think this could be the first step.
LW: Yeah, maybe even before it has started. So once you get the funding there's kind of a little bit of dead time in between, isn't there? That would be a good time to do this kind of exercise. Because people won't necessarily be totally ingrained in the experiments. And they're also supposed to guide you on how to do your experiments! So you don't do it too late, because everything will already have been set up, probably incorrectly.
BC: So was this workshop useful?
LW: Yes, it was really useful to me. I already tried to write a DMP, really basic, because I didn't have much idea of what I was doing, so yes really useful.
BC: What is your name? I didn't catch your name.
JB: I'm João, I work on software platforms and I basically do data curation and data storage, so I'm more focused on that. For me, I'm more interested in trying to get some kind of standards on how to save data or how to share it than in the management of the project itself. So we have centralized data storage at Champalimaud and we're still trying to figure out how to process everything. What should we save. How it should be saved. And in terms of long-term solutions when we have to save data, how it should be done and implemented.
BC: For software there should be different kinds of requirements.
Hugo Cachitas: Yeah. So human data is pretty different from what we are most used to working with, which is animal data. With animals we don't have to care about anonymizing the data.
LW: Because at Champalimaud it is sort of slightly separated, isn't it? But they still don't have anybody specifically allocated to the management.
HC: Yeah, but we have to provide access to computational resources for both animal and human data. It's kind of important for us to do the separation of - making sure that the data that we are running cannot be traced back to a specific person. So it's important for us to make sure that we don't have any data that is not anonymous.
BC: What is your name?
HC: I'm Hugo and I also work at a software platform, developing specific databases on a request basis. I was recently asked to check on a DMP for a researcher that was submitting it for a grant. I don't know how it went down but- So when I help them and they have a simple request sometimes, they want to sort their data in a s-
BC: But once you're preparing the structure of the database, you always have to think of how this should be implemented and how the data should be stored, what kind of tables. This is just another step to consider when making the management plan.
HC: Yeah. Yeah.
BC: How would this differ from the process of constructing the architecture of the data model and the start of data?
HC: I never looked into it from the DMP perspective. More like the usability of it. But more for the work they do, not to share it afterwards. Although that is important and I think we're kind of converging to that afterwards.
BC: Otherwise they're just going to be data silos.
HC: Which makes no sense.
LW: We talked a bit about whose responsibility it is of sharing, as well, right?
HC: But it's more about the data formats, I mean, because we are not following any convention for now.
BC: You can convert that on the fly, I guess. That's not necessarily the biggest problem. Just ensuring that beforehand you know how data is going to be processed and that you know who can have access to it. What kind of data, how it’s planned in case something is lost, if it's replicable...
HC: For now, it's just a matter of having the fields they need to conduct their research, and the access is on a group level. So it's very confined at this stage. But then, yeah, all of these steps that the DMP requires, we also need to think about them for this first level. Let's say we take care of the backups: where to host the database, if it's internal / private, if it has patient data, if it can go to Amazon, or something more public. Or whether we have no control over the machines. And so all of that we discuss beforehand.
BC: So implicitly there's a kind of- that's a data management plan! But without a formal structure.
BC: Do you think this workshop is going to help change a bit the process by which this is done? You see the need to refine some of the aspects of your current-
HC: The process of it I'm not sure. But at least it was really helpful to identify a lot of the problems that arise when you are trying to prepare a project like this. So we actually spent- while some were doing the exercise, others were discussing the points that were arising, and what they usually do, and what is done in terms of other countries’ policies. Because there aren't really any shared conventions between the USA, Europe, etc.
LW: It's quite nice not to have the pressure of having to- like write this DMP now, it’s due next week. And actually just have the time to think about it instead. Which is kind of the whole point of the DMP in the first place, isn't it? To make people kind of consider-
BC: If you do it beforehand, you don't have the pressure of having to fit everything somewhere just to get it done. You can actually plan how it should accurately be done, and anticipate all of the issues that can arise.
HC: Yeah but, as we discussed, it is really hard to anticipate most of the points that the DMP requests because it changes. It can change so much.
BC: But as long as you think of some of these issues that can arise, I think you can plan for them. And when they actually happen, you can see-
HC: Yes, at least we have a plan, an initial plan. And then you can know how much you are deviating from it. It's good to keep track of it. But I think the value of the DMP is to replicate scientific results. Besides a publication you also have the (BC: usability) yeah, and the track of all the data processing. But I think the scientific community doesn't value that enough. So it's a lot easier to publish new results than the results that are revisiting old stuff. And so I don't know if, even if you have a common or a structured database with standards across countries and the research units, I don't know if people are really going to give it the use it deserves. Because then there's no outcome. Or at least in the short-term.
BC: I think in the long-term it will enable interoperability. It will enable usability of the data. Because if you have a detailed provenance of all of the data and the descriptive process of how the data was generated, people can reuse it. And it provides a broader usability of the data, because you can aggregate data from different sources.
HC: In that sense, yes. You can complement your datasets with datasets from other people and it can increase your dataset a lot, if it fits. It also needs to be-
LW: It's the quality assurance thing that is the most difficult, right? And that's part of the data management plan and we've talked about it a bit. But it's really difficult to know how to do that. How do you know that somebody else's data is-
HC: Jorge was saying that there is this repository where everyone can put things in, but they're not curated. So when you go there and take something out, most of the time it's either incomplete or it has some errors in the middle. So if you take the time, you can select the data that really helps you. But most of the time you can just throw it out. So there are a lot of steps still needed to really have this.
BC: You need a checklist to ensure that the data that is being deposited meets the minimal amount of those criteria.
HC: We also talked about that. You cannot do that on the management side of it. Whoever hosts the data doesn't have the resources, usually, to have a team there to curate all of the data. So you need to enforce it when you upload. But then if you're a researcher and you go upload your data, if it is too hard or if it takes a lot of steps and the audit is never compliant with what it needs... Unless you are forced by the funding agency, you're not trained to do it.
BC: My only solution to that is to provide a score for the data you uploaded. Based on that score you can rank it. That could entice people to get a better score for their data.
HC: Depends on how you score it. In a movie database, sometimes a good score doesn't mean the movie is good so.
LW: Doesn't mean that the data is interesting!
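As an illustration of the checklist-at-upload and scoring ideas raised in the last few exchanges, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the field names in `REQUIRED_FIELDS`, the `DepositReport` class and the `check_deposit` function are hypothetical and do not come from any repository or tool mentioned in the interview.

```python
from dataclasses import dataclass

# Hypothetical minimal metadata checklist; these field names are assumptions.
REQUIRED_FIELDS = ("title", "creator", "license", "data_format", "description")


@dataclass
class DepositReport:
    missing: list   # checklist fields that are absent or empty
    score: float    # fraction of checklist fields filled in, from 0.0 to 1.0

    @property
    def accepted(self) -> bool:
        # A deposit passes only when nothing on the checklist is missing.
        return not self.missing


def check_deposit(metadata: dict) -> DepositReport:
    """Check a deposit's metadata against the checklist at upload time."""
    missing = [
        name for name in REQUIRED_FIELDS
        if not str(metadata.get(name) or "").strip()
    ]
    filled = len(REQUIRED_FIELDS) - len(missing)
    return DepositReport(missing=missing, score=filled / len(REQUIRED_FIELDS))


if __name__ == "__main__":
    report = check_deposit({"title": "Mouse imaging run", "creator": "Lab A"})
    print(report.accepted, report.score, report.missing)
```

A score like this only measures how complete the metadata is. As the last two remarks point out, it says nothing about whether the data itself is correct or interesting.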