"Once you register to download the CheXpert dataset, you will receive a link to the download over email. Note that you may not share the link to download the dataset with others."
Volunteers are already collecting data from the original sources. It is better to wait for that effort than to go through the papers ourselves; we can get far more data that way.
Many individual datasets carry no risk of duplicate images. But with many researchers independently assembling datasets, duplicates across datasets eventually become likely.
We can either:
Only work with fundamental datasets, not any aggregators
Attempt to de-duplicate images
Let's try to stick to the first option: any dataset we use should disclose its sources. Eventually we will likely hit a dataset where we can't drill down to the originals. At that point we will need to assess how to de-duplicate and what compute resources that will require.
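As a first pass at assessing de-duplication, a minimal sketch using only the standard library: exact byte-level duplicates can be found cheaply by hashing each file. The function names here are illustrative, not from any existing codebase.

```python
import hashlib


def file_digest(path, chunk_size=1 << 20):
    """SHA-256 of a file's bytes, read in chunks so large images fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_exact_duplicates(paths):
    """Group paths whose bytes are identical; return only groups with duplicates."""
    seen = {}
    for p in paths:
        seen.setdefault(file_digest(p), []).append(p)
    return [group for group in seen.values() if len(group) > 1]
```

Note this only catches byte-identical copies. Images that were re-encoded, resized, or converted between formats would need perceptual hashing, which is where the real compute-cost question comes in.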
The submodule should probably contain folders for both converted and unconverted images. It should also hold the code that converted the JPEGs to DICOM.
This is because the DICOM format carries many metadata attributes that other formats lack, and those attributes have to be filled in somehow during conversion. The assumptions behind the chosen values should be documented.
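One way to document those assumptions is to keep them in the conversion code itself as a single table that can also be rendered into a README. A sketch: the attribute names below are real DICOM tags, but the default values are purely illustrative assumptions, not taken from any actual conversion.

```python
# Attributes a plain JPEG cannot supply; every default here is an
# assumption made at conversion time and must be documented.
CONVERSION_ASSUMPTIONS = {
    "Modality": "CR",                            # assumed: computed radiography
    "PhotometricInterpretation": "MONOCHROME2",  # assumed: 0 = black, typical for X-rays
    "BitsAllocated": 8,                          # assumed: 8-bit grayscale JPEG input
    "RescaleSlope": 1,                           # assumed: no pixel-value rescaling
    "RescaleIntercept": 0,
}


def describe_assumptions(assumptions):
    """Render the assumption table as text for a README in the submodule."""
    return "\n".join(f"{tag}: {value}" for tag, value in sorted(assumptions.items()))
```

Keeping this table next to the conversion script means anyone inspecting a converted image can see which attribute values were measured and which were assumed.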