I hereby summarise a collection of tips, learnt (the hard way) by my colleagues¹ and (marginally) by myself. Contributions are welcome!
I will take the example of an image database, but if what you collect is something else, say text files, the same tips apply. This begins with a curated list of good practices, then digs into lower-level aspects.
(todo)
rough copy paste:
Spreadsheets are very popular for storing and organizing collected data and associated covariates.
Forget artistic touches on your .xlsx, .ods, .wtf files. No hierarchical headers, no missing column names, no blank lines here and there. Encoding information with anything other than text (cell background colour, bold, etc.) is a crazy idea. Never do that. Use syntactic, meaningful and well-thought-out column names. Overall, keep things simple:
Lowercase is fine. I now tend to use only lowercase for code. Saving dozens of Shift key presses and muting one fountain of typos are quite convincing experiments to try at home. It’s a matter of taste.
id date location lat lon species cultivar tree branch position
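As a rough illustration of what "syntactic, meaningful" column names can look like in practice, here is a minimal base R sketch that tidies a few messy spreadsheet headers. The headers in raw_names are made up for the example, and the cleaning rules (lowercase, underscores, a placeholder for empty names) are one possible convention, not a rule imposed by MomX.

```r
# Hypothetical headers as they might come out of a hand-made spreadsheet
raw_names <- c("Sample ID", "Glass Colour", "Price ($)", "")

clean <- tolower(raw_names)               # lowercase only
clean <- gsub("[^a-z0-9]+", "_", clean)   # runs of anything else become "_"
clean <- gsub("^_+|_+$", "", clean)       # no leading or trailing "_"

# Missing column names get an explicit placeholder based on their position
empty <- clean == ""
clean[empty] <- paste0("x", which(empty))

clean
```

Fixing the names in the spreadsheet itself is still the better option; this only helps when the file is not yours to edit.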
Lacamp_AcMtp_001 Lacamp_AcMtp_001
Home_AcMtp_001 Pump_ Oiselette site_species_n :
#template_acq
site Home / Castle / Forest1 / Forest2 # pick one
collector c("sl", "vb")
tree 001
opening 1:5
height c(min=0, max=5, na="none")
i numeric
comment NA_character
eg: Home_mons_vb_1 Home_mons_vb_2 Home_mons_vl_3
Momacs proposes a template: photo -> name immediately when possible; unload -> validate
species c("mons" / "camp" / na="explicit") # or ok,
Site_Species_open
template status: allowed(“status”, “domesticated”) cépage: meet(is_character, matches(regex)) id: domain()
complete ID (4) NA: allowed/none etc.
end rough copy paste
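The template idea in the rough notes above can be sketched as a simple check of ids against allowed values. This is only a guess at the intent: the field order site_species_collector_i is assumed from the example ids in the notes, and the helper alt() is hypothetical, not part of any MomX package.

```r
# Allowed values, taken from the rough template above
sites      <- c("Home", "Castle", "Forest1", "Forest2")
species    <- c("mons", "camp")
collectors <- c("sl", "vb")

# Hypothetical helper: turn a vector of allowed values into "(a|b|c)"
alt <- function(x) paste0("(", paste(x, collapse = "|"), ")")

# Assumed id layout: site_species_collector_i
pattern <- paste0("^", alt(sites), "_", alt(species), "_",
                  alt(collectors), "_[0-9]+$")

ids <- c("Home_mons_vb_1", "Home_mons_vb_2", "Home_mons_vl_3")
ok  <- grepl(pattern, ids)

# ids that do not fit the template and deserve a second look
ids[!ok]
```

Here the third id is flagged because "vl" is not among the declared collectors, which is exactly the kind of typo such a check is meant to catch.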
id
Before starting a single study or the world's largest database, you will need to think about how to:
The three tasks actually reduce to a single one: finding what your id will be and how to structure it².
There is no definitive rule for the right id pattern, but it likely lies between these two extremes:
00001
chivas_whisky_0.75l
The first case would be the canonical view of what an id is, but it is not user-friendly at all.
The second case does not solve the problem, it avoids it. It sounds like a good idea at first, until you want to add stuff and reveal the foolishness of this idea³.
I think a good compromise is to reflect the granularity of measurements.
For instance:
chivas_75.jpg
is a better choice.
Indeed, whisky can be deduced from chivas, and a filename is no place for decimals or units. More importantly, if you have access to additional information, it is by far preferable to store it in an external file (eg a spreadsheet) than in the filenames.
So far, this example was so trivial that all of this appears pointless. But wait a bit: you get a mountain of dollars to run a longitudinal analysis of alcohol bottle shapes through both time and space, ie collect pictures across volumes, countries and years. Your id could be:
chivas_75_mex_1984
When taking pictures in alcoholism museums or scraping webpages, it is acceptable to type this manually, and when working with files you do not have to look up some bloody 6-digit number in the spreadsheet: you just have to read the file name.
Now you hire three post-docs to add covariates such as price, glass colour, tasting note, etc. These should be recorded in another file, the natural choice being a spreadsheet formatted this way:
id | name | volume | country | year | price | glass_colour | note |
---|---|---|---|---|---|---|---|
chivas_75_mex_1984 | Chivas | 0.75 | MEX | 1984 | 3.14 | brown | 4.2 |
You added much value to your database, but you did not touch a single bit in your image folder.
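A structured id also means the fields it encodes can be recovered without any regular expression. The following is a minimal base R sketch under two assumptions of mine: every id has exactly four underscore-separated fields (name, volume, country, year), and the volume is given in centilitres.

```r
ids <- c("chivas_75_mex_1984", "guiness_50_irl_1988")

# Split on the literal underscore; one row per id, one column per field
parts <- do.call(rbind, strsplit(ids, "_", fixed = TRUE))
stopifnot(ncol(parts) == 4)   # assumption: exactly four fields per id

fields <- data.frame(
  id      = ids,
  name    = parts[, 1],
  volume  = as.numeric(parts[, 2]) / 100,  # assumed centilitres -> litres
  country = toupper(parts[, 3]),
  year    = as.integer(parts[, 4]),
  stringsAsFactors = FALSE
)

fields
```

Such a table is handy for cross-checking the spreadsheet: if the year typed by a post-doc disagrees with the year in the id, one of them is wrong.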
For morphometrics collections, whether small and simple or gigantic and complex, I do not think an id has to be unique. For instance, the same id often refers to:
Even in the short term, it is certainly worse to repeat lines in your spreadsheet than to create a column in your tibble! (todo link to dedicated vignette)
So, again, I think the choice of id pattern should reflect the level at which measurements were taken.
That being said, you are a grown-up, so choose whatever you want and whatever fits your present (and possibly future) needs. Just think twice about it: it could save severe headaches later (and possibly sooner than imagined).
As I told you before, this is the easy part, since you are neither the first nor the last person who wants to join two tables. Have a look at ?dplyr::join.
Let’s exemplify this. I will first build two dummy objects, in the MomX way.
    library(Momit)
    library(Momocs2)
    library(dplyr)

    data_raw <- tibble(id  = c("chivas_75_mex_1984", "guiness_50_irl_1988"), # two individuals
                       coo = coo_list(list(coo_single(), coo_single())))    # two empty coo_single, but that's not our point here

    data_cov <- tibble(id    = c("guiness_50_irl_1988", "chivas_75_mex_1984"),
                       type  = c("beer", "whisky"),
                       price = c(1.1, 3.14),
                       note  = c(NA, 4.1))
This looks tedious here, but you will never do it this way. Your data_raw will likely be generated using Momit, and your data_cov will come from a read* call that brings your spreadsheet into R.
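To give an idea of what the read* step might look like, here is a small sketch using base R's read.csv on an inline CSV string. The inline text stands in for a file exported from your spreadsheet; with a real file you would pass its path instead of text =.

```r
# Inline stand-in for a CSV exported from the covariates spreadsheet
csv_text <- "id,type,price,note
guiness_50_irl_1988,beer,1.1,NA
chivas_75_mex_1984,whisky,3.14,4.1"

data_cov <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Quick sanity checks before any join: an id column, and no empty ids
stopifnot("id" %in% names(data_cov))
stopifnot(!any(is.na(data_cov$id) | data_cov$id == ""))

str(data_cov)
```

Other readers (for instance from readr or readxl) would do the same job; base R is used here only to keep the sketch dependency-free.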
Here are our tibbles:
    data_raw
    #> # A tibble: 2 x 2
    #>   id                                     coo
    #>   <chr>               <list<coo_single[,2]>>
    #> 1 chivas_75_mex_1984        <tibble [0 × 2]>
    #> 2 guiness_50_irl_1988       <tibble [0 × 2]>
    data_cov
    #> # A tibble: 2 x 4
    #>   id                  type   price  note
    #>   <chr>               <chr>  <dbl> <dbl>
    #> 1 guiness_50_irl_1988 beer    1.1   NA
    #> 2 chivas_75_mex_1984  whisky  3.14   4.1
Let’s join them now with dplyr::left_join.
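A minimal sketch of such a join is given here. To keep it runnable with dplyr alone, the coo column is left out of data_raw, so these objects are simplified stand-ins for the ones built above; the anti_join checks are my own suggestion, not a required part of the MomX workflow.

```r
library(dplyr)

# Simplified stand-ins: same ids as above, without the coo column
data_raw <- tibble(id = c("chivas_75_mex_1984", "guiness_50_irl_1988"))
data_cov <- tibble(id    = c("guiness_50_irl_1988", "chivas_75_mex_1984"),
                   type  = c("beer", "whisky"),
                   price = c(1.1, 3.14),
                   note  = c(NA, 4.1))

# Keep every row of data_raw, in its original order,
# and add the covariates whose id matches
df <- left_join(data_raw, data_cov, by = "id")

# ids present on one side only deserve a look before going further
no_cov   <- anti_join(data_raw, data_cov, by = "id")  # shapes without covariates
no_shape <- anti_join(data_cov, data_raw, by = "id")  # covariates without shapes

df
```

Note that the row order of the two tables does not matter: the match is done on id, which is the whole point of having one.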
Turning such a tibble into a <mom> can be achieved in reasonable time by a typing monkey, and you can now enter the MomX pipeline:
    df %>% mom()
    #> # A tibble: 2 x 5
    #>   id                                     coo type   price  note
    #>   <chr>               <list<coo_single[,2]>> <chr>  <dbl> <dbl>
    #> 1 chivas_75_mex_1984        <tibble [0 × 2]> whisky  3.14   4.1
    #> 2 guiness_50_irl_1988       <tibble [0 × 2]> beer    1.1   NA
    #> ❯mom_tbl
Your id should be minimal: it will save time when naming cannot be automated, it will save typos, and we do not need more anyway. But where to
Yes, that’s iris:
    head(iris)
    #>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
    #> 1          5.1         3.5          1.4         0.2  setosa
    #> 2          4.9         3.0          1.4         0.2  setosa
    #> 3          4.7         3.2          1.3         0.2  setosa
    #> 4          4.6         3.1          1.5         0.2  setosa
    #> 5          5.0         3.6          1.4         0.2  setosa
    #> 6          5.4         3.9          1.7         0.4  setosa
An id is a unique identifier across your database. Let’s take an example using the iris dataset:
    iris[c(1, 50:51, 100:101, 150), ]
    #>     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
    #> 1            5.1         3.5          1.4         0.2     setosa
    #> 50           5.0         3.3          1.4         0.2     setosa
    #> 51           7.0         3.2          4.7         1.4 versicolor
    #> 100          5.7         2.8          4.1         1.3 versicolor
    #> 101          6.3         3.3          6.0         2.5  virginica
    #> 150          5.9         3.0          5.1         1.8  virginica
So a good id could have been of the form: species_individual. For such trivial cases, building an id is straightforward:
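One possible way to build such a species_individual id for iris, in base R, is sketched below. The zero-padding to two digits and the name iris_id are my own choices; any scheme that numbers individuals within species would do.

```r
iris_id <- iris

# Number individuals within each species (1 to 50 for each of the three)
n <- ave(seq_len(nrow(iris_id)), iris_id$Species, FUN = seq_along)

# species_individual, zero-padded so that ids sort nicely
iris_id$id <- sprintf("%s_%02d", as.character(iris_id$Species), n)

iris_id[c(1, 50:51, 100:101, 150), c("id", "Species")]
```

A quick anyDuplicated(iris_id$id) returning 0 confirms that the resulting ids are unique across the table.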
Spreadsheets are very popular for storing and organizing collected data and associated covariates. Anything that can be read into R and contains an id for joining is fine.
1. Laurent Bouby, Allowen Evin, Sarah Ivorra
2. What I call id can also be called a primary key, but I like how compact id is (versus key) and still explicit (versus i, for instance)
3. I love regular expressions, and even more when I do not need them