I hereby summarise a collection of tips, learnt (the hard way) by my colleagues¹ and (marginally) by myself. Contributions are welcome!

I will take the example of an image database, but if what you collect is something else, say text files, the same tips apply. This begins with a curated list of good practices, then digs into lower-level aspects.

Good practices

(todo)

Raw data

All pictures should be in the same folder
Keep your raw data
Do not crop, resize, etc.
Repeat (at least some) pictures to measure error
Post-processing should be reproducible. If manual (thus error-prone) operations are needed (like artistic reconstruction of a missing part), the resulting error should be properly measured

Working data

All pictures and associated data should be in the same folder
Your analysis pipeline should be exhaustively reproducible

rough copy paste:

Spreadsheets are very popular to store and organize collected data and associated covariates.

Forget artistic touches in your .xlsx, .odt, .wtf files. No hierarchical headers, no missing column names, no blank lines here and there. Encoding information with anything other than text, such as cell background color, bold, etc., is a crazy idea. Never do that. Use syntactic, meaningful and well-thought-out column names. Overall, keep things simple:

Lowercase is fine. I tend to use only lowercase for code now. Saving dozens of Shift key presses and muting one fountain of typos are quite convincing experiments to try at home. It’s a matter of taste.

id date location lat lon species cultivar tree branch position
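As a minimal sketch of what syntactic, lowercase column names could look like in practice, the helper below normalises messy spreadsheet headers with base R only. The name clean_colnames is hypothetical (it is not part of MomX) and the example headers are invented.

```r
# Hypothetical helper (not from MomX): turn messy spreadsheet headers
# into lowercase, underscore-separated column names.
clean_colnames <- function(x) {
  x <- tolower(trimws(x))             # lowercase, drop surrounding spaces
  x <- gsub("[^a-z0-9]+", "_", x)     # any run of other characters -> "_"
  x <- gsub("^_+|_+$", "", x)         # no leading/trailing underscores
  make.unique(x, sep = "_")           # disambiguate accidental duplicates
}

# invented headers, as they might come out of a hand-made spreadsheet
messy <- c("ID", " Glass Colour", "Price (USD)", "Lat.", "lat")
clean_colnames(messy)
# gives: "id", "glass_colour", "price_usd", "lat", "lat_1"
```

Note that this sketch replaces accented letters with underscores and does not handle names starting with a digit; fixing the header in the spreadsheet itself is probably still the better option.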

Lacamp_AcMtp_001 Lacamp_AcMtp_001

Home_AcMtp_001 Pump_ Oiselette site_species_n :

#template_acq
    site        Home / Castle / Forest1 / Forest2    # pick one
    collector   c("sl", "vb")
    tree        001
    opening     1:5
    height      c(min=0, max=5, na="none")
    i           numeric
    comment     NA_character

For file names: aim for the minimal scheme, ie only things that cannot be deduced by any means other than manual typing/picking.

eg: Home_mons_vb_1 Home_mons_vb_2 Home_mons_vl_3

momacs propose template photo -> name immediately when possible decharge -> validate

species c(“mons” / “camp” / na=“explicit”) # or ok,

Site_Species_open

template status: allowed(“status”, “domesticated”) cépage: meet(is_character, matches(regex)) id: domain()

complete ID (4) NA: allowed/none etc.

end rough copy paste

Choosing an `id`

Setting the problem

Before starting a single study or the world’s largest database, you will need to think about how to:

name individual images
store additional information
join the two

The three tasks actually reduce to a single one: finding what will be your id and how to structure it².

Choosing an id pattern

There is no definitive rule for what the right id pattern is, but it likely lies between these two extremes:

Using a single number (eg 00001)
Encoding everything in filenames (eg chivas_whisky_0.75l)

The first case would be the canonical view of what an id is, but it is not user-friendly at all.

The second case does not solve the problem, it avoids it. It sounds like a good idea at first, until you want to add stuff and the foolishness of the idea is revealed³.

I think a good compromise is to reflect the granularity of measurements.

For instance:

chivas_75.jpg

is a better choice.

Indeed, whisky can be deduced from chivas, and a filename is no place for decimals or units. More importantly, if you have additional information, it is by far preferable to store it in an external file (eg a spreadsheet) than in the filenames.

A more realistic case

So far, this example has been so trivial that all of this may appear pointless. But wait a bit: you got a mountain of dollars to make a longitudinal analysis of alcohol bottle shapes through both time and space, ie to collect pictures across different volumes, countries and years. Your id could be:

chivas_75_mex_1984

When taking pictures in alcohol museums or scraping webpages, it is acceptable to type this manually, and when working with files you do not have to look up some bloody 6-digit number in the spreadsheet: you just have to read the file name.

Now you hire three post-docs to add covariates such as price, glass colour, tasting note, etc. These should be recorded in another file, the natural choice being a spreadsheet formatted this way:

id	name	volume	country	year	price	glass_colour	note
chivas_75_mex_1984	Chivas	0.75	MEX	1984	3.14	brown	4.2

You added much value to your database, but you did not touch a single bit in your image folder.
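Because such an id follows a fixed four-field pattern, its fields can be recovered without regular expressions, for instance to cross-check them against the spreadsheet. A minimal sketch in base R; the column names brand and volume_cl, and the reading of 75 as centilitres, are assumptions made for this example.

```r
ids <- c("chivas_75_mex_1984", "guiness_50_irl_1988")

# split each id on "_"
pieces <- strsplit(ids, "_", fixed = TRUE)

# every id must have exactly four fields for this scheme to hold
stopifnot(all(lengths(pieces) == 4))

# bind the pieces into a character matrix, one row per id
parts <- do.call(rbind, pieces)

id_fields <- data.frame(
  id        = ids,
  brand     = parts[, 1],
  volume_cl = as.integer(parts[, 2]),   # assumed unit: centilitres
  country   = parts[, 3],
  year      = as.integer(parts[, 4]),
  stringsAsFactors = FALSE
)
id_fields
```

Fields recovered this way can then be compared with the corresponding spreadsheet columns to spot typing mistakes on either side.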

An id does not have to be unique

For morphometric collections, whether small and simple or gigantic and complex, I do not think an id has to be unique. For instance, the same id often refers to:

several images (eg two orthogonal views)
data of different type (eg images and txt file for landmarks)
repeated measurements of the same object (eg when measuring user error)
different sub objects (eg pips of the same grape cultivar for which we have additional information)

Even in the short term, it’s certainly worse to repeat lines in your spreadsheet than to create a column in your tibble! (todo link to dedicated vignette)

So, again, I think the choice of id pattern should reflect the level at which measurements were taken.
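To illustrate why a repeated id is harmless when joining: if two views of the same bottle share one id, the covariates are simply repeated for each view, and the spreadsheet keeps one line per id. A minimal sketch with dplyr; the view column and its values are invented for the example.

```r
library(dplyr)

# two views of the same bottle share one id (toy data)
views <- tibble(
  id   = c("chivas_75_mex_1984", "chivas_75_mex_1984", "guiness_50_irl_1988"),
  view = c("front", "side", "front")
)

# one line per id in the covariates
cov <- tibble(
  id   = c("chivas_75_mex_1984", "guiness_50_irl_1988"),
  type = c("whisky", "beer")
)

# many-to-one join: each view picks up the covariates of its id
joined <- left_join(views, cov, by = "id")
joined
```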

That being said, you are a grown-up, so choose whatever you want and whatever fits your present (and possibly future) needs. Just think twice about it: it could save severe headaches later (and possibly sooner than imagined).

Join the two

As I told you before, this is the easy part, since you are neither the first nor the last person who wants to join two tables. Have a look at ?dplyr::join.

Let’s exemplify this. I will first build two dummy objects, in the MomX way.

library(Momit)
library(Momocs2)
library(dplyr)

data_raw <- tibble(id   = c("chivas_75_mex_1984", "guiness_50_irl_1988"),    # two individuals
                   coo  = coo_list(list(coo_single(), coo_single())))        # two empty coo_single, but that's not our point here

data_cov <- tibble(id    = c("guiness_50_irl_1988", "chivas_75_mex_1984"),
                   type  = c("beer", "whisky"),
                   price = c(1.1, 3.14),
                   note  = c(NA, 4.1)
                   )

This appears boring here, but you will never build them this way. Your data_raw will likely be generated using Momit, and your data_cov will come from a read* call that brings your spreadsheet into R.

Here are our tibbles:

data_raw
#> # A tibble: 2 x 2
#>   id                  coo                   
#>   <chr>               <list<coo_single[,2]>>
#> 1 chivas_75_mex_1984  <tibble [0 × 2]>      
#> 2 guiness_50_irl_1988 <tibble [0 × 2]>
data_cov
#> # A tibble: 2 x 4
#>   id                  type   price  note
#>   <chr>               <chr>  <dbl> <dbl>
#> 1 guiness_50_irl_1988 beer    1.1   NA  
#> 2 chivas_75_mex_1984  whisky  3.14   4.1

Let’s join them now with dplyr::left_join.

library(dplyr)
df <- left_join(data_raw, data_cov, by="id")
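Before trusting a join, it may be worth checking that ids match on both sides: left_join() fills covariates with NA, without complaining, when an id is missing from the second table. A possible check with dplyr::anti_join(), on toy tibbles whose names and ids are invented for the example:

```r
library(dplyr)

# toy tables (invented for this example)
shapes <- tibble(id = c("chivas_75_mex_1984",
                        "guiness_50_irl_1988",
                        "jameson_70_irl_1990"))
covs   <- tibble(id    = c("guiness_50_irl_1988", "chivas_75_mexico_1984"),
                 price = c(1.1, 3.14))

# ids with shapes but no covariates
no_cov <- anti_join(shapes, covs, by = "id")

# ids with covariates but no shapes (often typos)
no_shape <- anti_join(covs, shapes, by = "id")

no_cov$id
no_shape$id
```

Here the mistyped chivas_75_mexico_1984 shows up in both directions, which is usually a good hint that two ids were meant to be the same.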

Turning such a tibble into a <mom> can be achieved in a reasonable time by a typing monkey, and you can now enter the MomX pipeline:

df %>% mom()
#> # A tibble: 2 x 5
#>   id                  coo                    type   price  note
#>   <chr>               <list<coo_single[,2]>> <chr>  <dbl> <dbl>
#> 1 chivas_75_mex_1984  <tibble [0 × 2]>       whisky  3.14   4.1
#> 2 guiness_50_irl_1988 <tibble [0 × 2]>       beer    1.1   NA  
#> ❯mom_tbl

Your id should be minimal: it will save time when naming cannot be automated, it will save typos, and we do not need more anyway. But where to

Yes, that’s iris:

head(iris)
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1          5.1         3.5          1.4         0.2  setosa
#> 2          4.9         3.0          1.4         0.2  setosa
#> 3          4.7         3.2          1.3         0.2  setosa
#> 4          4.6         3.1          1.5         0.2  setosa
#> 5          5.0         3.6          1.4         0.2  setosa
#> 6          5.4         3.9          1.7         0.4  setosa

individual images will be named

An id is a unique identifier across your database. Let’s take an example using the iris dataset:

iris[c(1, 50:51, 100:101, 150), ]
#>     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
#> 1            5.1         3.5          1.4         0.2     setosa
#> 50           5.0         3.3          1.4         0.2     setosa
#> 51           7.0         3.2          4.7         1.4 versicolor
#> 100          5.7         2.8          4.1         1.3 versicolor
#> 101          6.3         3.3          6.0         2.5  virginica
#> 150          5.9         3.0          5.1         1.8  virginica

So a good id could have been of the form species_individual. For such trivial cases, building an id is straightforward:

library(dplyr)
iris %>%
  group_by(Species) %>%
  mutate(id=paste(Species, 1:n(), sep="_")) %>%
  # only print the newly created column 'id', and the first 5 individuals
  pull(id) %>% head(5)
#> [1] "setosa_1" "setosa_2" "setosa_3" "setosa_4" "setosa_5"
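A variant of the same idea, sketched in base R: zero-padding the counter (as in tree 001 of the acquisition template) gives ids of equal width within a species, so that they sort correctly as text. The three-digit width is an arbitrary choice here.

```r
species <- as.character(iris$Species)

# running counter within each species
n <- ave(seq_along(species), species, FUN = seq_along)

# species_individual, with a zero-padded, fixed-width counter
id <- sprintf("%s_%03d", species, n)

id[c(1, 51, 150)]
# gives: "setosa_001", "versicolor_001", "virginica_050"

anyDuplicated(id) == 0   # TRUE when every id is unique
```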

Good practices

Raw data (images) and covariates (external file)

Naming images

Names

library(Momit)

I begin with listing

Collecting data

Individual naming scheme

For file names: aim for the minimal scheme, ie only things that cannot be deduced by any means other than manual typing/picking.

eg:

Home_mons_vb_1
Home_mons_vb_2
Home_mons_vl_3
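Such a scheme can be checked mechanically. A minimal sketch with a regular expression, assuming names follow site_species_collector_number, with the sites and species mentioned in the notes on this page and any two lowercase letters as collector code; the two faulty names are made up.

```r
# assumed scheme: site_species_collector_number
pattern <- "^(Home|Castle|Forest1|Forest2)_(mons|camp)_[a-z]{2}_[0-9]+$"

files <- c("Home_mons_vb_1", "Home_mons_vl_3",   # follow the scheme
           "home_mons_vb_2",                      # wrong case for the site
           "Castle_camp_sl")                      # missing number

# TRUE for names that match the whole pattern
ok <- grepl(pattern, files)

files[!ok]   # names to fix before going further
# gives: "home_mons_vb_2", "Castle_camp_sl"
```

Running such a check right after unloading the camera is cheap, and catches typos while the objects are still at hand.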

Acquisition template

template_acq
    site        Home / Castle / Forest1 / Forest2    # pick one
    collector   c("sl", "vb")
    tree        001
    opening     1:5
    height      c(min=0, max=5, na="none")
    i           numeric
    comment     NA_character
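The template above is pseudocode. One possible way to make part of it executable, sketched in base R with a plain list of allowed values; only three of the fields are covered, and the acq table is invented for the example.

```r
# allowed values, taken from three fields of the template
template_acq <- list(
  site      = c("Home", "Castle", "Forest1", "Forest2"),
  collector = c("sl", "vb"),
  opening   = 1:5
)

# invented acquisition records; the third one is deliberately wrong
acq <- data.frame(
  site      = c("Home", "Castle", "Garden"),
  collector = c("sl", "vb", "vb"),
  opening   = c(1L, 5L, 7L),
  stringsAsFactors = FALSE
)

# one logical column per field: is each value allowed by the template?
ok <- vapply(
  names(template_acq),
  function(col) acq[[col]] %in% template_acq[[col]],
  logical(nrow(acq))
)

# records with at least one value outside the template
bad <- rowSums(!ok) > 0
acq[bad, ]
```

The ok matrix also tells which field is at fault: here the third record fails on site and opening, but not on collector.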

momacs propose template photo -> name immediately when possible decharge -> validate

Covariates

Spreadsheets are very popular to store and organize collected data and associated covariates. Anything that can be read and contains an id for joining is fine.

Lowercase is fine. I tend to use only lowercase for code now. Saving dozens of Shift key presses and muting one fountain of typos are quite convincing experiments to try at home. It’s a matter of taste.

Joining

dplyr::join

Validating data

todo digest glimpse View

Backup and isolate your data

Adding new data

todo

¹ Laurent Bouby, Allowen Evin, Sarah Ivorra
² What I call id can also be called a primary key, but I like how compact id is (versus key) and still explicit (versus i for instance)
³ I love regular expressions, and even more when I do not need them
