Chances are that whenever multiple references are gathered, some of them are duplicates. Citavi acknowledges that “it is almost inevitable that a reference may appear two or more times in your Citavi project” and suggests eliminating duplicates since they “waste time and space and distract you from your work”. More specifically, Citavi offers a Show duplicates only button you may use to “compare the references and delete the duplicates you don’t need”.
This approach is fine. However, especially when working with large Citavi projects (> 1,000 references), e.g. when conducting a systematic review, the number of duplicates may become so large that a few issues arise:
One may point out that in their Citavi projects the number of duplicates was never really large, since they checked the
[ ] Don't add project duplicates
box during the import process. While this is indeed true for many, some groups like to keep all references, including duplicates, in their project at all times and simply move them to e.g. a duplicate category or group. This can be beneficial, e.g. in a systematic review, because one can always verify and compare against the total number of references found.
CitaviR distinguishes between two types of duplicates: obvious duplicates (identical clean_title) and potential duplicates (similar, but not identical, clean_title).
Shortly put, the suggested way of dealing with duplicates in CitaviR is therefore:
CitDat %>%
find_obvious_dups() %>% # 1. identify obvious duplicates
handle_obvious_dups() %>% # 2. handle obvious duplicates
find_potential_dups() # 3. identify potential duplicates
In this vignette we revisit Step 3 of the example in the Get started vignette and go into more detail.
library(tidyverse)
library(CitaviR)
my_path <- example_file("3dupsin5refs.xlsx") # in real life: replace with path to your xlsx file
CitDat <- read_Citavi_xlsx(path = my_path)
CitaviR identifies obvious duplicates by first creating a clean_title for each reference (i.e. a simplified string consisting of Title and Year) and then comparing those between all references. Obvious duplicates have identical clean_title.
CitDat <- CitDat %>%
find_obvious_dups()
The clean_title is basically Title and Year pasted together and processed by janitor::make_clean_names(). The latter e.g. converts the pasted string to all-lowercase and removes special characters and unnecessary spaces.
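The gist of that cleaning step can be approximated in base R (a sketch only: the package relies on janitor::make_clean_names(), whose exact rules may differ slightly, e.g. additional suffixes):

```r
# rough stand-in for janitor::make_clean_names(): lowercase everything,
# then replace runs of spaces and special characters with underscores
clean_title <- function(title, year) {
  s <- tolower(paste(title, year))
  s <- gsub("[^a-z0-9]+", "_", s)
  gsub("^_+|_+$", "", s)   # trim leading/trailing underscores
}

clean_title("Heritability in Plant Breeding on a Genotype-Difference Basis", 2019)
#> [1] "heritability_in_plant_breeding_on_a_genotype_difference_basis_2019"
```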
CitDat %>%
select(Title, clean_title)
#> # A tibble: 5 x 2
#> Title clean_title
#> <chr> <chr>
#> 1 Estimating broad-sense heritability ~ estimating_broad_sense_heritability_wit~
#> 2 Heritability in plant breeding on a ~ heritability_in_plant_breeding_on_a_gen~
#> 3 Heritability in Plant Breeding on a ~ heritability_in_plant_breeding_on_a_gen~
#> 4 Hritability in Plant Breeding on a G~ hritability_in_plant_breeding_on_a_geno~
#> 5 More, Larger, Simpler: How Comparabl~ more_larger_simpler_how_comparable_are_~
All references that have identical clean_title are taken as obvious duplicates. Three additional columns are created that allow for better handling in upcoming steps.
CitDat %>%
select(clean_title:obv_dup_id)
#> # A tibble: 5 x 4
#> clean_title clean_title_id has_obv_dup obv_dup_id
#> <chr> <chr> <lgl> <chr>
#> 1 estimating_broad_sense_heritability_wit~ ct_01 FALSE dup_01
#> 2 heritability_in_plant_breeding_on_a_gen~ ct_02 TRUE dup_01
#> 3 heritability_in_plant_breeding_on_a_gen~ ct_02 TRUE dup_02
#> 4 hritability_in_plant_breeding_on_a_geno~ ct_03 FALSE dup_01
#> 5 more_larger_simpler_how_comparable_are_~ ct_04 FALSE dup_01
If multiple references were found to have identical clean_title, their has_obv_dup is set to TRUE. Obvious duplicates share the same clean_title_id but have unique obv_dup_id. The “first” duplicate (i.e. obv_dup_id == dup_01) can be seen as the non-duplicate and thus the only version of a reference that will be investigated further after duplicate handling.
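The logic behind these three columns can be sketched with plain dplyr (an illustration only, not CitaviR’s actual code; the example clean_title values are made up):

```r
library(dplyr)

refs <- tibble(clean_title = c("title_a_2020", "title_b_2019", "title_b_2019", "title_c_2021"))

out <- refs %>%
  # one id per distinct clean_title, in order of first appearance
  mutate(clean_title_id = sprintf("ct_%02d", match(clean_title, unique(clean_title)))) %>%
  group_by(clean_title) %>%
  mutate(has_obv_dup = n() > 1,                               # more than one reference with this clean_title?
         obv_dup_id  = sprintf("dup_%02d", row_number())) %>% # running id within each clean_title
  ungroup()
```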
Note that a pair of obvious duplicates was identified (clean_title_id == ct_02). However, due to a single typo, ct_03 was not (yet) identified as a duplicate.
Further note that by default preferDupsWithPDF is set to TRUE. While the handling of varying information between duplicates is mostly done in the next step via handle_obvious_dups(), this is the only exception. When TRUE, the exported fields has_attachment and Locations are used to sort the references so that, in case of obvious duplicates being identified, dup_01 is always the reference with the most PDF attachments in the Citavi project. Thus, PDF attachment is the only attribute where CitaviR chooses the “better” duplicate, while all other attributes are merged (not chosen) in the next step.
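The sorting idea can be sketched like this (an illustration with made-up data; CitaviR works on the actual exported Citavi fields):

```r
library(dplyr)

# two obvious duplicates of the same reference; only the second has a PDF attached
refs <- tibble(
  clean_title    = c("some_title_2019", "some_title_2019"),
  has_attachment = c(FALSE, TRUE)
)

# sort so that, within each clean_title, the reference with a PDF comes first
# and will therefore receive obv_dup_id "dup_01"
out <- refs %>% arrange(clean_title, desc(has_attachment))
```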
CitaviR merges varying information between obvious duplicates so that the loss of information is reduced when ultimately getting rid of all except dup_01.
CitDat %>%
find_obvious_dups() %>%
handle_obvious_dups(fieldsToHandle = ...) # must be added
Sometimes duplicates hold different information, as is the case here for ct_02 and the columns PubMed ID, DOI name and Categories:
CitDat %>%
filter(clean_title_id == "ct_02") %>%
select(clean_title_id, obv_dup_id, `DOI name`, `PubMed ID`, Categories)
#> # A tibble: 2 x 5
#> clean_title_id obv_dup_id `DOI name` `PubMed ID` Categories
#> <chr> <chr> <chr> <chr> <chr>
#> 1 ct_02 dup_01 10.1534/genetics.119.302134 <NA> 1 catA
#> 2 ct_02 dup_02 <NA> 31248886 2 catB
In such a scenario it would be best to gather all information into dup_01. Depending on the type of information, CitaviR does this in two different ways:
- For fields such as DOI name, PubMed ID, Abstract etc., CitaviR currently simply fills up entries (tidyr::fill(all_of(fieldsToHandle), .direction = "up")).
- For fields such as Categories, Groups and Keywords, CitaviR collapses unique entries into the respective entry for dup_01, while entries for all other obvious duplicates are replaced by a provided string:
CitDat <- CitDat %>%
handle_obvious_dups(fieldsToHandle = c("DOI name", "PubMed ID"),
nameDupCategories = "3 duplicate")
CitDat %>%
filter(clean_title_id == "ct_02") %>%
select(clean_title_id, obv_dup_id, `DOI name`, `PubMed ID`, Categories)
#> # A tibble: 2 x 5
#> clean_title_id obv_dup_id `DOI name` `PubMed ID` Categories
#> <chr> <chr> <chr> <chr> <chr>
#> 1 ct_02 dup_01 10.1534/genetics.119.3021~ 31248886 1 catA; 2 ca~
#> 2 ct_02 dup_02 <NA> 31248886 3 duplicate
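Both merge mechanisms can be illustrated in isolation (a sketch with made-up column names; CitaviR applies them per clean_title_id group):

```r
library(dplyr)
library(tidyr)

# fill-up: dup_01 inherits missing field values from the duplicates below it
dups <- tibble(
  obv_dup_id = c("dup_01", "dup_02"),
  doi        = c("10.1534/genetics.119.302134", NA),
  pmid       = c(NA, "31248886")
)
merged <- dups %>% fill(doi, pmid, .direction = "up")

# collapse: unique category entries are pasted into one string for dup_01
cats      <- c("1 catA", "2 catB")
collapsed <- paste(unique(cats), collapse = "; ")
```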
Note that there is one exception: If Online address is included in the fieldsToHandle, it is not just filled up like the others. Instead, per clean_title_id, all URLs from Online address and Location are combined and ranked. As an example: CitaviR currently ranks URLs including “doi.org” as the best possible entry for Online address. Finally, the URL with the highest rank is set as the Online address for dup_01.
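The ranking idea might look like this (a simplified two-level ranking for illustration; CitaviR’s actual ranking rules may be more elaborate):

```r
# pick the "best" URL from all URLs found across a set of obvious duplicates,
# preferring doi.org links over everything else
pick_best_url <- function(urls) {
  urls <- urls[!is.na(urls) & urls != ""]
  rank <- ifelse(grepl("doi.org", urls, fixed = TRUE), 1L, 2L)
  urls[order(rank)][1]
}

pick_best_url(c("https://pubmed.ncbi.nlm.nih.gov/31248886/",
                "https://doi.org/10.1534/genetics.119.302134"))
#> [1] "https://doi.org/10.1534/genetics.119.302134"
```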
While having nothing to do with CitaviR, there is a custom Citavi macro created by István that takes a comparable approach to merging varying duplicate information (from duplicates identified in Citavi).
After obvious duplicates have been dealt with, potential duplicates can be identified among the remaining references as those that have similar, but not identical, clean_title.
CitDat %>%
find_obvious_dups() %>%
handle_obvious_dups(fieldsToHandle = ...) %>% # must be added
find_potential_dups()
For all remaining references (i.e. all dup_01), CitaviR identifies potential duplicates as those references whose clean_title are similar. Similarity is calculated based on the Levenshtein distance, and by default a similarity > 60% is considered relevant.
CitDat <- CitDat %>%
find_potential_dups()
#> clean_title comparisons = 6 < 1,000,000 = maxNumberOfComp
#> calculating similarity now...
#> calculating similarity done: 0 sec elapsed
CitDat %>%
select(clean_title_id, obv_dup_id, pot_dup_id)
#> # A tibble: 5 x 3
#> clean_title_id obv_dup_id pot_dup_id
#> <chr> <chr> <chr>
#> 1 ct_01 dup_01 <NA>
#> 2 ct_02 dup_01 potdup_01 (98.6% similarity)
#> 3 ct_02 dup_02 <NA>
#> 4 ct_03 dup_01 potdup_01 (98.6% similarity)
#> 5 ct_04 dup_01 <NA>
CitDat %>% slice(2, 4) %>% select(clean_title) # compare clean_title yourself:
#> # A tibble: 2 x 1
#> clean_title
#> <chr>
#> 1 heritability_in_plant_breeding_on_a_genotype_difference_basis_2019_end
#> 2 hritability_in_plant_breeding_on_a_genotype_difference_basis_2019_end
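The reported similarity can be reproduced with base R’s adist() (an assumption about the normalization; CitaviR’s exact formula is not shown here, but dividing by the longer string’s length matches the 98.6% above):

```r
a <- "heritability_in_plant_breeding_on_a_genotype_difference_basis_2019_end"
b <- "hritability_in_plant_breeding_on_a_genotype_difference_basis_2019_end"

d   <- adist(a, b)[1, 1]                # Levenshtein distance: 1 (one missing "e")
sim <- 1 - d / max(nchar(a), nchar(b))  # normalize by the longer string
round(sim * 100, 1)
#> [1] 98.6
```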
Note that while the typo in the Title prevented ct_03 from being identified as an obvious duplicate of ct_02, it is now identified as a potential duplicate.
TO DO: MENTION COMPUTATIONAL BURDEN WITH LARGE NUMBER OF REFERENCES
TO DO: HINT AT HOW TO DEAL WITH IT MACRO AND CITAVI
TO DO: LINK TO THIS VIGNETTE FROM GET STARTED VIGNETTE