Identify obvious duplicates based on title and year

find_obvious_dups(CitDat, dupInfoAfterID = TRUE, preferDupsWithPDF = TRUE)

Arguments

CitDat

A dataframe/tibble returned by read_Citavi_xlsx. The following columns must be present: ID, Title, Year.

dupInfoAfterID

If TRUE (default), the newly created columns clean_title, clean_title_id, has_obv_dup and obv_dup_id are moved right next to the ID column. Additionally, the ID column is moved to the first position.

preferDupsWithPDF

If TRUE (default), obvious duplicates are sorted by their info in the columns has_attachment and/or Locations (provided they are present in the dataset). After sorting, duplicates with the most occurrences of ".pdf" in Locations and a TRUE in has_attachment come first and will thus be chosen as dup_01.
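The sorting described for preferDupsWithPDF can be pictured with the base R sketch below. It is only an illustration under assumptions, not the package's actual code: the toy data frame dups, the helper vector n_pdf, and the choice to sort by has_attachment before the ".pdf" count are all hypothetical.

```r
# Hypothetical group of three obvious duplicates (toy data, not from Citavi).
dups <- data.frame(
  ID             = c("x", "y", "z"),
  has_attachment = c(FALSE, TRUE, TRUE),
  Locations      = c("https://example.org/a", "a.pdf", "a.pdf; b.pdf"),
  stringsAsFactors = FALSE
)

# Count literal occurrences of ".pdf" in each Locations entry.
# gregexpr() returns -1 when there is no match, so only positions > 0 count.
hits  <- gregexpr(".pdf", dups$Locations, fixed = TRUE)
n_pdf <- vapply(hits, function(m) sum(m > 0), integer(1))

# Assumed priority: records with an attachment first, then most PDFs.
# order() is stable, so ties keep their original order.
ord  <- order(-dups$has_attachment, -n_pdf)
dups <- dups[ord, ]

# The first record after sorting becomes dup_01, the next dup_02, and so on.
dups$obv_dup_id <- sprintf("dup_%02d", seq_len(nrow(dups)))
dups
```

In this toy group, record "z" (attachment plus two PDFs) ends up as dup_01, while the record without any attachment is ranked last.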

Value

A tibble containing four additional columns: clean_title, clean_title_id, has_obv_dup and obv_dup_id.
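To make the four columns more concrete, here is a minimal base R sketch of how such values could be derived from Title and Year. The normalisation rules, the grouping key and the toy data frame refs are assumptions for illustration; the function's real implementation may differ.

```r
# Hypothetical toy data with the three required columns.
refs <- data.frame(
  ID    = c("a", "b", "c", "d"),
  Title = c("Heritability in Plant Breeding", "A second paper",
            "Heritability in plant breeding!", "A Second Paper"),
  Year  = c("2019", "2020", "2019", "2021"),
  stringsAsFactors = FALSE
)

# Assumed cleaning: lower case, runs of non-alphanumerics become "_",
# leading/trailing underscores are trimmed.
clean <- tolower(refs$Title)
clean <- gsub("[^a-z0-9]+", "_", clean)
clean <- gsub("^_+|_+$", "", clean)
refs$clean_title <- clean

# Number the distinct clean titles in alphabetical order.
refs$clean_title_id <- sprintf("ct_%02d", match(clean, sort(unique(clean))))

# Records sharing both clean title and year are treated as obvious duplicates.
key <- paste(refs$clean_title, refs$Year, sep = "|")
refs$has_obv_dup <- as.vector(table(key)[key]) > 1

# Running number within each title/year group.
refs$obv_dup_id <- sprintf("dup_%02d", ave(seq_along(key), key, FUN = seq_along))
refs
```

Here records "a" and "c" share title and year, so both get has_obv_dup = TRUE with dup_01 and dup_02. Records "b" and "d" share a clean title but differ in year, so each stays a non-duplicate with dup_01.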

Details

[Maturing]
Currently, this only works for files that were generated while Citavi was set to "English", so that column names are "Short Title" etc.

Examples

example_path <- example_file("3dupsin5refs/3dupsin5refs.ctv6")

read_Citavi_ctv6(example_path) %>%
  find_obvious_dups() %>%
  dplyr::select(clean_title:obv_dup_id)
#> # A tibble: 5 x 4
#>   clean_title                              clean_title_id has_obv_dup obv_dup_id
#>   <chr>                                    <chr>          <lgl>       <chr>
#> 1 more_larger_simpler_how_comparable_are_~ ct_04          FALSE       dup_01
#> 2 heritability_in_plant_breeding_on_a_gen~ ct_02          TRUE        dup_01
#> 3 hritability_in_plant_breeding_on_a_geno~ ct_03          FALSE       dup_01
#> 4 heritability_in_plant_breeding_on_a_gen~ ct_02          TRUE        dup_02
#> 5 estimating_broad_sense_heritability_wit~ ct_01          FALSE       dup_01