Identify obvious duplicates based on title and year

find_obvious_dups(CitDat, dupInfoAfterID = TRUE, preferDupsWithPDF = TRUE)

Arguments

CitDat	A dataframe/tibble returned by `read_Citavi_xlsx`. The following columns must be present: `ID`, `Title`, `Year`.
dupInfoAfterID	If TRUE (default), the newly created columns `clean_title`, `clean_title_id`, `has_obv_dup` and `obv_dup_id` are moved right next to the `ID` column. Additionally, the `ID` column is moved to the first position.
preferDupsWithPDF	If TRUE (default), obvious duplicates are sorted by the columns `has_attachment` and/or `Locations` (provided they are present in the dataset). After sorting, duplicates with the most occurrences of `".pdf"` in `Locations` and a `TRUE` in `has_attachment` come first and are thus chosen as `dup_01`.
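
The effect of `preferDupsWithPDF` can be illustrated with a small base R sketch. This is not the package's implementation: the data frame `dups` and the helper vector `n_pdf` are hypothetical, and giving `has_attachment` priority over the `".pdf"` count is an assumption made only for this illustration.

```r
# Hypothetical group of three obvious duplicates (toy data, not from the package).
dups <- data.frame(
  ID             = c("x1", "x2", "x3"),
  has_attachment = c(FALSE, TRUE, TRUE),
  Locations      = c("", "paper.pdf", "paper.pdf; scan.pdf"),
  stringsAsFactors = FALSE
)

# Count the literal occurrences of ".pdf" in each Locations string.
n_pdf <- lengths(regmatches(
  dups$Locations,
  gregexpr(".pdf", dups$Locations, fixed = TRUE)
))

# Sort so that entries with an attachment come first and, among those,
# entries with more PDFs come first (assumed priority, see lead-in).
dups <- dups[order(-dups$has_attachment, -n_pdf), ]

# Number the sorted duplicates; the first row becomes dup_01.
dups$obv_dup_id <- sprintf("dup_%02d", seq_len(nrow(dups)))
```

In this toy group, the entry with two PDFs ends up as `dup_01` and the entry without any attachment comes last.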

Value

A tibble containing four additional columns: `clean_title`, `clean_title_id`, `has_obv_dup` and `obv_dup_id`.

Details

Currently this only works for files that were generated while Citavi's language was set to "English", so that the column names are "Short Title" etc.
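
To make the idea of an "obvious" duplicate concrete, here is a minimal base R sketch of matching on a cleaned title plus year. It is an assumption-laden illustration, not the function's actual code: the cleaning rules (lower-casing, replacing non-alphanumeric runs with `_`), the toy data frame `refs` and the helper `key` are all hypothetical.

```r
# Hypothetical toy data; only the columns ID, Title and Year are assumed.
refs <- data.frame(
  ID    = c("a", "b", "c", "d"),
  Title = c("Heritability in Plant Breeding",
            "Heritability in plant breeding!",
            "Hritability in plant breeding",
            "Some other paper"),
  Year  = c("2019", "2019", "2019", "2020"),
  stringsAsFactors = FALSE
)

# Normalise titles: lower case, non-alphanumeric runs become "_",
# leading/trailing "_" are trimmed (assumed cleaning rules).
clean <- tolower(refs$Title)
clean <- gsub("[^a-z0-9]+", "_", clean)
clean <- gsub("^_+|_+$", "", clean)
refs$clean_title <- clean

# Two references count as obvious duplicates when cleaned title and year match.
key <- paste(refs$clean_title, refs$Year, sep = "|")
refs$has_obv_dup <- duplicated(key) | duplicated(key, fromLast = TRUE)

# Running number within each title/year group.
refs$obv_dup_id <- sprintf("dup_%02d", ave(seq_along(key), key, FUN = seq_along))
```

As in the example output below, a title with a typo ("Hritability") is not caught by this kind of exact matching and therefore keeps `has_obv_dup = FALSE`.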

Examples

example_path <- example_file("3dupsin5refs/3dupsin5refs.ctv6")
read_Citavi_ctv6(example_path) %>%
   find_obvious_dups() %>%
   dplyr::select(clean_title:obv_dup_id)
#> # A tibble: 5 x 4
#>   clean_title                              clean_title_id has_obv_dup obv_dup_id
#>   <chr>                                    <chr>          <lgl>       <chr>     
#> 1 more_larger_simpler_how_comparable_are_~ ct_04          FALSE       dup_01    
#> 2 heritability_in_plant_breeding_on_a_gen~ ct_02          TRUE        dup_01    
#> 3 hritability_in_plant_breeding_on_a_geno~ ct_03          FALSE       dup_01    
#> 4 heritability_in_plant_breeding_on_a_gen~ ct_02          TRUE        dup_02    
#> 5 estimating_broad_sense_heritability_wit~ ct_01          FALSE       dup_01