Identify potential duplicates based on title and year

find_potential_dups(
  CitDat,
  minSimilarity = 0.6,
  potDupAfterObvDup = TRUE,
  maxNumberOfComp = 1e+06,
  quiet = FALSE
)

Arguments

CitDat	A dataframe/tibble returned by `find_obvious_dups` or `handle_obvious_dups`.
minSimilarity	Minimum similarity (between 0 and 1) that two clean_titles must reach to be flagged as potential duplicates. Default is 0.6. (TO DO)
potDupAfterObvDup	If TRUE (default), the newly created column `pot_dup_id` is moved right next to the `obv_dup_id` column.
maxNumberOfComp	Maximum number of clean_title similarity calculations to be made. The default is 1,000,000, which corresponds to about 1,414 distinct clean_titles, since n clean_titles require n(n-1)/2 pairwise comparisons. TO DO: Document while-loop.
quiet	If `TRUE`, all output will be suppressed.
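
The relation between `maxNumberOfComp` and the number of clean_titles can be checked with a little arithmetic. The sketch below assumes that every unordered pair of distinct clean_titles is compared exactly once, which is consistent with the "comparisons = 6" reported for the four distinct clean_titles in the example further down. The helper `n_comparisons` is hypothetical and not part of the package.

```r
# Hypothetical helper (not part of the package):
# number of unordered pairs among n distinct clean_titles, i.e. n * (n - 1) / 2
n_comparisons <- function(n) choose(n, 2)

n_comparisons(4)     # 6: four distinct clean_titles, as in the example
n_comparisons(1414)  # 998991: still below the default of 1e+06
n_comparisons(1415)  # 1000405: exceeds the default of 1e+06
```

Under that assumption, 1,414 is the largest number of distinct clean_titles whose pairwise comparisons stay within the default limit.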

Value

The input tibble with one additional column: `pot_dup_id`.

Details

Currently, this only works for files that were generated while Citavi's language was set to "English", so that the column names are "Short Title" etc.

Examples

example_path <- example_file("3dupsin5refs/3dupsin5refs.ctv6")
CitDat <- read_Citavi_ctv6(example_path) %>%
   find_obvious_dups() %>%
   find_potential_dups()
#> clean_title comparisons = 6 < 1,000,000 = maxNumberOfComp 
#>    calculating similarity now... 
#>    calculating similarity done: 0 sec elapsed 

CitDat %>%
   dplyr::select(clean_title_id, obv_dup_id, pot_dup_id)
#> # A tibble: 5 x 3
#>   clean_title_id obv_dup_id pot_dup_id                  
#>   <chr>          <chr>      <chr>                       
#> 1 ct_04          dup_01     NA                          
#> 2 ct_02          dup_01     potdup_01 (98.6% similarity)
#> 3 ct_03          dup_01     potdup_01 (98.6% similarity)
#> 4 ct_02          dup_02     NA                          
#> 5 ct_01          dup_01     NA                          

# check similarity yourself - it's a single typo:
CitDat %>%
   dplyr::select(clean_title)
#> # A tibble: 5 x 1
#>   clean_title                                                                   
#>   <chr>                                                                         
#> 1 more_larger_simpler_how_comparable_are_on_farm_and_on_station_trials_for_cult~
#> 2 heritability_in_plant_breeding_on_a_genotype_difference_basis_2019_end        
#> 3 hritability_in_plant_breeding_on_a_genotype_difference_basis_2019_end         
#> 4 heritability_in_plant_breeding_on_a_genotype_difference_basis_2019_end        
#> 5 estimating_broad_sense_heritability_with_unbalanced_data_from_agricultural_cu~
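
The reported 98.6% can be reproduced by hand. This page does not state which similarity measure `find_potential_dups()` uses, so the sketch below is only an assumption: it takes a Levenshtein edit distance (via base R's `utils::adist`) and normalises it by the length of the longer string, which happens to give the same figure for these two titles.

```r
a <- "heritability_in_plant_breeding_on_a_genotype_difference_basis_2019_end"
b <- "hritability_in_plant_breeding_on_a_genotype_difference_basis_2019_end"

# Levenshtein edit distance between the two clean_titles (one missing "e")
d <- drop(utils::adist(a, b))

# Assumed normalisation: 1 - distance / length of the longer string
sim <- 1 - d / max(nchar(a), nchar(b))

round(100 * sim, 1)
#> [1] 98.6
```

With a single edit in a title of this length, the similarity is far above the default `minSimilarity` of 0.6, so the pair is flagged.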