Identify potential duplicates based on title and year

find_potential_dups(
  CitDat,
  minSimilarity = 0.6,
  potDupAfterObvDup = TRUE,
  maxNumberOfComp = 1e+06,
  quiet = FALSE
)

Arguments

CitDat

A dataframe/tibble returned by find_obvious_dups or handle_obvious_dups.

minSimilarity

Minimum similarity (between 0 and 1) that two clean_titles must reach to be flagged as potential duplicates. Default is 0.6. (TO DO)

potDupAfterObvDup

If TRUE (default), the newly created column pot_dup_id is moved right next to the obv_dup_id column.

maxNumberOfComp

Maximum number of clean_title similarity calculations to be made. Default is 1e+06 (1,000,000), which corresponds to about 1,414 unique clean_titles compared pairwise. TO DO: Document while-loop.

quiet

If TRUE, all console output is suppressed. Default is FALSE.
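
The relation between maxNumberOfComp and the number of clean_titles can be checked with a little arithmetic. The sketch below assumes that every unordered pair of unique clean_titles is compared once, i.e. n * (n - 1) / 2 comparisons for n titles. This assumption is inferred from the figures on this page (6 comparisons for the 4 distinct clean_title_ids in the example, about 1,414 titles for 1,000,000 comparisons), not taken from the package source. Both helper functions are hypothetical and not part of the package.

```r
# Number of unordered pairs among n unique clean_titles: n * (n - 1) / 2
# (assumed comparison scheme, see lead-in)
n_comparisons <- function(n) choose(n, 2)

n_comparisons(4)     # 6, matches the message printed for the example file
n_comparisons(1414)  # 998991, just under the default limit of 1e+06
n_comparisons(1415)  # 1000405, just over it

# Largest n whose pair count stays within a given limit,
# obtained by solving n * (n - 1) / 2 <= limit for n
max_titles <- function(limit) floor((1 + sqrt(1 + 8 * limit)) / 2)

max_titles(1e6)      # 1414
```

Because the number of pairs grows quadratically, doubling the number of clean_titles roughly quadruples the number of comparisons, which is presumably why a cap is offered.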

Value

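The input tibble with one additional column, pot_dup_id. Rows flagged as potential duplicates share a label such as "potdup_01 (98.6% similarity)"; all other rows are NA.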

Details

[Maturing]
Currently, this only works for files that were generated while the Citavi language was set to "English", so that column names are "Short Title" etc.

Examples

example_path <- example_file("3dupsin5refs/3dupsin5refs.ctv6")

CitDat <- read_Citavi_ctv6(example_path) %>%
  find_obvious_dups() %>%
  find_potential_dups()
#> clean_title comparisons = 6 < 1,000,000 = maxNumberOfComp
#> calculating similarity now...
#> calculating similarity done: 0 sec elapsed

CitDat %>%
  dplyr::select(clean_title_id, obv_dup_id, pot_dup_id)
#> # A tibble: 5 x 3
#>   clean_title_id obv_dup_id pot_dup_id
#>   <chr>          <chr>      <chr>
#> 1 ct_04          dup_01     NA
#> 2 ct_02          dup_01     potdup_01 (98.6% similarity)
#> 3 ct_03          dup_01     potdup_01 (98.6% similarity)
#> 4 ct_02          dup_02     NA
#> 5 ct_01          dup_01     NA

# check similarity yourself - it's a single typo:
CitDat %>%
  dplyr::select(clean_title)
#> # A tibble: 5 x 1
#>   clean_title
#>   <chr>
#> 1 more_larger_simpler_how_comparable_are_on_farm_and_on_station_trials_for_cult~
#> 2 heritability_in_plant_breeding_on_a_genotype_difference_basis_2019_end
#> 3 hritability_in_plant_breeding_on_a_genotype_difference_basis_2019_end
#> 4 heritability_in_plant_breeding_on_a_genotype_difference_basis_2019_end
#> 5 estimating_broad_sense_heritability_with_unbalanced_data_from_agricultural_cu~
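
To get a feeling for the reported similarity, the two titles that differ by a single typo can be compared by hand. The sketch below uses a normalized Levenshtein similarity (1 minus the edit distance divided by the length of the longer string), computed with base R's utils::adist(). This metric is an assumption chosen for illustration and is not necessarily the one find_potential_dups() uses internally; title_similarity() is a hypothetical helper, not a package function.

```r
# Normalized edit-distance similarity between two strings.
# Assumption: this is an illustrative metric, not confirmed to be the package's.
title_similarity <- function(a, b) {
  d <- as.numeric(utils::adist(a, b))   # Levenshtein distance
  1 - d / max(nchar(a), nchar(b))
}

t1 <- "heritability_in_plant_breeding_on_a_genotype_difference_basis_2019_end"
t2 <- "hritability_in_plant_breeding_on_a_genotype_difference_basis_2019_end"

sim <- title_similarity(t1, t2)
round(100 * sim, 1)   # 98.6 (one deleted character in a 70-character title)
sim >= 0.6            # TRUE, so the pair would pass the default minSimilarity
```

Under this metric, a single-character typo in a long title barely lowers the similarity, so such pairs stay far above the default threshold of 0.6.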