R/find_potential_dups.R
find_potential_dups.Rd
Identify potential duplicates based on title and year
find_potential_dups(
  CitDat,
  minSimilarity = 0.6,
  potDupAfterObvDup = TRUE,
  maxNumberOfComp = 1e+06,
  quiet = FALSE
)
CitDat | A dataframe/tibble returned by |
---|---|
minSimilarity | Minimum similarity (between 0 and 1). Default is 0.6. (TO DO) |
potDupAfterObvDup | If TRUE (default), the newly created column |
maxNumberOfComp | Maximum number of clean_title similarity calculations to be made. The default is 1,000,000, which corresponds to about 1,414 clean_titles, because comparing n clean_titles pairwise takes n(n-1)/2 calculations. TO DO: Document while-loop. |
quiet | If |
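
The relationship between the number of clean_titles and the maxNumberOfComp limit can be checked with a few lines of R. This is only a sketch: it assumes every pair of distinct clean_titles is compared exactly once, which matches the "comparisons = 6" message for the four distinct clean_titles in the example. The helper names `n_comparisons` and `max_titles` are hypothetical and not part of the package.

```r
# Hypothetical helper: pairwise comparisons needed for n distinct clean_titles
# (assumes each unordered pair is compared once).
n_comparisons <- function(n) n * (n - 1) / 2

# Hypothetical helper: largest n whose pairwise comparisons stay within `limit`,
# obtained by solving n * (n - 1) / 2 <= limit for n.
max_titles <- function(limit) floor((1 + sqrt(1 + 8 * limit)) / 2)

n_comparisons(4)     # 6
n_comparisons(1414)  # 998991
n_comparisons(1415)  # 1000405
max_titles(1e6)      # 1414
```

Under this assumption, 1,414 clean_titles stay just below the default limit and 1,415 exceed it.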
A tibble containing one new column: pot_dup_id.
Currently this only works for files that were generated while Citavi
was set to "English" so that column names are "Short Title" etc.
example_path <- example_file("3dupsin5refs/3dupsin5refs.ctv6")

CitDat <- read_Citavi_ctv6(example_path) %>%
  find_obvious_dups() %>%
  find_potential_dups()
#> clean_title comparisons = 6 < 1,000,000 = maxNumberOfComp
#> calculating similarity now...
#> calculating similarity done: 0 sec elapsed

#> # A tibble: 5 x 3
#>   clean_title_id obv_dup_id pot_dup_id
#>   <chr>          <chr>      <chr>
#> 1 ct_04          dup_01     NA
#> 2 ct_02          dup_01     potdup_01 (98.6% similarity)
#> 3 ct_03          dup_01     potdup_01 (98.6% similarity)
#> 4 ct_02          dup_02     NA
#> 5 ct_01          dup_01     NA

#> # A tibble: 5 x 1
#>   clean_title
#>   <chr>
#> 1 more_larger_simpler_how_comparable_are_on_farm_and_on_station_trials_for_cult~
#> 2 heritability_in_plant_breeding_on_a_genotype_difference_basis_2019_end
#> 3 hritability_in_plant_breeding_on_a_genotype_difference_basis_2019_end
#> 4 heritability_in_plant_breeding_on_a_genotype_difference_basis_2019_end
#> 5 estimating_broad_sense_heritability_with_unbalanced_data_from_agricultural_cu~
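
To make the similarity percentage in the output above more concrete, the following sketch computes a normalized edit-distance similarity for the two clean_titles that were flagged as potential duplicates. This page does not state which similarity metric the function uses internally, so the formula here is an assumption chosen for illustration; `title_similarity` is a hypothetical helper, and only base R's `utils::adist()` is used.

```r
# Assumption: similarity = 1 - (edit distance / length of the longer string).
# Illustrative stand-in only; not necessarily the metric used by the package.
title_similarity <- function(a, b) {
  d <- utils::adist(a, b)[1, 1]   # generalized Levenshtein distance
  1 - d / max(nchar(a), nchar(b))
}

t1 <- "heritability_in_plant_breeding_on_a_genotype_difference_basis_2019_end"
t2 <- "hritability_in_plant_breeding_on_a_genotype_difference_basis_2019_end"

sim <- title_similarity(t1, t2)
round(100 * sim, 1)  # 98.6
sim >= 0.6           # TRUE: the pair clears the default minSimilarity
```

With this metric, a single missing character in a 70-character clean_title gives 98.6%, so the pair clears the default minSimilarity of 0.6 by a wide margin.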