dedup
submodule
- almasru.dedup.convert_to_numeric(txt: str) str
Convert a text number to a numeric string.
- Parameters:
txt – string to convert
- Returns:
numeric string
- almasru.dedup.evaluate_completeness(bib1: Dict[str, Any], bib2: Dict[str, Any]) float
Return a similarity score for two bib records based on the number of available fields.
- Parameters:
bib1 – dict containing the data of a bib record
bib2 – dict containing the data of a bib record
- Returns:
similarity score between two bib records as float
- almasru.dedup.evaluate_editions(texts1: List[str], texts2: List[str]) float
Return the result of the evaluation of similarity of two editions. If numbers are available, they are preferred to texts.
- Parameters:
texts1 – list of editions to compare
texts2 – list of editions to compare
- Returns:
similarity score between two editions as float
- almasru.dedup.evaluate_extent(extent1: List[int], extent2: List[int]) float
Return the result of the evaluation of similarity of two extents.
- Parameters:
extent1 – list of extents to compare
extent2 – list of extents to compare
- Returns:
similarity score between two extents as float
- almasru.dedup.evaluate_format(format1: str, format2: str) float
Return the result of the evaluation of similarity of two formats.
Returns 1 if the formats are the same, 0 otherwise.
- Parameters:
format1 – format to compare
format2 – format to compare
- Returns:
similarity score between two formats as float
- almasru.dedup.evaluate_identifiers(ids1: str, ids2: str) float
Return the result of the evaluation of similarity of two lists of identifiers.
- Parameters:
ids1 – list of identifiers to compare
ids2 – list of identifiers to compare
- Returns:
similarity score between two lists of identifiers as float
- almasru.dedup.evaluate_is_analytical(format1: str, format2: str) float
Check whether the records are analytical records.
- Parameters:
format1 – format to compare
format2 – format to compare
- Returns:
0 if neither record is analytical, 1 if both are analytical, 0.5 otherwise.
- almasru.dedup.evaluate_is_series(format1: str, format2: str) float | None
Check whether the records are series.
- Parameters:
format1 – format to compare
format2 – format to compare
- Returns:
nan if neither record is a series, 1 if both are series, 0.5 otherwise.
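The three-valued result shared by evaluate_is_analytical and evaluate_is_series can be sketched as below. This is only an illustration of the documented return values: `score_shared_flag` and its boolean arguments are hypothetical, and how the library derives "analytical" or "series" from a format string is not shown here.

```python
import math


def score_shared_flag(flag1: bool, flag2: bool, none_value: float) -> float:
    """Score two records on a yes/no property (hypothetical helper).

    Returns 1.0 if both records have the property, 0.5 if exactly one
    has it, and `none_value` if neither has it.
    """
    if flag1 and flag2:
        return 1.0
    if flag1 or flag2:
        return 0.5
    return none_value


# Analytical check: "neither" is scored as 0.
both_analytical = score_shared_flag(True, True, none_value=0.0)
# Series check: "neither" is reported as NaN, i.e. no information.
no_series = score_shared_flag(False, False, none_value=math.nan)
```

Returning NaN rather than 0 for the series case lets a later aggregation step treat "neither is a series" as missing information instead of as a mismatch.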
- almasru.dedup.evaluate_language(lang1: str, lang2: str) float
Return the result of the evaluation of similarity of two languages.
- Parameters:
lang1 – language to compare
lang2 – language to compare
- Returns:
similarity score between two languages as float
- almasru.dedup.evaluate_lists_names(names1: List[str], names2: List[str]) float
Return the score of the best pairing of authors.
The function tests all possible pairings and returns the maximum value.
- Parameters:
names1 – list of names to compare
names2 – list of names to compare
- Returns:
similarity score between two lists of names as float
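The "test all pairings, keep the maximum" idea can be sketched as follows. This is not the library's implementation: `best_pairing_score` and `ratio` are hypothetical names, `difflib.SequenceMatcher` stands in for the real name comparison, and averaging the pair scores within one pairing is an assumption.

```python
from difflib import SequenceMatcher
from itertools import permutations
from typing import Callable, List


def ratio(a: str, b: str) -> float:
    """Stand-in string similarity in [0, 1] (assumption, not the library's scorer)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_pairing_score(
    names1: List[str],
    names2: List[str],
    score: Callable[[str, str], float] = ratio,
) -> float:
    """Try every one-to-one pairing of the shorter list with the longer
    list and return the highest mean pair score."""
    if not names1 or not names2:
        return 0.0
    short, long_ = sorted((names1, names2), key=len)
    best = 0.0
    for candidate in permutations(long_, len(short)):
        mean = sum(score(a, b) for a, b in zip(short, candidate)) / len(short)
        best = max(best, mean)
    return best


# The order of the names does not matter: the best pairing matches them up.
result = best_pairing_score(["Doe, John", "Smith, Anna"], ["Smith, Anna", "Doe, John"])
```

Enumerating permutations grows factorially with the list length, which is acceptable for the handful of authors or publishers a bib record typically carries.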
- almasru.dedup.evaluate_lists_texts(texts1: List[str], texts2: List[str]) float
Return the score of the best pairing of texts.
- Parameters:
texts1 – list of texts to compare
texts2 – list of texts to compare
- Returns:
similarity score between two lists of texts as float
- almasru.dedup.evaluate_names(name1: str, name2: str) float
Return the result of the evaluation of similarity of two names.
- Parameters:
name1 – name to compare
name2 – name to compare
- Returns:
similarity score between two names as float
- almasru.dedup.evaluate_parents(parent1: Dict, parent2: Dict) float
Evaluate similarity based on the link to the parent.
Keys of the parent dictionary:
  - title: title of the parent
  - issn: content of $x
  - isbn: content of $z
  - number: content of $g no:<content>
  - year: content of $g yr:<content>, or the first 4-digit number in a $g
  - parts: longest list of numbers in a $g
- Parameters:
parent1 – dictionary with parent information
parent2 – dictionary with parent information
- Returns:
similarity score between two parents
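To make the $g-derived keys concrete, here is a minimal sketch of how number, year and parts could be extracted from a list of $g values. `parse_g_subfields` is a hypothetical helper, not part of almasru; the precedence of an explicit yr: value over a free-text year, and returning parts as integers, are assumptions.

```python
import re
from typing import Any, Dict, List


def parse_g_subfields(g_values: List[str]) -> Dict[str, Any]:
    """Build the number, year and parts keys from $g values (hypothetical)."""
    parent: Dict[str, Any] = {}
    for g in g_values:
        g = g.strip()
        # parts: keep the longest list of numbers found in a single $g.
        numbers = [int(n) for n in re.findall(r"\d+", g)]
        if len(numbers) > len(parent.get("parts", [])):
            parent["parts"] = numbers
        if g.startswith("no:"):
            parent["number"] = g[3:].strip()
        elif g.startswith("yr:"):
            # Assumption: an explicit year overrides one guessed from free text.
            parent["year"] = g[3:].strip()
        elif "year" not in parent:
            # Fallback: first standalone 4-digit number in the $g.
            match = re.search(r"\b\d{4}\b", g)
            if match:
                parent["year"] = match.group()
    return parent


parsed = parse_g_subfields(["no:12", "yr:1999", "Vol. 12 (1999), no. 3, p. 45-67"])
# parsed == {"parts": [12, 1999, 3, 45, 67], "number": "12", "year": "1999"}
```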
- almasru.dedup.evaluate_similarity(bib1: Dict[str, Any], bib2: Dict[str, Any]) Dict[str, float]
Return the result of the evaluation of similarity of two bib records.
The function returns a dictionary with keys corresponding to the fields of the bib records and values corresponding to the similarity score of the fields.
- Parameters:
bib1 – Dict[str, Any] containing the data of a bib record
bib2 – Dict[str, Any] containing the data of a bib record
- Returns:
dictionary mapping each field of the bib records to its similarity score
- almasru.dedup.evaluate_sysnums(val1, val2)
Return the result of the evaluation of similarity of two lists of system numbers.
It considers only the system numbers with the same prefix.
- Parameters:
ids1 – list of system numbers to compare
ids2 – list of system numbers to compare
- Returns:
similarity score between two lists of system numbers as float
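Restricting the comparison to system numbers with the same prefix could look like the sketch below. Everything here is an assumption made for illustration: the "(PREFIX)number" notation, the helper names, and the choice to return NaN when the two lists share no prefix.

```python
import math
import re
from collections import defaultdict
from typing import Dict, List, Set

# Assumed notation: "(PREFIX)number", e.g. "(OCoLC)123".
PREFIX_RE = re.compile(r"^\((?P<prefix>[^)]+)\)(?P<number>.+)$")


def group_by_prefix(sysnums: List[str]) -> Dict[str, Set[str]]:
    """Group system numbers by their prefix; entries without one are skipped."""
    groups: Dict[str, Set[str]] = defaultdict(set)
    for sysnum in sysnums:
        match = PREFIX_RE.match(sysnum.strip())
        if match:
            groups[match.group("prefix")].add(match.group("number").strip())
    return groups


def compare_sysnums(ids1: List[str], ids2: List[str]) -> float:
    """1.0 if a shared prefix has a common number, 0.0 if prefixes are
    shared but no number matches, NaN if no prefix is shared."""
    groups1, groups2 = group_by_prefix(ids1), group_by_prefix(ids2)
    shared = set(groups1) & set(groups2)
    if not shared:
        return math.nan
    return 1.0 if any(groups1[p] & groups2[p] for p in shared) else 0.0


match_score = compare_sysnums(["(OCoLC)123", "(RERO)R001"], ["(OCoLC)123"])
```

Numbers from different source systems are never compared with each other here, so an accidental clash such as "(OCoLC)123" versus "(RERO)123" does not count as a match.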
- almasru.dedup.evaluate_texts(text1: str, text2: str) float
Return the result of the evaluation of similarity of two texts.
- Parameters:
text1 – text to compare
text2 – text to compare
- Returns:
similarity score between two texts as float
- almasru.dedup.evaluate_year(year1: int, year2: int) float
Return the result of the evaluation of similarity of two years.
- Parameters:
year1 – year to compare
year2 – year to compare
- Returns:
similarity score between two years as float
- almasru.dedup.get_ascii(txt: str) str
Return the ascii version of a string.
- Parameters:
txt – string to convert
- Returns:
ascii version of the string
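A common way to obtain an ASCII version of a string with the standard library is shown below. It is a sketch of the general technique, and the library may do this differently; `to_ascii` is a hypothetical name.

```python
import unicodedata


def to_ascii(txt: str) -> str:
    """Strip diacritics by decomposing characters and dropping non-ASCII bytes."""
    decomposed = unicodedata.normalize("NFKD", txt)
    return decomposed.encode("ascii", "ignore").decode("ascii")


plain = to_ascii("Zürich café")
# plain == "Zurich cafe"
```

Characters without an ASCII decomposition (for example "ß") are dropped entirely by this approach, which matters when the result feeds a text comparison.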
- almasru.dedup.get_similarity_score(bib1: Dict[str, Any], bib2: Dict[str, Any]) float
Get the similarity score between two bib records.
If a classifier is provided, it is used to predict the similarity score between the two bib records. The threshold to determine whether two records are similar is 0.5.
- Parameters:
bib1 – Dict[str, Any] containing the data of a bib record
bib2 – Dict[str, Any] containing the data of a bib record
clf – MLPClassifier used to predict the similarity score; if none is given, the function calculates the mean of the similarity scores of the fields
nan – value to use if the similarity score is NaN
- Returns:
similarity score between two bib records as float
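The fallback used when no classifier is given, a mean over the per-field scores with NaN values replaced, can be sketched as follows. `mean_similarity` is a hypothetical helper and the default replacement value of 0.0 is an assumption; the input has the shape of the dictionary returned by evaluate_similarity.

```python
import math
from typing import Dict


def mean_similarity(scores: Dict[str, float], nan: float = 0.0) -> float:
    """Average per-field scores, substituting `nan` for missing (NaN) scores."""
    values = [nan if math.isnan(value) else value for value in scores.values()]
    if not values:
        return nan
    return sum(values) / len(values)


# Field names are illustrative only.
field_scores = {"title": 1.0, "year": 0.5, "isbn": math.nan}
score = mean_similarity(field_scores, nan=0.0)
# score == 0.5
```

The value chosen for `nan` shifts the result noticeably: with a replacement of 0.0 a missing field counts as a mismatch, while a replacement of 0.5 would treat it as neutral relative to the 0.5 threshold.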
- almasru.dedup.get_unique_combinations(l1: List[str], l2: List[str]) List[List[Tuple]]
Used to search the best match with names like authors or publishers.
- Parameters:
l1 – list of names to compare
l2 – list of names to compare
- Returns:
list of unique combinations of names
- almasru.dedup.handling_missing_values(fn: Callable) Callable
Decorator to handle missing values.
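The general shape of such a decorator is sketched below. The documentation does not say what the wrapped function returns for missing input, so the NaN result, the definition of "missing" (None or empty), and the names `skip_missing` and `same_format` are all assumptions.

```python
import functools
import math
from typing import Any, Callable


def _is_missing(value: Any) -> bool:
    """Treat None and empty containers or strings as missing (assumption)."""
    return value is None or (hasattr(value, "__len__") and len(value) == 0)


def skip_missing(fn: Callable[[Any, Any], float]) -> Callable[[Any, Any], float]:
    """Return NaN instead of calling `fn` when either argument is missing."""

    @functools.wraps(fn)
    def wrapper(val1: Any, val2: Any) -> float:
        if _is_missing(val1) or _is_missing(val2):
            return math.nan
        return fn(val1, val2)

    return wrapper


@skip_missing
def same_format(format1: str, format2: str) -> float:
    """Toy comparison: 1.0 if the formats are identical, 0.0 otherwise."""
    return float(format1 == format2)


missing = same_format(None, "Book")    # NaN, the comparison is skipped
present = same_format("Book", "Book")  # 1.0
```

Centralising the check in a decorator keeps each evaluate_* function free of repeated guard clauses and gives every field the same "no information" value.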