`dedup` submodule

almasru.dedup.convert_to_numeric(str) → str

Convert a text number to a numeric string.

Parameters:

txt – string to convert

Returns:

numeric string

almasru.dedup.evaluate_completeness(bib1: Dict[str, Any], bib2: Dict[str, Any]) → float

Return a similarity score for two bib records based on the number of fields available in each.

Parameters:

bib1 – dict containing the data of a bib record
bib2 – dict containing the data of a bib record

Returns:

similarity score between two bib records as float

almasru.dedup.evaluate_editions(texts1: List[str], texts2: List[str]) → float

Return the result of the evaluation of similarity of two editions. If numbers are available, they are preferred over text.

Parameters:

texts1 – list of editions to compare
texts2 – list of editions to compare

Returns:

similarity score between two editions as float

almasru.dedup.evaluate_extent(extent1: List[int], extent2: List[int]) → float

Return the result of the evaluation of similarity of two extents.

Parameters:

extent1 – list of extents to compare
extent2 – list of extents to compare

Returns:

similarity score between two extents as float

almasru.dedup.evaluate_format(format1: str, format2: str) → float

Return the result of the evaluation of similarity of two formats.

Returns 1 if the formats are the same, 0 otherwise.

Parameters:

format1 – format to compare
format2 – format to compare

Returns:

similarity score between two formats as float

almasru.dedup.evaluate_identifiers(ids1: str, ids2: str) → float

Return the result of the evaluation of similarity of two lists of identifiers.

Parameters:

ids1 – list of identifiers to compare
ids2 – list of identifiers to compare

Returns:

similarity score between two lists of identifiers as float

almasru.dedup.evaluate_is_analytical(format1: str, format2: str) → float

Check whether the records are analytical records.

Parameters:

format1 – format to compare
format2 – format to compare

Returns:

0 if neither record is analytical, 1 if both are analytical, 0.5 otherwise.

almasru.dedup.evaluate_is_series(format1: str, format2: str) → float | None

Check whether the records are series.

Parameters:

format1 – format to compare
format2 – format to compare

Returns:

NaN if neither record is a series, 1 if both are series, 0.5 otherwise.
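
The scoring rule shared by the analytical and series checks can be sketched as follows. This is a minimal illustration, not the library's code: `three_way_flag_score` and `looks_like_series` are hypothetical names, and the substring test standing in for series detection is an assumption about what the format strings contain.

```python
import math


def three_way_flag_score(flag1: bool, flag2: bool, neither: float) -> float:
    """Return 1 if both flags are set, 0.5 if only one is, `neither` if none is."""
    if flag1 and flag2:
        return 1.0
    if flag1 or flag2:
        return 0.5
    return neither


def looks_like_series(fmt: str) -> bool:
    # Hypothetical predicate: the real detection logic is not documented here.
    return "series" in fmt.lower()


# Series check: NaN when neither record is a series.
score = three_way_flag_score(
    looks_like_series("Book / Series"), looks_like_series("Book"), math.nan
)
```

Passing `0.0` instead of `math.nan` as the last argument gives the behaviour described for the analytical check.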

almasru.dedup.evaluate_language(lang1: str, lang2: str) → float

Return the result of the evaluation of similarity of two languages.

Parameters:

lang1 – language to compare
lang2 – language to compare

Returns:

similarity score between two languages as float

almasru.dedup.evaluate_lists_names(names1: List[str], names2: List[str]) → float

Return the score of the best pairing of author names.

The function tests all possible pairings and returns the maximum value.

Parameters:

names1 – list of names to compare
names2 – list of names to compare

Returns:

similarity score between two lists of names as float
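
The idea of testing every pairing and keeping the best one can be sketched as below. This is an illustration under stated assumptions rather than the library's implementation: `name_similarity` is a stand-in scorer built on `difflib`, `best_pairing_score` is a hypothetical name, and averaging the pair scores is an assumed way of aggregating a pairing.

```python
from difflib import SequenceMatcher
from itertools import permutations
from statistics import mean
from typing import List


def name_similarity(a: str, b: str) -> float:
    # Stand-in scorer (assumption): character-level ratio between 0 and 1.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_pairing_score(names1: List[str], names2: List[str]) -> float:
    """Try every pairing of the shorter list against the longer one, keep the max."""
    if not names1 or not names2:
        return float("nan")
    short, long_ = sorted((names1, names2), key=len)
    best = 0.0
    # Each permutation assigns one distinct name of the longer list
    # to every name of the shorter list.
    for perm in permutations(long_, len(short)):
        score = mean(name_similarity(a, b) for a, b in zip(short, perm))
        best = max(best, score)
    return best


result = best_pairing_score(
    ["Smith, John", "Doe, Jane"], ["Doe, Jane", "Smith, John"]
)
```

Because the number of pairings grows factorially with list length, an exhaustive search like this is only practical for short lists such as authors or publishers.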

almasru.dedup.evaluate_lists_texts(texts1: List[str], texts2: List[str]) → float

Return the score of the best pairing of texts.

Parameters:

texts1 – list of texts to compare
texts2 – list of texts to compare

Returns:

similarity score between two lists of texts as float

almasru.dedup.evaluate_names(name1: str, name2: str) → float

Return the result of the evaluation of similarity of two names.

Parameters:

name1 – name to compare
name2 – name to compare

Returns:

similarity score between two names as float

almasru.dedup.evaluate_parents(parent1: Dict, parent2: Dict) → float

Evaluate similarity based on the link to the parent.

Keys of the parent dictionary:

- title: title of the parent
- issn: content of $x
- isbn: content of $z
- number: content of $g no:<content>
- year: content of $g yr:<content>, or the first four-digit number in a $g
- parts: longest list of numbers in a $g

Parameters:

parent1 – dictionary with parent information
parent2 – dictionary with parent information

Returns:

similarity score between two parents
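
As an illustration of how the year key described above could be derived from a $g subfield, here is a minimal sketch. It is not the library's parser: `year_from_g` is a hypothetical helper, the sample $g strings are invented, and reading the fallback as "the first four-digit number" is an assumption.

```python
import re
from typing import Optional


def year_from_g(subfield_g: str) -> Optional[int]:
    """Prefer an explicit yr:<content> value, else the first four-digit number."""
    match = re.search(r"yr:(\d{4})", subfield_g)
    if match is None:
        # Fallback: first standalone run of exactly four digits.
        match = re.search(r"\b(\d{4})\b", subfield_g)
    return int(match.group(1)) if match else None


explicit = year_from_g("yr:2019 no:12")
fallback = year_from_g("Vol. 3 (1998), p. 12-45")
missing = year_from_g("no:12")
```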

almasru.dedup.evaluate_similarity(bib1: Dict[str, Any], bib2: Dict[str, Any]) → Dict[str, float]

The function returns a dictionary with keys corresponding to the fields of the bib records and values corresponding to the similarity score of the fields.

Parameters:

bib1 – Dict[str, Any] containing the data of a bib record
bib2 – Dict[str, Any] containing the data of a bib record

Returns:

dictionary mapping each field to the similarity score of the two bib records for that field

almasru.dedup.evaluate_sysnums(val1, val2)

Return the result of the evaluation of similarity of two lists of system numbers.

It considers only the system numbers with the same prefix.

Parameters:

val1 – list of system numbers to compare
val2 – list of system numbers to compare

Returns:

similarity score between two lists of system numbers as float
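
Restricting the comparison to system numbers with the same prefix might look like the sketch below. Everything here is assumed for illustration: the `(PREFIX)number` notation, the helper names `group_by_prefix` and `sysnum_score`, and the choice of returning NaN when no prefix is shared.

```python
import re
from typing import Dict, List, Set

# Assumed notation: "(PREFIX)number", e.g. "(OCoLC)123".
PREFIX_RE = re.compile(r"^\((?P<prefix>[^)]+)\)(?P<number>.+)$")


def group_by_prefix(sysnums: List[str]) -> Dict[str, Set[str]]:
    """Group system numbers by their prefix, ignoring unparsable entries."""
    groups: Dict[str, Set[str]] = {}
    for sysnum in sysnums:
        match = PREFIX_RE.match(sysnum.strip())
        if match:
            groups.setdefault(match["prefix"], set()).add(match["number"])
    return groups


def sysnum_score(nums1: List[str], nums2: List[str]) -> float:
    """Compare only numbers whose prefix occurs in both lists."""
    groups1, groups2 = group_by_prefix(nums1), group_by_prefix(nums2)
    shared = groups1.keys() & groups2.keys()
    if not shared:
        return float("nan")  # nothing comparable (assumed convention)
    return 1.0 if any(groups1[p] & groups2[p] for p in shared) else 0.0


same = sysnum_score(["(OCoLC)123", "(RERO)9"], ["(OCoLC)123"])
different = sysnum_score(["(OCoLC)123"], ["(OCoLC)456"])
```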

almasru.dedup.evaluate_texts(text1: str, text2: str) → float

Return the result of the evaluation of similarity of two texts.

Parameters:

text1 – text to compare
text2 – text to compare

Returns:

similarity score between two texts as float

almasru.dedup.evaluate_year(year1: int, year2: int) → float

Return the result of the evaluation of similarity of two years.

Parameters:

year1 – year to compare
year2 – year to compare

Returns:

similarity score between two years as float

almasru.dedup.get_ascii(txt: str) → str

Return the ascii version of a string.

Parameters:

txt – string to convert

Returns:

ascii version of the string
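
One common way to obtain an ASCII version of a string uses only the standard library, as sketched below. Whether `get_ascii` works this way is an assumption; `to_ascii_sketch` is a hypothetical name.

```python
import unicodedata


def to_ascii_sketch(txt: str) -> str:
    """Strip diacritics by decomposing characters and dropping non-ASCII parts."""
    # NFKD splits "ü" into "u" plus a combining diaeresis.
    decomposed = unicodedata.normalize("NFKD", txt)
    # Encoding with errors="ignore" discards whatever is not ASCII.
    return decomposed.encode("ascii", "ignore").decode("ascii")


converted = to_ascii_sketch("Zürich café")
```

With this approach, characters that have no decomposition (for example "ø" or "ß") are dropped rather than transliterated.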

almasru.dedup.get_similarity_score(bib1: Dict[str, Any], bib2: Dict[str, Any]) → float

Get the similarity score between two bib records.

With classifiers, the function returns the similarity score between two bib records. The threshold to determine if two records are similar or not is 0.5.

Parameters:

bib1 – Dict[str, Any] containing the data of a bib record
bib2 – Dict[str, Any] containing the data of a bib record
clf – MLPClassifier used to predict the similarity score; if none is given, the function calculates the mean of the similarity scores of the fields
nan – value to use if the similarity score is NaN

Returns:

similarity score between two bib records as float
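
The fallback described for the case without a classifier (mean of the field scores, with NaN replaced by a chosen value) can be sketched as follows. The field names, the default replacement of 0.0, and the `>=` comparison against the 0.5 threshold are assumptions; `mean_field_score` is a hypothetical helper.

```python
import math
from statistics import mean
from typing import Dict


def mean_field_score(scores: Dict[str, float], nan: float = 0.0) -> float:
    """Average per-field scores after replacing NaN with `nan`."""
    filled = [nan if math.isnan(value) else value for value in scores.values()]
    return mean(filled) if filled else nan


# Hypothetical per-field scores, shaped like the dictionary returned
# by evaluate_similarity.
field_scores = {"title": 0.9, "year": 1.0, "isbn": math.nan}
score = mean_field_score(field_scores, nan=0.0)
is_similar = score >= 0.5
```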

almasru.dedup.get_unique_combinations(l1: List[str], l2: List[str]) → List[List[Tuple]]

Used to search the best match with names like authors or publishers.

Parameters:

l1 – list of names to compare
l2 – list of names to compare

Returns:

list of unique combinations of names

almasru.dedup.handling_missing_values(fn: Callable) → Callable

Decorator to handle missing values.
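
A decorator of this kind could be written as in the sketch below. It is an illustration only: `skip_if_missing` and `_is_missing` are hypothetical names, and both the definition of "missing" and the choice of returning NaN are assumptions.

```python
import functools
import math
from typing import Any, Callable


def _is_missing(value: Any) -> bool:
    # Assumed definition of "missing": None, NaN, or an empty container/string.
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return isinstance(value, (str, list, dict)) and len(value) == 0


def skip_if_missing(fn: Callable[..., float]) -> Callable[..., float]:
    """Return NaN instead of calling `fn` when either input is missing."""

    @functools.wraps(fn)
    def wrapper(val1: Any, val2: Any) -> float:
        if _is_missing(val1) or _is_missing(val2):
            return math.nan
        return fn(val1, val2)

    return wrapper


@skip_if_missing
def same_year(year1: int, year2: int) -> float:
    return 1.0 if year1 == year2 else 0.0
```

Centralising the check in a decorator keeps each comparison function focused on the case where both values are present.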
