dedup submodule

almasru.dedup.convert_to_numeric(txt: str) str

Convert a text number to a numeric string.

Parameters:

txt – string to convert

Returns:

numeric string

almasru.dedup.evaluate_completeness(bib1: Dict[str, Any], bib2: Dict[str, Any]) float

Return a similarity score for two bib records based on the number of available fields.

Parameters:
  • bib1 – dict containing the data of a bib record

  • bib2 – dict containing the data of a bib record

Returns:

similarity score between two bib records as float

almasru.dedup.evaluate_editions(texts1: List[str], texts2: List[str]) float

Return the result of the evaluation of similarity of two editions. If numbers are available, they are preferred over texts.

Parameters:
  • texts1 – list of editions to compare

  • texts2 – list of editions to compare

Returns:

similarity score between two editions as float
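To illustrate what preferring numbers over texts can look like, here is a minimal sketch for a single pair of edition statements. The helper name `edition_score`, the digit-extraction rule, and the `difflib` fallback are assumptions made for illustration; they are not taken from the library.

```python
import re
from difflib import SequenceMatcher


def edition_score(text1: str, text2: str) -> float:
    """Hypothetical helper: compare numbers when both texts contain any."""
    nums1 = set(re.findall(r"\d+", text1))
    nums2 = set(re.findall(r"\d+", text2))
    if nums1 and nums2:
        # Assumption: one shared number counts as a full match.
        return 1.0 if nums1 & nums2 else 0.0
    # Fall back to a plain text ratio when either side has no number.
    return SequenceMatcher(None, text1.lower(), text2.lower()).ratio()


print(edition_score("2nd edition", "2. Aufl."))
print(edition_score("Revised edition", "Revised edition"))
```

With this rule, "2nd edition" and "2. Aufl." match on the number alone, although the texts differ.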

almasru.dedup.evaluate_extent(extent1: List[int], extent2: List[int]) float

Return the result of the evaluation of similarity of two extents.

Parameters:
  • extent1 – list of extents to compare

  • extent2 – list of extents to compare

Returns:

similarity score between two extents as float

almasru.dedup.evaluate_format(format1: str, format2: str) float

Return the result of the evaluation of similarity of two formats.

Returns 1 if the formats are identical, 0 otherwise.

Parameters:
  • format1 – format to compare

  • format2 – format to compare

Returns:

similarity score between two formats as float

almasru.dedup.evaluate_identifiers(ids1: str, ids2: str) float

Return the result of the evaluation of similarity of two lists of identifiers.

Parameters:
  • ids1 – list of identifiers to compare

  • ids2 – list of identifiers to compare

Returns:

similarity score between two lists of identifiers as float

almasru.dedup.evaluate_is_analytical(format1: str, format2: str) float

Check if records are analytical records.

Parameters:
  • format1 – format to compare

  • format2 – format to compare

Returns:

0 if neither record is analytical, 1 if both are analytical, 0.5 otherwise.

almasru.dedup.evaluate_is_series(format1: str, format2: str) float | None

Check if records are series.

Parameters:
  • format1 – format to compare

  • format2 – format to compare

Returns:

nan if neither record is a series, 1 if both are series, 0.5 otherwise.
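The three-way return rule used by `evaluate_is_analytical` and `evaluate_is_series` can be sketched as follows. How a format string marks a record as analytical or as a series is not stated here, so the sketch takes plain booleans. The helper `three_way_flag_score` and its `neither` argument are hypothetical.

```python
import math


def three_way_flag_score(flag1: bool, flag2: bool, neither: float) -> float:
    """Hypothetical helper: 1 if both flags are set, 0.5 if only one is."""
    if flag1 and flag2:
        return 1.0
    if flag1 or flag2:
        return 0.5
    # Value for "neither": 0 for analytical records, nan for series.
    return neither


print(three_way_flag_score(True, False, neither=math.nan))
print(three_way_flag_score(False, False, neither=0.0))
```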

almasru.dedup.evaluate_language(lang1: str, lang2: str) float

Return the result of the evaluation of similarity of two languages.

Parameters:
  • lang1 – language to compare

  • lang2 – language to compare

Returns:

similarity score between two languages as float

almasru.dedup.evaluate_lists_names(names1: List[str], names2: List[str]) float

Return the score of the best pairing of names.

The function tests all possible pairings and returns the maximum value.

Parameters:
  • names1 – list of names to compare

  • names2 – list of names to compare

Returns:

similarity score between two lists of names as float
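A minimal sketch of the "test all pairings, keep the maximum" idea is shown below. It makes several assumptions: `pair_score` is a stand-in based on `difflib`, the score of a pairing is taken as the mean of its pair scores, and empty input returns 0. None of these details are confirmed for the library itself.

```python
from difflib import SequenceMatcher
from itertools import permutations
from typing import List


def pair_score(a: str, b: str) -> float:
    """Stand-in similarity for one pair of names (assumption)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_pairing_score(names1: List[str], names2: List[str]) -> float:
    """Hypothetical helper: try every one-to-one pairing, keep the best."""
    if not names1 or not names2:
        return 0.0  # assumption: nothing to pair
    shorter, longer = sorted((names1, names2), key=len)
    best = 0.0
    # Each permutation assigns a distinct name of `longer` to every
    # name of `shorter`.
    for candidate in permutations(longer, len(shorter)):
        total = sum(pair_score(a, b) for a, b in zip(shorter, candidate))
        best = max(best, total / len(shorter))
    return best


print(best_pairing_score(["Smith, John", "Doe, Jane"],
                         ["Doe, Jane", "Smith, John"]))
```

Because every pairing is enumerated, the order of names within each list does not affect the result. The cost grows factorially with list length, so this approach only suits short lists.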

almasru.dedup.evaluate_lists_texts(texts1: List[str], texts2: List[str]) float

Return the score of the best pairing of texts.

Parameters:
  • texts1 – list of texts to compare

  • texts2 – list of texts to compare

Returns:

similarity score between two lists of texts as float

almasru.dedup.evaluate_names(name1: str, name2: str) float

Return the result of the evaluation of similarity of two names.

Parameters:
  • name1 – name to compare

  • name2 – name to compare

Returns:

similarity score between two names as float

almasru.dedup.evaluate_parents(parent1: Dict, parent2: Dict) float

Evaluate similarity based on the link to the parent.

Keys of the parent dictionary:
  • title: title of the parent

  • issn: content of $x

  • isbn: content of $z

  • number: content of $g no:<content>

  • year: content of $g yr:<content>, or the first 4-digit number in a $g

  • parts: longest list of numbers in a $g

Parameters:
  • parent1 – dictionary with parent information

  • parent2 – dictionary with parent information

Returns:

similarity score between two parents
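The rules for the keys derived from $g can be sketched for a single $g value as follows. The helper `parse_g_subfield`, the returned types, and the exact regular expressions are assumptions for illustration. Choosing the longest list across several $g subfields is not covered.

```python
import re
from typing import Any, Dict


def parse_g_subfield(g: str) -> Dict[str, Any]:
    """Hypothetical helper: derive number, year and parts from one $g."""
    result: Dict[str, Any] = {"number": None, "year": None, "parts": []}
    if g.startswith("no:"):
        result["number"] = g[3:]
    elif g.startswith("yr:"):
        result["year"] = g[3:]
    else:
        # Fallback: the first standalone 4-digit number is read as the year.
        match = re.search(r"\b\d{4}\b", g)
        if match:
            result["year"] = match.group()
        # All numbers found in the subfield, in order of appearance.
        result["parts"] = re.findall(r"\d+", g)
    return result


print(parse_g_subfield("yr:1999"))
print(parse_g_subfield("Vol. 12 (1999), 3"))
```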

almasru.dedup.evaluate_similarity(bib1: Dict[str, Any], bib2: Dict[str, Any]) Dict[str, float]

The function returns a dictionary with keys corresponding to the fields of the bib records and values corresponding to the similarity score of the fields.

Parameters:
  • bib1 – Dict[str, Any] containing the data of a bib record

  • bib2 – Dict[str, Any] containing the data of a bib record

Returns:

result of the evaluation of similarity of two bib records, as a dictionary of per-field scores

almasru.dedup.evaluate_sysnums(val1, val2)

Return the result of the evaluation of similarity of two lists of system numbers.

It considers only the system numbers with the same prefix.

Parameters:
  • val1 – list of system numbers to compare

  • val2 – list of system numbers to compare

Returns:

similarity score between two lists of system numbers as float
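Restricting the comparison to system numbers with the same prefix could look like the sketch below. It assumes system numbers of the form `(PREFIX)number`, returns nan when the two lists share no prefix, and scores 1 for any exact match. The helper names, the example prefixes, and these rules are hypothetical.

```python
import math
import re
from typing import Dict, List, Set


def group_by_prefix(sysnums: List[str]) -> Dict[str, Set[str]]:
    """Hypothetical helper: map each prefix to the numbers that carry it."""
    groups: Dict[str, Set[str]] = {}
    for sysnum in sysnums:
        match = re.match(r"\(([^)]+)\)(.+)", sysnum)
        if match:
            groups.setdefault(match.group(1), set()).add(match.group(2))
    return groups


def sysnum_score(val1: List[str], val2: List[str]) -> float:
    groups1, groups2 = group_by_prefix(val1), group_by_prefix(val2)
    shared = groups1.keys() & groups2.keys()
    if not shared:
        return math.nan  # assumption: nothing comparable
    # Only numbers under a shared prefix are compared.
    matched = any(groups1[p] & groups2[p] for p in shared)
    return 1.0 if matched else 0.0


print(sysnum_score(["(IDSBB)001", "(RERO)77"], ["(IDSBB)001"]))
```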

almasru.dedup.evaluate_texts(text1: str, text2: str) float

Return the result of the evaluation of similarity of two texts.

Parameters:
  • text1 – text to compare

  • text2 – text to compare

Returns:

similarity score between two texts as float

almasru.dedup.evaluate_year(year1: int, year2: int) float

Return the result of the evaluation of similarity of two years.

Parameters:
  • year1 – year to compare

  • year2 – year to compare

Returns:

similarity score between two years as float

almasru.dedup.get_ascii(txt: str) str

Return the ASCII version of a string.

Parameters:

txt – string to convert

Returns:

ASCII version of the string
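One common way to produce an ASCII version of a string uses the standard `unicodedata` module, as sketched below. Whether `get_ascii` works this way is an assumption; the helper `to_ascii` is hypothetical.

```python
import unicodedata


def to_ascii(txt: str) -> str:
    """Hypothetical helper: strip diacritics, drop non-ASCII characters."""
    # NFKD splits accented letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", txt)
    # Encoding with errors="ignore" discards everything outside ASCII.
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(to_ascii("Zürich café"))
```

Characters without an ASCII decomposition, such as "ß", are dropped entirely under this approach.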

almasru.dedup.get_similarity_score(bib1: Dict[str, Any], bib2: Dict[str, Any]) float

Get the similarity score between two bib records.

With classifiers, the function returns the similarity score between two bib records. The threshold to determine if two records are similar or not is 0.5.

Parameters:
  • bib1 – Dict[str, Any] containing the data of a bib record

  • bib2 – Dict[str, Any] containing the data of a bib record

  • clf – MLPClassifier used to predict the similarity score; if none is given, the function calculates the mean of the similarity scores of the fields

  • nan – value to use if the similarity score is NaN

Returns:

similarity score between two bib records as float
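The fallback without a classifier, described as the mean of the field scores with a replacement value for NaN, can be sketched as follows. The helper `mean_similarity`, the field names, and the choice of a strict `>` comparison against the 0.5 threshold are assumptions for illustration.

```python
import math
from typing import Dict


def mean_similarity(scores: Dict[str, float], nan: float) -> float:
    """Hypothetical helper: mean of field scores, NaN replaced by `nan`."""
    values = [nan if math.isnan(v) else v for v in scores.values()]
    return sum(values) / len(values) if values else nan


# Field names are made up for the example.
field_scores = {"title": 1.0, "year": math.nan, "language": 0.8}
score = mean_similarity(field_scores, nan=0.0)
print(score, "similar" if score > 0.5 else "not similar")
```

The replacement value matters: using 0 for NaN penalizes records with missing fields, while a neutral value such as 0.5 would not.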

almasru.dedup.get_unique_combinations(l1: List[str], l2: List[str]) List[List[Tuple]]

Used to search the best match with names like authors or publishers.

Parameters:
  • l1 – list of names to compare

  • l2 – list of names to compare

Returns:

list of unique combinations of names

almasru.dedup.handling_missing_values(fn: Callable) Callable

Decorator to handle missing values.
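A decorator of this kind might look like the sketch below. It assumes two-argument evaluation functions, treats None, NaN, and empty containers or strings as missing, and returns nan in that case. All of these choices, along with the names `handle_missing` and `exact_match`, are hypothetical.

```python
import functools
import math
from typing import Any, Callable


def is_missing(value: Any) -> bool:
    """Assumption: None, NaN and empty strings or containers are missing."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return isinstance(value, (str, list, tuple, dict, set)) and len(value) == 0


def handle_missing(fn: Callable[[Any, Any], float]) -> Callable[[Any, Any], float]:
    @functools.wraps(fn)
    def wrapper(val1: Any, val2: Any) -> float:
        # Skip the wrapped evaluation when either input is missing.
        if is_missing(val1) or is_missing(val2):
            return math.nan
        return fn(val1, val2)
    return wrapper


@handle_missing
def exact_match(val1: str, val2: str) -> float:
    return 1.0 if val1 == val2 else 0.0


print(exact_match("ger", "ger"), exact_match(None, "ger"))
```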