Geonames Toponym Extractor Utility

Simple script for extracting ASCII toponym fields from geonames datasets

– Created: April 21, 2024 UTC

– Edited: July 28, 2024 UTC

– Tags: Python, Script, Programming


Link to code

Small script I used for extracting data for machine learning endeavors.

Usage:

dataset feature_class [feature_code] [--dirty] [--filter=mask]

From this invocation …

./extractor.py datasets/UA.txt P PPL --filter=0123456789\"\'-\` > UA-prep.txt

… it produces a newline-separated list of relevant toponyms of a particular kind, such as:

Katerynivka
Vaniushkyne
Svistuny
Sopych
Shilova Balka
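
The selection step can be sketched roughly as follows. This is not the script's actual code, only a minimal illustration that assumes the standard tab-separated geonames dump layout, where the ASCII name is the third column and the feature class and feature code are the seventh and eighth. The function name extract_toponyms is hypothetical.

```python
# Column positions in the tab-separated geonames dump (0-based).
ASCIINAME, FEATURE_CLASS, FEATURE_CODE = 2, 6, 7


def extract_toponyms(lines, feature_class, feature_code=None):
    """Yield the ASCII name of every row matching the class (and code, if given).

    Hypothetical helper, not taken from the original script.
    """
    for line in lines:
        row = line.rstrip("\n").split("\t")
        if len(row) <= FEATURE_CODE:
            continue  # skip malformed or truncated rows
        if row[FEATURE_CLASS] != feature_class:
            continue
        if feature_code is not None and row[FEATURE_CODE] != feature_code:
            continue
        if row[ASCIINAME]:
            yield row[ASCIINAME]
```

With a file opened as `open("datasets/UA.txt", encoding="utf-8")`, calling `extract_toponyms(f, "P", "PPL")` would yield the populated-place names one by one.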

The --filter= option is there to reduce the alphabet size for learning purposes: there are usually quite a lot of symbols that occur only a few times, which leaves the character distribution poorly balanced.
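
To illustrate the idea, here is a small sketch of how a mask could be chosen and applied. It assumes that a name containing any masked character is dropped entirely; the script may instead strip the characters, so treat this as one possible reading. Both helper names (rare_characters, passes_filter) are made up for this example.

```python
from collections import Counter


def rare_characters(names, min_count):
    """Return the characters that occur fewer than min_count times overall."""
    counts = Counter(ch for name in names for ch in name)
    return "".join(sorted(ch for ch, n in counts.items() if n < min_count))


def passes_filter(name, mask):
    """Return True if the name contains none of the characters in mask.

    Assumption: masked characters disqualify the whole name.
    """
    return set(mask).isdisjoint(name)
```

A mask built this way can then be passed on the command line, or applied directly with `[n for n in names if passes_filter(n, mask)]`.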

The --dirty option reduces names such as "Maydan (Ispas)" and "CHAYKA-Transmitter, Ring Mast 4" to "Maydan" and "CHAYKA-Transmitter".
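
One way to get this behaviour is to cut the name at the first opening parenthesis or comma. The sketch below only reproduces the two examples above; the real rule in the script may differ, and strip_qualifier is a hypothetical name.

```python
import re


def strip_qualifier(name):
    """Keep only the part of the name before the first '(' or ','."""
    return re.split(r"[(,]", name, maxsplit=1)[0].strip()
```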

Duplicates are also removed.
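
A simple way to drop duplicates while keeping the first occurrence of each name in place is to go through a dict. Whether the script preserves the input order is an assumption here, and unique_names is a hypothetical helper.

```python
def unique_names(names):
    """Remove duplicates, keeping the first occurrence of each name in order."""
    return list(dict.fromkeys(names))
```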