Geonames Toponym Extractor Utility
Simple script for extracting ASCII toponym fields from geonames datasets
– Created: April 21, 2024 UTC
– Edited: July 28, 2024 UTC
– Tags: Python, Script, Programming
Small script I used for extracting data for machine learning endeavors.
Usage:
./extractor.py dataset feature_class [feature_code] [--dirty] [--filter=mask]
From this invocation …
./extractor.py datasets/UA.txt P PPL --filter=0123456789\"\'-\` > UA-prep.txt
… it produces a newline-separated list of relevant toponyms of a particular kind, such as:
Katerynivka
Vaniushkyne
Svistuny
Sopych
Shilova Balka
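The extraction step can be sketched roughly as below. This is a minimal sketch, not the script's actual code: it assumes the standard tab-separated GeoNames dump layout, where the third column is asciiname, the seventh is the feature class and the eighth is the feature code. The function name iter_toponyms and the sample rows are made up for illustration.

```python
import io

# Column positions in the tab-separated GeoNames dump (0-based).
ASCIINAME, FEATURE_CLASS, FEATURE_CODE = 2, 6, 7


def iter_toponyms(lines, feature_class, feature_code=None):
    """Yield the asciiname of every row matching the class (and code, if given)."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= FEATURE_CODE:
            continue  # skip malformed or truncated rows
        if fields[FEATURE_CLASS] != feature_class:
            continue
        if feature_code is not None and fields[FEATURE_CODE] != feature_code:
            continue
        if fields[ASCIINAME]:
            yield fields[ASCIINAME]


# Two made-up rows in the dump layout (only the first eight columns).
sample = io.StringIO(
    "1\tSopych\tSopych\t\t0\t0\tP\tPPL\n"
    "2\tSome River\tSome River\t\t0\t0\tH\tSTM\n"
)
print(list(iter_toponyms(sample, "P", "PPL")))
```

Leaving feature_code as None matches every code within the class, which mirrors the optional [feature_code] argument in the usage line.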
The --filter= option is there so that the alphabet size can be reduced for learning purposes, as there are usually quite a lot of symbols that occur only a few times, which produces poor balancing.
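How the mask is applied is not spelled out here. One plausible reading, sketched below, is that any toponym containing a masked character is dropped entirely. That behaviour is an assumption, and apply_mask is a hypothetical helper rather than part of the script.

```python
def apply_mask(names, mask):
    """Keep only names that contain none of the characters in `mask`."""
    banned = set(mask)
    return [name for name in names if banned.isdisjoint(name)]


# Assumed behaviour: names with digits, quotes, hyphens or backticks are
# discarded; everything else passes through unchanged.
print(apply_mask(["Sopych", "Sector 7", "Shilova Balka"], "0123456789\"'-`"))
```

If the script instead strips the masked characters from each name, the same set could be used with str.translate, but dropping whole names avoids producing truncated toponyms.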
The --dirty option reduces cases such as "Maydan (Ispas)" and "CHAYKA-Transmitter, Ring Mast 4" to "Maydan" and "CHAYKA-Transmitter".
Duplicates are also removed.
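The two clean-up steps can be sketched as follows, assuming the reduction simply cuts a name at the first opening parenthesis or comma, and that deduplication keeps the first occurrence in input order. Both are guesses about the behaviour, and simplify and unique are illustrative names.

```python
import re

# Everything from the first "(" or "," onwards is treated as a qualifier,
# together with any whitespace right before it.
_QUALIFIER = re.compile(r"\s*[(,].*$")


def simplify(name):
    """Cut a trailing parenthesised or comma-separated qualifier."""
    return _QUALIFIER.sub("", name).strip()


def unique(names):
    """Drop repeated names, keeping first occurrences in order."""
    return list(dict.fromkeys(names))


raw = ["Maydan (Ispas)", "CHAYKA-Transmitter, Ring Mast 4", "Maydan"]
print(unique(simplify(n) for n in raw))
```

Deduplicating after simplification matters here: "Maydan (Ispas)" and "Maydan" only collapse into one entry once the qualifier is gone.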