Bayesian spam or ham filtering for scraped content. Email junk filtering but for SCM

Tim · November 20, 2025, 6:46pm

I am looking at ways to improve the content filtering when scraping online content.

One of the ways I am investigating is Bayesian filtering, i.e. the "Mark as spam/junk" button you have in your email client.

Basically, it works by you training the system on which content is good (ham) and which is irrelevant (spam).

The UI looks like this:

As you mark items as spam or ham, the system is then able to calculate a spam score for sentences you have not trained it on.

You can mark a few sentences as spam and the system will use this to mark other similar sentences as spam for you.
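To make the mechanic concrete, here is a minimal sketch of a sentence-level naive Bayes filter with add-one smoothing. It is only an illustration of the general technique, not how the feature is actually implemented; the names `SentenceFilter`, `train`, `spam_score` and `tokenize` are made up for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


class SentenceFilter:
    """Hypothetical naive Bayes spam/ham filter (illustration only)."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.sentence_counts = {"spam": 0, "ham": 0}

    def train(self, sentence, label):
        """Record one sentence the user marked as 'spam' or 'ham'."""
        self.word_counts[label].update(tokenize(sentence))
        self.sentence_counts[label] += 1

    def spam_score(self, sentence):
        """Return an estimated probability (0 to 1) that the sentence is spam."""
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_sentences = sum(self.sentence_counts.values())
        tokens = tokenize(sentence)
        log_prob = {}
        for label in ("spam", "ham"):
            # Smoothed prior: how often this label was chosen overall.
            prior = (self.sentence_counts[label] + 1) / (total_sentences + 2)
            log_prob[label] = math.log(prior)
            total_words = sum(self.word_counts[label].values())
            for word in tokens:
                # Add-one smoothing so unseen words never give a zero probability.
                count = self.word_counts[label][word]
                log_prob[label] += math.log((count + 1) / (total_words + len(vocab) + 1))
        # Turn the two log scores into a probability; clamp to avoid overflow.
        diff = max(min(log_prob["ham"] - log_prob["spam"], 700.0), -700.0)
        return 1.0 / (1.0 + math.exp(diff))


# Example: train on a few marked sentences, then score unseen ones.
f = SentenceFilter()
f.train("Click here to subscribe to our newsletter", "spam")
f.train("Share this article on social media", "spam")
f.train("Bayesian filters estimate word probabilities from labelled examples", "ham")
f.train("The classifier scores each sentence using those probabilities", "ham")

print(f.spam_score("Subscribe to our newsletter today"))   # above 0.5 (spam-like)
print(f.spam_score("The filter estimates probabilities"))  # below 0.5 (ham-like)
```

With no training at all the score is 0.5, and it moves toward 0 or 1 as more sentences are marked.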

The idea is that by engaging in this simple mechanic you get better content in your final articles… in addition to the ignore and keep keywords options already present.

I will be looking at using this to calculate some kind of relevance or content score for every keyword you scrape content for, so you can see how good the generated articles are.
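One possible way to turn per-sentence spam scores into a single content score is to take the fraction of sentences that fall below a spam threshold. This is just a sketch of one option, not a decided design; `content_score`, the `threshold` value and the toy scorer below are all assumptions for illustration.

```python
def content_score(sentences, spam_score, threshold=0.5):
    """Return the fraction of sentences whose spam score is below the threshold.

    `spam_score` is any function mapping a sentence to a value from 0 to 1
    (hypothetical here; a trained Bayesian filter would supply it).
    """
    if not sentences:
        return 0.0
    kept = [s for s in sentences if spam_score(s) < threshold]
    return len(kept) / len(sentences)


# Toy scorer standing in for a trained filter (assumption for the example).
def toy_scorer(sentence):
    return 0.9 if "subscribe" in sentence.lower() else 0.1


article = [
    "Bayesian filters learn from marked examples.",
    "Subscribe to our newsletter!",
    "Each sentence gets a spam score.",
    "Low scoring sentences are kept.",
]
print(content_score(article, toy_scorer))  # 0.75
```

A mean of `1 - spam_score` would work too; the threshold version is simply easier to read as "share of usable sentences".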