I spent the last 2 days seeing whether we can use small LMs to tag content and keep only the most relevant bits when scraping content from a search engine.
After a lot of comparison with the old content filter code, I found that, for now, small LMs don’t appreciably improve the relevance of the content.
However, not all is lost.
I was able to spend time tweaking the current content filtering and have made some nice improvements.
Long story short…
When you are scraping content from search inside the Article creator, the found content will be more relevant.
This means AI prompts and the scraped content generator will produce better, more relevant content.
If you want to know the details…
- The content filter now processes paragraphs of content, and also the sentences in those paragraphs, to weed out any sentences not relevant to that paragraph, e.g. person intros or generic site footer text.
- The content filter should be more language agnostic now and work better for non-English languages.
- The content filter used to be less selective. In this update it will throw out more content that it deems not relevant enough. If you are getting less content, you might need to scrape more URLs.
- The search engine matters. Sometimes Bing likes to throw in very irrelevant search results, and sometimes the SCM content filter is unable to filter this content out because it shares too many keywords with your main article keyword.
- If you take the time, you can always remove URL results that you deem not useful. Do this in ignore urls, e.g. ignoring some bad URL results so they are not scraped. Although this is not a scalable process, it can be the small tweak you need to get vastly better content from search engines.
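To give a rough idea of what filtering at both the paragraph and the sentence level can look like, here is a minimal sketch in Python. This is not the actual SCM filter code. The function names (`tokenize`, `split_sentences`, `filter_content`), the simple keyword-overlap scoring and the `min_overlap` parameter are all assumptions made up for illustration.

```python
import re


def tokenize(text):
    # \w is Unicode-aware in Python 3, so this is not tied to English.
    return set(re.findall(r"\w+", text.lower()))


def split_sentences(paragraph):
    # Naive split: break after ., ! or ? when followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", paragraph)
    return [p.strip() for p in parts if p.strip()]


def filter_content(text, keyword, min_overlap=1):
    """Keep sentences sharing at least min_overlap words with the keyword.

    Hypothetical scoring: plain word overlap with the article keyword.
    Paragraphs with no surviving sentences are dropped entirely.
    """
    keyword_tokens = tokenize(keyword)
    kept_paragraphs = []
    for paragraph in re.split(r"\n\s*\n", text):
        kept = [
            sentence
            for sentence in split_sentences(paragraph)
            if len(tokenize(sentence) & keyword_tokens) >= min_overlap
        ]
        if kept:
            kept_paragraphs.append(" ".join(kept))
    return kept_paragraphs


scraped = (
    "Hi, I am Bob and I love writing. "
    "Drip irrigation saves water in a garden.\n\n"
    "Copyright 2024. All rights reserved."
)
print(filter_content(scraped, "drip irrigation garden"))
# ['Drip irrigation saves water in a garden.']
```

Raising `min_overlap` makes the sketch more selective, which mirrors the trade-off above: more junk is thrown out, but you may need to scrape more URLs to end up with the same amount of content. It also shows the weak spot: an irrelevant result that happens to share words with your keyword would still get through.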
AND
If you have experience with transformer-based small LMs and you have a feature request, let me know.
From here on out it is possible to do things such as…
- Text classification (spam filtering, toxicity detection, sentiment)
- Topic detection & tagging
- Intent recognition (chatbots, virtual assistants)
Because we can install and package small LMs into the SCM app directly!

