How do I refer to an entire scraped article from the search engine?

In Macros, how do I refer to an entire scraped article from a search engine?

I see you can pull %scraped_sentence% and %scraped_paragraph% … What if I need a larger context? (Which I do, in at least one use case.)

Is a %scraped_article% macro possible? While I don’t plan to copy entire articles, I need the AI to be able to understand the entire article for RAG.

There are only sentences; SCM doesn’t keep entire articles in storage.

You have access to grouped sentences.

If you use insert scraped content, you can select a bunch of sentences to be inserted at once.

The macro code to refer to it is %scraped_body%.

So you have two options:

  1. Prompt with a bunch of %scraped_paragraphs%
  2. Use insert scraped content, then refer to it using %scraped_body% in your prompt

Either option will give the AI extra context.
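For example, an option-2 prompt might look something like this (my own illustrative wording):

    Using the following scraped sentences as background, write a paragraph about the topic:

    %scraped_body%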

But it may be unrelated, right? Meaning each paragraph is randomly selected from different articles.

It’s hard to teach an AI from random thoughts, especially if the aims of the articles differ in scope.

The “advantages of home gardening” vs. “gardening tools from Home Depot”, etc.

We are going to end up with AI articles that seem to have random tangential thoughts. I already had one talk about a whole different topic due to a homonym.

I was hoping I could feed the AI article by article.

How do I train the AI with coherent data from the web? Am I missing something here?

Article by article may not work, actually, since in one article you could have both the “advantages of home gardening” and “gardening tools from Home Depot” topics.

I suggest you instead try %scraped_title% or %scraped_subheading%


If what you want is a longform article, you should be using the AI outline template.

There is also technically a max token limit for some AI models that won’t allow you to send in entire articles.

Another idea: give it an article you like, then tell it to rewrite the article to make it unique?

Basically, what I’m trying to do is train the AI with articles from the top X results … have the AI craft an outline based on what it learned WHILE FOCUSING ON THE KEYWORD and AI-WRITTEN TITLE … and then complete the article based on RAG (Retrieval-Augmented Generation) from the top X articles. The point here is to prevent AI hallucination and produce reasonably certain, fact-based articles. However, it’s not easy to accomplish this when paragraphs are jumbled by random selective inclusion and ordering. See the dilemma?

With the method you are proposing (creating the outlines, etc.), you are still subject to AI hallucination based on the model’s original training. This will always be a problem unless the model is fine-tuned when creating the output. That fine-tuning and humanizing of the writing style is also why the output takes so long and breaks the 180s timeout.

FYI, many LLMs now have 128k context windows, so there is plenty of room to add one article at a time (a 3,000-word article is only ~4k tokens), and that is only going to grow.

So the steps you want are:

  1. Send one entire article to the AI
  2. Send an instruction to the AI to create an outline based on that article

So something like:

[screenshot: example prompt]
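In text form, that prompt would be something like the following (hypothetical wording; the screenshot shows the actual prompt):

    Here is an article:

    %scraped_article%

    Create an outline for a new article based on the article above.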

Would this work?


Yes … if that would send the entire article body (no HTML or extra text) … and then proceed to the next article on the next call?

I would need to make some adaptations, but yes, it would give you the article text only.

Article text would be the body’s inner text, which includes menu items, etc., though.

We have a detect-article algo too, so we can figure out which one is best for you during testing.

Each extra call to the article macro would load the next article in the list.

So we can create new articles using existing ranking articles to keep hallucinations low and article content relevant to the keyword.

I think that would be helpful … yes. I know there is no “perfect” here. I just want to be able to train the model with coherent data.

OK, so I added a new macro: %scraped_article%


It loads a text version of a site, keeping only the content found in <p> tags.

This is important, as it removes most of the other fluff found in nav tags, etc.
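Conceptually, the approach is similar to this Python sketch (an illustration of the idea only, not SCM’s actual code; the library choices and function name are mine):

    # Illustrative only: fetch a page and keep just the text inside
    # <p> tags, which drops most nav menus, footers, and other fluff.
    import requests
    from bs4 import BeautifulSoup

    def extract_article_text(url: str) -> str:
        html = requests.get(url, timeout=30).text
        soup = BeautifulSoup(html, "html.parser")
        paragraphs = [p.get_text(" ", strip=True) for p in soup.find_all("p")]
        return "\n\n".join(p for p in paragraphs if p)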

There is now a new cache item being saved: articles.txt


You can locate the cache to view what is being saved.


Right now SCM saves both the article title and the body.


However, only the body is available as a macro; maybe we can add the title later.

Here is an example prompt we can now write:

[screenshot: example prompt]

The prompt generates an outline for the article.

Also useful is the fact that we can use the AI to ignore articles that are not relevant to the keyword.

If the article isn’t about our keyword, right now it can output that to a file:

[screenshot: output to file]
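The relevance check can live in the same prompt; something along these lines (illustrative wording only, with a literal example keyword):

    If the article below is not about “home gardening”, reply with exactly: NOT RELEVANT. Otherwise, create an outline based on the article.

    %scraped_article%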

We can maybe use this later to tell the SCM WP poster to ignore the article, etc.

The feature is ready for testing in the next update.

I would be interested in how you plan to use this macro.

Most likely you will need to actually add a user macro?


Hi…

Just want to chime in, since I’m going to use %scraped_article% for my use case.

I think %scraped_article% is very useful; its output can be compressed using an SPR prompt for LLM context (LCW vs. RAG: long context window wins in every benchmark).

The main use of SPR (Sparse Priming Representation) is token efficiency (without losing the original meaning, to avoid hallucination/forced creativity).

So instead of using the full article text as LLM context, use the SPR (a compressed version).
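For reference, an SPR compression prompt is just an instruction along these lines (my paraphrase of the idea, not the canonical wording):

    You are an SPR (Sparse Priming Representation) writer. Distill the following article into a short list of succinct statements and concepts from which another LLM could reconstruct the key ideas. Do not add commentary.

    %scraped_article%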

References

Can I use a global user macro for creating an SPR from %scraped_article%?

Not sure what you mean by this.

Can you elaborate a bit more?

Also, does SPR return text, or does it return vectors?

Text. Think of SPR as text summarization without the fluff.

So I can run SPR on text and feed the output into an OpenAI prompt? I checked the links above, but it wasn’t very clear.

Yep: %scraped_article% full text → OpenAI SPR prompt → the article creator/writer can then use the SPR(s) (compressed versions) instead of the full text.

Imagine a 15k-word context: it will compress ~20x and you still have the same quality of context.

You can use these SPR(s) for future tasks.
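A minimal sketch of that pipeline in Python, assuming the openai v1.x package (the model name, prompt wording, and function names are placeholders of mine):

    # Sketch only: compress each scraped article to an SPR, then feed
    # the SPRs (not the full text) to the writing prompt.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    SPR_INSTRUCTIONS = (
        "Distill the following article into a Sparse Priming "
        "Representation: succinct statements an LLM can rebuild from."
    )

    def compress_to_spr(article_text: str) -> str:
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model
            messages=[
                {"role": "system", "content": SPR_INSTRUCTIONS},
                {"role": "user", "content": article_text},
            ],
        )
        return resp.choices[0].message.content

    def write_from_sprs(keyword: str, sprs: list[str]) -> str:
        context = "\n\n".join(sprs)
        prompt = f"Write a factual article about '{keyword}' using only the context provided."
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model
            messages=[
                {"role": "system", "content": prompt},
                {"role": "user", "content": context},
            ],
        )
        return resp.choices[0].message.content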

From the example I saw, it uses an LLM to convert the text to an SPR.

Once you have the SPR, can it be used by the LLM to generate content?

Will need to dig deeper, as the example only shows how to compress and decompress text, not what to do with the SPR.

Yes, this is the purpose of SPR; it is deliberately written for LLMs, not humans.

Here’s an example of an SPR from an 8k-word essay.

Unpredictable Tech, Predictable Humans: Human desires drive tech evolution; tech itself is stochastic.
Core Human Desires: Safety, success, social connection (STC).
AI Predictions: Future developments in AI based on human needs, not tech trends.

Digital Assistants (DAs):

  • Central to personal AI ecosystem.
  • Will know everything about users (health, finances, preferences).
  • Users will trade privacy for functionality.

API Proliferation:

  • Everything will have an API (Daemon).
  • Businesses and individuals will broadcast capabilities.
  • DAs will interact with these APIs for user benefit.

Mediation by DAs:

  • DAs will manage interactions with numerous APIs.
  • Users will express needs; DAs will fulfill them.

Active Advocacy:

  • DAs will monitor and filter information to protect users.
  • Continuous defense against manipulation and threats.

Ecosystem of DA Modules:

  • Companies will create specialized modules for DAs.
  • Marketplace for modules will thrive, enhancing DA capabilities.

AR Interfaces:

  • Augmented reality will visualize DA data.
  • Users will see character, personality, and functionality of entities around them.

Role-Based DAs:

  • Multiple DAs for specific tasks, each with distinct personalities.
  • Collaboration among DAs for holistic support.

Security Concerns:

  • Risks of DA hacks and API vulnerabilities.
  • Potential for influence operations through DAs.

Societal Implications:

  • Impact on human relationships and privacy.
  • Possible polarization in communication styles due to DA presence.

Conclusion:

  • New AI ecosystem driven by human needs is inevitable.
  • Focus on safety, thriving, and connection will shape tech development.

SPR … sounds great. Basically, you’re asking the LLM to take an entire 3,000-word article and give me “just the facts, ma’am”. I’m going to do some testing on this and let you know how it works using his model.
