Use n8n to automate a Google Scraper task

Thank you for doing this .. but I am having a little trouble understanding the video, as there is little explanation.

If I have a Google Scraper task set up .. how can n8n load the SCM Google Scraper task, update the keywords, run the task, and work with the data IN n8n .. so I can save it to a DB, then loop to the next set of keywords, etc.?

You need to do the following:

  1. Use an n8n node to update the task
  2. Add another node to run the task
  3. Wait for SCM to finish, using another node to poll the task status (not sure about this)
  4. Add a node to download the task output data (which currently can't be done)
  5. Save to your DB
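
The five steps above can be sketched as one polling loop. This is a hypothetical sketch with the SCM HTTP calls stubbed out — the method names (`update_keywords`, `run`, `status`, `results`) are stand-ins, since the actual endpoints are still being worked out:

```python
import time

def run_scraper_task(api, task_id, keywords, poll_interval=0.0):
    """Steps 1-5 as a loop: update keywords, run, poll status, fetch, save.

    `api` is any object with these methods; the names are hypothetical
    stand-ins for the SCM API, not its real endpoints.
    """
    api.update_keywords(task_id, keywords)    # 1. update the task
    api.run(task_id)                          # 2. run the task
    while api.status(task_id) != "complete":  # 3. poll until SCM finishes
        time.sleep(poll_interval)
    rows = api.results(task_id)               # 4. download the output
    api.save(rows)                            # 5. save to your DB
    return rows

# A stub API that reports "complete" on the second status check,
# just so the flow above is visible without a real SCM instance.
class StubApi:
    def __init__(self):
        self.checks = 0
        self.db = []
    def update_keywords(self, task_id, kws):
        self.kws = kws
    def run(self, task_id):
        pass
    def status(self, task_id):
        self.checks += 1
        return "complete" if self.checks >= 2 else "running"
    def results(self, task_id):
        return [{"keyword": k, "url": f"https://example.com/{k}"} for k in self.kws]
    def save(self, rows):
        self.db.extend(rows)
```

Looping over keyword sets would then just mean calling `run_scraper_task` once per batch.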

Not sure about the looping part, but let's get the above to work first.

Step 4 can't be done right now, so let me see what the best way is to return that task data to you.

I will then do a step-by-step write-up on how to build your example workflow.

That was the part I was having trouble with. The only way I could think of is to wait for the status to complete, then import the CSV or Google Sheet file into n8n. It would be so much nicer if SCM could just return the data array to n8n as JSON so we could use it directly in n8n .. CSV imports can lead to issues.

I think I will add a new call for task content that will read the files from the task and output them to JSON as name and data.

The data will be a dump of whatever is in the file, which for a URL scraper task should be a list of URLs.

The only other thing though is that each keyword outputs to a different folder.

So maybe JSON of dir structure with content listed out.
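
That nested JSON could look something like the structure below. The shape (a `files` map and a `dirs` map per folder) is a guess at what such an endpoint might return, not SCM's actual format:

```python
import os

def dir_to_json(path):
    """Turn an output folder into a nested dict: file name -> file content,
    subfolder name -> the same structure recursively. Hypothetical shape,
    not SCM's real response format."""
    node = {"files": {}, "dirs": {}}
    for entry in sorted(os.listdir(path)):
        full = os.path.join(path, entry)
        if os.path.isdir(full):
            node["dirs"][entry] = dir_to_json(full)  # recurse into keyword folders
        else:
            with open(full, encoding="utf-8") as f:
                node["files"][entry] = f.read()
    return node
```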

OK .. that might work, I think. It would be better than trying to import a CSV in n8n.

I'm thinking we use some type of FTP.

So SCM opens FTP access for all folders under content.

That way you can download content directly and also get a file listing.

Because it's FTP, you can also delete and manage output files.

The only disadvantage is that if you output files to your own custom directory, the FTP won't be able to access them.

Right now SCM defaults to outputting to its own content/taskid folder.

The problem right now is that SCM can output files to multiple folders, with an unknown count.

E.g. the article creator, where each keyword has its own folder.

FTP? Why FTP? :roll_eyes:

All we need to do is return a JSON response back to n8n with the data in the final task array, right? Unless SCM “forgets” each record when it saves to file.

I see you have 2 choices:

  1. Add an array that accumulates task results as strings in RAM, then return it as JSON to n8n.
  2. OR, upon task completion, load the data from disk and put it in a JSON string to return to n8n.
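
Option 1 could look something like this in-memory buffer — a minimal hypothetical sketch, not SCM's actual code:

```python
import json

class TaskResultBuffer:
    """Option 1: accumulate results in RAM as the task runs, then emit
    a single JSON payload for n8n when the task completes."""
    def __init__(self):
        self._rows = []

    def add(self, record: str):
        # Called once per scraped record, alongside the normal file save.
        self._rows.append(record)

    def to_json(self) -> str:
        # One JSON string n8n can consume directly, no CSV import needed.
        return json.dumps({"results": self._rows})
```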

n8n works extremely well with JSON formats. Not sure why you would want FTP at all.

I’m thinking you will probably go with #2 .. because then n8n doesn’t have to wait for the task to complete while processing other nodes.

In n8n ..

  1. Update Keywords
  2. Run Task
  3. Check Status
    a. Loop over items → Check Task Status → Run other Nodes → Next Loop Item.
    b. OR run the n8n workflow on a timed schedule … either one can work in n8n
  4. Upon completion, get the final task result data

#4 is what I think you are creating now .. so it would have to be a file load from the /content folder .. either a CSV file for the scraper .. or an array of TXT/HTML files for other types of jobs.

It will be an array of files.

However, the array can be as small as one CSV file for Google Maps, or it can be multiple folders of multiple files for something like the article creator.

I’m thinking about the case where you load 100 keywords and generate 10 articles each = 1,000 articles on the hard drive.

You need to be able to access each article individually from n8n.

Something like FTP, with file listing and selection, so you can download what you need.

I’m also thinking about the cloud version of n8n, where you don’t have access to files on the computer, so SCM needs to provide them.

In the current API, you must provide both the task ID and the keyword, and it uses both to return content.

Not sure how you set SCM up .. but here is the architecture I would recommend:

  1. Add another API endpoint to SCM that retrieves a result file and returns its contents as JSON, based on TaskID + Keyword (+ ArticleID if applicable) … THEN .. if it’s a CSV (like the scraper), break it down into rows/fields in the JSON so n8n can work directly with each data point.

  2. In the SCM n8n node .. let it query the new API endpoint for each Keyword/ArticleID from the task record to get the JSON data.
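
The CSV breakdown in point 1 is straightforward with a standard CSV parser. A minimal sketch, assuming the first CSV line is a header row (the column names here are illustrative, not SCM's actual scraper output):

```python
import csv, io, json

def csv_to_json_rows(csv_text: str) -> str:
    """Break a scraper CSV into one JSON object per row, keyed by the
    header fields, so n8n can work with each data point directly."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return json.dumps([dict(row) for row in reader])
```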

This solution is much cleaner and more reliable than using FTP.

Yes, I am thinking of something like what you are proposing.

Call task/content/taskid

which goes to the output folder and returns all the files in it.

If there are folders under it, you would call

task/content/taskid/folder1/folder2

to return the other files.

Right now the output path is generated dynamically at run time and not cached.

So I need to figure this part out to get the parent directory anyway.

Why would there be more nested folders?

I thought it was
task/content/taskid/Keyword/Article#

It’s because of macros.

I’m testing it so that if you call

task/content/683d8886e964bc93aa9dccb5/

It gets everything in that default parent folder

Then if you add

task/content/683d8886e964bc93aa9dccb5/folder1/folder2

It can go down and find the content in those folders.
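
Server-side, mapping those URL segments onto the task's output folder needs a traversal guard so `..` can't escape the task directory. A hypothetical sketch of that helper (not SCM's actual code):

```python
import os

def resolve_content_path(output_root: str, *segments: str) -> str:
    """Map the folder1/folder2 segments of a task/content URL onto the
    task's output folder, rejecting paths that escape the root."""
    candidate = os.path.realpath(os.path.join(output_root, *segments))
    root = os.path.realpath(output_root)
    # realpath collapses any ".." segments, so a prefix check is enough.
    if candidate != root and not candidate.startswith(root + os.sep):
        raise ValueError("path escapes task output folder")
    return candidate
```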

It would also be nice if it returned file/folder listings in the output JSON so you can navigate.

Yes, I forgot you gave users full flexibility on the naming convention.

So you would have to refer to the naming convention the user has saved .. then pull the data based on that. You should pull the actual file structure from the filesystem to confirm it and match it against the naming convention for reference.

Then you return a JSON string .. based on a similar folder structure. The JSON might get large, but n8n users should keep their tasks small and let n8n iterate through the bigger automation.

I’m thinking this:

On call, it returns the directory, filenames, and content.

The directory is showing elden and hello.

If you add a dir name to the URL request …

… which maps to this in the output folder.

Right now it will only return txt, csv, and html files.

Shouldn’t you nest the files IN the folders? JSON is nestable.
Although, that could make it a bit tougher for n8n beginners.

Ahhh … do URIs …

Instead of trying to duplicate the file structure in JSON … do it like a sitemap, with URIs:

File: "/folder1/folder2/filename.txt",
Keyword: "elden ring",
Content: "Elden Ring review … etc etc"

Makes it simpler for you and n8n beginners
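
A sketch of that sitemap-style flattening — one entry per file, with the folder path kept in the `File` URI. Treating the first folder name as the keyword is an assumption about how SCM lays out per-keyword folders, not confirmed behavior:

```python
import os

def flat_listing(output_root: str):
    """Flatten a task output tree into sitemap-style entries instead of
    nested JSON: {"File": "/kw/name.txt", "Keyword": ..., "Content": ...}."""
    entries = []
    for dirpath, _dirs, files in os.walk(output_root):
        for name in sorted(files):
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, output_root).replace(os.sep, "/")
            # Assumption: the first folder under the task root is the keyword.
            keyword = rel.split("/")[0] if "/" in rel else ""
            with open(full, encoding="utf-8") as f:
                entries.append({"File": "/" + rel,
                                "Keyword": keyword,
                                "Content": f.read()})
    return sorted(entries, key=lambda e: e["File"])
```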

But also … if it’s a CSV (like the scraper), then break down the rows and fields inside the JSON as well, so that n8n can work with the data directly … users won’t have to figure out how to import/convert a huge CSV string.

Avoid nesting, because it can potentially return hundreds of files and overload the server.

I can include URL links in the output, e.g.:

Which is good, because you can click on them in Postman.

I’ll add a special handler for CSV files.

E.g.:

The content is returned as data (instead of content), but it is a proper JSON response.

The n8n node has been updated.

You will need to update SCM as well to get the new API changes.
