Scrape to single CSV instead of separate TXTs?

scrap3r · June 9, 2024, 8:38am

The static scraper only allows scraping website content to individual TXT files or one large TXT file. However, to further process the scraped data, I want it in a CSV/Google Sheet with one row for each URL, like column A = URL and column B = scraped content. How can this be achieved?

Tim · June 10, 2024, 12:33am

The static scraper doesn’t have CSV or Google spreadsheet support right now.

You could do it using the Dynamic scraper instead.

E.g.:

Selectors are:

Project sample

title body csv.zip (1.7 KB)

scrap3r · June 10, 2024, 7:52am

I did not think of that - thank you!

Tim · June 11, 2024, 7:07am

Let me know how it goes.

Also let me know whether the Google export works fine as well.

scrap3r · June 11, 2024, 10:16am

Initial feedback:

1. I use innerText with the body selector because “detect article” skips too much of the content I need. Unfortunately, the result is a broken CSV, I guess because the quotes, commas, and semicolons inside the scraped content produce new rows. I’m currently testing this with Google Sheets output, but since it also writes everything into a CSV first instead of an XLSX or something, I assume I’ll run into the same issue. Edit: yes, same issue. While the Google export itself works fine, I end up with far more rows than input URLs because, depending on the scraped content, it creates new rows.
2. I can’t figure out how to get the input URL into column A of the output file and the scraped content for that URL into column B. I need this so I can later merge the scraped data into existing sheets by matching on the URLs already in those sheets, so the scraped content lands in the correct rows.
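
For the merge step in point 2, here is a minimal Python sketch, assuming the export really is a two-column CSV (URL, then content) and the existing sheet has been loaded as a list of dicts with a `url` key. The function name `merge_scraped`, the `owner` column, and the sample data are made up for illustration; none of this is SCM functionality.

```python
import csv
import io

def merge_scraped(existing_rows, scraped_csv_text):
    """Return copies of existing_rows with a 'content' value matched by URL."""
    # csv.reader understands quoted fields, so commas, quotes and line
    # breaks inside the scraped text stay in a single cell.
    reader = csv.reader(io.StringIO(scraped_csv_text, newline=""))
    scraped = {row[0]: row[1] for row in reader if len(row) >= 2}
    # Rows whose URL was not scraped get an empty content cell.
    return [{**row, "content": scraped.get(row["url"], "")} for row in existing_rows]

# Hypothetical export: column A = URL, column B = scraped content.
scraped_csv = (
    'https://example.com/a,"Hello, ""world""\nsecond line"\r\n'
    'https://example.com/b,Plain text\r\n'
)

# Hypothetical existing sheet, in a different order than the export.
existing = [
    {"url": "https://example.com/b", "owner": "x"},
    {"url": "https://example.com/a", "owner": "y"},
    {"url": "https://example.com/c", "owner": "z"},
]

result = merge_scraped(existing, scraped_csv)
for row in result:
    print(row["url"], repr(row["content"]))
```

Matching on the URL instead of the row position means the order of the export does not matter, and URLs that were never scraped simply keep an empty cell.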

Tim · June 11, 2024, 3:18pm

Can you export the project task for me?

I will have a closer look into it.

scrap3r · June 11, 2024, 3:36pm

I’ve sent the export via DM.

P.S. Point 2 might be something you may have an answer to

scrap3r · June 19, 2024, 12:19pm

@Tim Any update regarding the broken output?

Tim · June 20, 2024, 6:52am

I added some CSV escaping.

All the text now appears inside " " and any quotes etc. are escaped as " in the text.

Re multi-line content: normally it's a setting you might have to enable to allow line breaks inside CSV columns.

AFAIK it should work fine in Google as it allows multiple lines in each cell.
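
To illustrate the multi-line point, here is a small Python sketch using the standard `csv` module (the URLs and text are made-up sample data). It shows why a correctly quoted file can have more physical lines than records: a line break inside a quoted cell is part of the cell, so a tool that splits on line breaks sees extra "rows" while a CSV-aware parser does not.

```python
import csv
import io

# Hypothetical scraped values containing a line break, a comma,
# quotes and a semicolon.
rows = [
    ["https://example.com/a", "First line\nSecond line, with a comma"],
    ["https://example.com/b", 'He said "hi"; then left'],
]

# Write with every field quoted; embedded quotes are doubled by the writer.
buffer = io.StringIO(newline="")
csv.writer(buffer, quoting=csv.QUOTE_ALL).writerows(rows)
csv_text = buffer.getvalue()

# Naive view: count physical lines in the file.
physical_lines = len(csv_text.splitlines())

# CSV-aware view: count parsed records.
records = list(csv.reader(io.StringIO(csv_text, newline="")))

print(physical_lines)  # 3
print(len(records))    # 2
```

If the record count from a parser like this matches the number of input URLs, the file itself is fine and the extra rows come from how it is being opened or imported.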

Can you re-run the project with the new update and let me know what has changed?

scrap3r · June 21, 2024, 3:33pm

@Tim still no luck unfortunately - the CSV is still “broken” whether I open it locally with (I tried various settings) or import it into Google Sheets - same results just that some quotation marks are now escaped.

Tim · June 21, 2024, 6:04pm

Can you send some screenshots of what the error looks like?

Tim · June 21, 2024, 6:05pm

Also, you should export to Google Sheets directly and not via CSV.

scrap3r · June 21, 2024, 7:32pm

@Tim I just DM’d you the CSV file that SCM created so you can analyze it. I’ll try direct Google Sheets export tomorrow and update you.

Tim · June 22, 2024, 6:43pm

I found an error in the CSV append-to-file code.

It was removing some lines, which caused the " formatting to break.

It's been fixed in the newest update.

Please test it
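
For anyone post-processing the output themselves, this is roughly what a safe append looks like in Python. It is only a sketch and not SCM's actual code: the helper names `append_row` and `read_rows` and the sample values are hypothetical. The point is that each record is written whole by a CSV writer, with no later line-based cleanup that could cut a quoted multi-line cell in half.

```python
import csv
import os
import tempfile

def append_row(path, url, content):
    """Append one (url, content) record without touching existing rows."""
    # newline="" lets the csv module control line endings, so line breaks
    # inside quoted cells are written unchanged.
    with open(path, "a", newline="", encoding="utf-8") as f:
        csv.writer(f, quoting=csv.QUOTE_ALL).writerow([url, content])

def read_rows(path):
    """Read the file back with a CSV-aware parser."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.reader(f))

# Hypothetical output file in a fresh temporary directory.
out_path = os.path.join(tempfile.mkdtemp(), "scraped.csv")

append_row(out_path, "https://example.com/a", 'Line one\nLine "two", with a comma')
append_row(out_path, "https://example.com/b", "")

rows_back = read_rows(out_path)
print(len(rows_back))  # 2
```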

scrap3r · June 23, 2024, 8:47am

@Tim This did not fix the bug unfortunately. I just DM’d you the CSV that the SCM version with “fix: Dynamic scraper, csv output append to file corrupting output” created.

Edit: Sorry, it’s fixed! I didn’t delete the old CSV and re-running the project didn’t overwrite it. So what I was looking at was the data from the previous run with the previous SCM version…
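
To avoid comparing against stale data like this, one option is to clear the old output before each test run. A minimal Python sketch, assuming the output lives at a path you control (the file name `scraped.csv` and the helper `reset_output` are made up for the example):

```python
import tempfile
from pathlib import Path

def reset_output(path: Path) -> bool:
    """Delete a leftover output file; return True if one was removed."""
    if path.exists():
        path.unlink()
        return True
    return False

# Hypothetical leftover file from a previous run.
out_file = Path(tempfile.mkdtemp()) / "scraped.csv"
out_file.write_text('"https://example.com/old","stale content"\r\n', encoding="utf-8")

removed_first = reset_output(out_file)   # a stale file was there
removed_second = reset_output(out_file)  # nothing left to remove
print(removed_first, removed_second)  # True False
```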

Tim · June 23, 2024, 9:07am

Lovely, all fixed, right?

scrap3r · June 23, 2024, 11:49am

@Tim I was wrong about being wrong! It’s not fixed for my particular project at least. DM’d you the file.

Tim · June 23, 2024, 5:48pm

I did a new update to change the " escape to a doubled quote (""), which worked better in my testing.
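
For reference, here is what that escaping looks like as a Python sketch, assuming the change means the usual CSV convention of writing each quote inside a quoted field as two quotes. The helpers `quote_field` and `to_csv_line` are written just for this example and are not SCM's code; the result is checked against Python's own `csv` parser.

```python
import csv
import io

def quote_field(value: str) -> str:
    """Wrap a value in quotes and double any quotes inside it."""
    return '"' + value.replace('"', '""') + '"'

def to_csv_line(fields):
    """Join quoted fields with commas and end the record with CRLF."""
    return ",".join(quote_field(f) for f in fields) + "\r\n"

# Hypothetical scraped row.
fields = ["https://example.com/a", 'She said "hello", then left']
line = to_csv_line(fields)

# A CSV-aware parser turns the doubled quotes back into single ones.
parsed = next(csv.reader(io.StringIO(line, newline="")))

print(line.strip())
print(parsed == fields)  # True
```

Doubled quotes are what most spreadsheet importers expect by default, which would explain why this variant behaves better than other escape styles.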

scrap3r · June 24, 2024, 11:09am

@Tim Fix confirmed! Thank you!!!