Finding and selecting the right bit of content from a web page can be difficult.
The Dynamic scraper has a range of filtering options to help you massage the data into the correct format.
Filter button
Once you have added an element for selection, the filter button appears.
- It shows how many items match
- Clicking it opens the filter modal
The filter modal has two sections:
- Filtering items
- Live display
Wrap lines
This adds content to the start and end of each line.
Useful for wrapping text in tags.
This example illustrates an important point: all filters work per line.
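To make the per-line behaviour concrete, here is a minimal Python sketch of what a wrap filter does. It is not the Dynamic scraper's own code; the function name `wrap_lines` and the sample text are assumptions made up for illustration.

```python
def wrap_lines(text: str, prefix: str, suffix: str) -> str:
    """Add prefix and suffix to every line of text (hypothetical helper)."""
    # Each line is handled on its own, which mirrors the per-line rule.
    return "\n".join(f"{prefix}{line}{suffix}" for line in text.splitlines())


# Sample scraped text, invented for this example.
scraped = "First headline\nSecond headline"
wrapped = wrap_lines(scraped, "<li>", "</li>")
print(wrapped)
```

Every line gets its own opening and closing tag, rather than one pair of tags around the whole selection.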
Min length
You can remove lines that are too short.
It's possible to remove lines so that only the news title is left.
For example, set the min length to 15 characters.
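A rough Python sketch of a min length filter follows. The helper name `drop_short_lines`, the sample text, and the choice to ignore surrounding whitespace when measuring a line are all assumptions, not the tool's documented behaviour.

```python
def drop_short_lines(text: str, min_length: int) -> str:
    """Keep only lines with at least min_length characters (hypothetical helper)."""
    # Assumption: leading and trailing whitespace does not count towards length.
    return "\n".join(
        line for line in text.splitlines() if len(line.strip()) >= min_length
    )


# Sample scraped text, invented for this example.
scraped = "News\nScientists discover a new species of frog\n2 hours ago"
titles = drop_short_lines(scraped, 15)
print(titles)
```

With a min length of 15, the short "News" and "2 hours ago" lines are dropped and only the title remains.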
Line ignore + Line keep
We can ignore or keep lines based on the words they contain.
Good for repetitive content that is easy to remove.
We can remove lines using words like ‘ago’ and ‘News’ instead of using a length filter
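The idea can be sketched in Python as two small helpers. The names `ignore_lines` and `keep_lines` are hypothetical, and treating each word as a case-sensitive substring match is an assumption about how the filter behaves.

```python
def ignore_lines(text: str, words: list[str]) -> str:
    """Drop every line that contains any of the given words (hypothetical helper)."""
    return "\n".join(
        line for line in text.splitlines() if not any(w in line for w in words)
    )


def keep_lines(text: str, words: list[str]) -> str:
    """Keep only the lines that contain at least one of the given words."""
    return "\n".join(
        line for line in text.splitlines() if any(w in line for w in words)
    )


# Sample scraped text, invented for this example.
scraped = "News\nScientists discover a new species of frog\n2 hours ago"
title_only = ignore_lines(scraped, ["ago", "News"])
timestamp_only = keep_lines(scraped, ["ago"])
print(title_only)
print(timestamp_only)
```

Ignoring "ago" and "News" leaves just the title, the same result the length filter gave, but without depending on how long each line is.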
Line regex replace
This is where we can remove items, replace words and really clean up information
We can use regex to match lines and remove content
The match is case sensitive
I was able to remove the first line entirely
To do a replacement, I add => after the pattern, followed by the replacement text.
Because this is regex, I can do special things like matching and replacing only numbers.
I match numbers and replace them with ‘xxx’.
For more help with regex, there is a link on the info tooltip
The link takes you to https://regex101.com/, a brilliant site for testing regex.
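Here is a small Python sketch of how a line regex rule could be applied. It assumes a rule has the form `pattern => replacement`, that a rule without `=>` simply removes whatever it matches, and that lines left empty are dropped. The function name `apply_rule` and the sample text are made up, and Python's `re` module may differ in detail from the regex engine the scraper uses.

```python
import re


def apply_rule(text: str, rule: str) -> str:
    """Apply one 'pattern => replacement' rule to every line (hypothetical helper)."""
    pattern, _, replacement = rule.partition("=>")
    regex = re.compile(pattern.strip())  # case sensitive by default
    replacement = replacement.strip()    # empty when the rule has no '=>'
    result = []
    for line in text.splitlines():
        # The lambda makes the replacement text literal (no backreferences).
        new_line = regex.sub(lambda match: replacement, line)
        if new_line.strip():  # assumption: lines emptied by the rule are dropped
            result.append(new_line)
    return "\n".join(result)


# Sample scraped text, invented for this example.
scraped = "Top News today\nPosted 3 hours ago by user 42"
without_first = apply_rule(scraped, r"^Top News.*$")
masked = apply_rule(scraped, r"\d+ => xxx")
unchanged = apply_rule(scraped, r"^top news.*$")  # wrong case, so no match
print(without_first)
print(masked)
```

The first rule removes the first line entirely, the second replaces every run of digits with xxx, and the third changes nothing because the match is case sensitive.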
Global filter
The filtering section at the bottom of the task runs for all selections
The order is:
- Run rule filters
- Run global filters
This is important: it means you can select your elements, filter them individually, then clean them up with a global filter.
For example, select items and then unwrap tags.
Lots of selected items are inside HTML tags.
I can unwrap div tags and remove img tags…
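The ordering can be sketched in Python as a tiny pipeline: each selection runs its own rule filters first, then the global filters run on every selection. All names here (`run_task`, `unwrap_tag`, `remove_tag`) are hypothetical, and the regex-based tag handling is a deliberately naive stand-in for whatever the scraper does internally.

```python
import re
from typing import Callable

Filter = Callable[[str], str]


def unwrap_tag(text: str, tag: str) -> str:
    """Remove the opening and closing tags but keep the content inside them."""
    return re.sub(rf"</?{re.escape(tag)}\b[^>]*>", "", text)


def remove_tag(text: str, tag: str) -> str:
    """Remove a standalone tag such as <img ...> entirely."""
    return re.sub(rf"<{re.escape(tag)}\b[^>]*>", "", text)


def run_task(
    selections: list[str],
    rule_filters: list[list[Filter]],
    global_filters: list[Filter],
) -> list[str]:
    """Run each selection's rule filters, then the global filters."""
    results = []
    for text, filters in zip(selections, rule_filters):
        for rule_filter in filters:          # step 1: rule filters
            text = rule_filter(text)
        for global_filter in global_filters:  # step 2: global filters
            text = global_filter(text)
        results.append(text)
    return results


# Sample selections, invented for this example.
selections = [
    '<div class="title">Frog found</div>',
    '<div><img src="frog.png">A new species</div>',
]
rule_filters = [[lambda t: t.replace("found", "discovered")], []]
global_filters = [lambda t: unwrap_tag(t, "div"), lambda t: remove_tag(t, "img")]
cleaned = run_task(selections, rule_filters, global_filters)
print(cleaned)
```

Only the first selection has a rule filter, but both selections pass through the same global clean-up at the end.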
Summary
- Filtering is done on a per-line basis
- You can filter by line length, words and regex
- A global filter runs at the end and can be used to clean up data