How to filter items in the Dynamic scraper

Finding and selecting the right bit of content from a web page can be difficult.

The Dynamic scraper has a bunch of filtering options to help you massage data into the correct format.

Filter button

Once you have added any element for selection, the filter button appears

  • It shows how many items match
  • Clicking it opens the filter modal

The filter modal has 2 sections:

  • Filtering items
  • Live display

Wrap lines

This adds content to the start and end of lines

Useful for wrapping text in tags

This example illustrates an important point: all filters work per line.
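To make the per-line behaviour concrete, here is a rough Python sketch of what a wrap filter does. It is not the scraper's own code, and the function name `wrap_lines` and its parameters are made up for illustration.

```python
def wrap_lines(text: str, prefix: str, suffix: str) -> str:
    """Add prefix and suffix to every line (hypothetical helper, not the scraper's code)."""
    # Each line is handled on its own, which is the "per line" point above.
    return "\n".join(f"{prefix}{line}{suffix}" for line in text.splitlines())


if __name__ == "__main__":
    print(wrap_lines("First headline\nSecond headline", "<li>", "</li>"))
```

Each headline comes out wrapped in its own pair of tags, rather than the whole selection being wrapped once.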

Min length

You can remove lines that are too short

It's possible to remove the short lines and leave just the news title

Min length 15 characters
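A minimal sketch of the same idea in Python, assuming that a line exactly at the minimum is kept (I have not checked how the scraper treats that edge case). The name `min_length` and the sample text are hypothetical.

```python
def min_length(text: str, minimum: int) -> str:
    """Keep only lines with at least `minimum` characters (illustrative sketch)."""
    # Assumption: the threshold is inclusive, so a line of exactly `minimum` chars stays.
    return "\n".join(line for line in text.splitlines() if len(line) >= minimum)


if __name__ == "__main__":
    sample = "News\n2 hours ago\nScientists discover new species of frog"
    print(min_length(sample, 15))
```

With a minimum of 15, the short label and timestamp lines drop out and only the headline is left.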

Line ignore + Line keep

We can ignore or keep lines based on the words they contain

Good for repetitive content that is easy to remove

We can remove lines using words like ‘ago’ and ‘News’ instead of using a length filter
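Here is a small Python sketch of word-based ignore and keep filters. It assumes plain, case-sensitive substring matching, which may not be exactly how the scraper matches words. The function names are hypothetical.

```python
from typing import List


def ignore_lines(text: str, words: List[str]) -> str:
    """Drop every line that contains any of the given words (illustrative sketch)."""
    # Assumption: case-sensitive substring match.
    return "\n".join(
        line for line in text.splitlines()
        if not any(word in line for word in words)
    )


def keep_lines(text: str, words: List[str]) -> str:
    """Keep only lines that contain at least one of the given words."""
    return "\n".join(
        line for line in text.splitlines()
        if any(word in line for word in words)
    )


if __name__ == "__main__":
    sample = "News\n2 hours ago\nScientists discover new species of frog"
    print(ignore_lines(sample, ["ago", "News"]))
```

This gives the same result as the length filter on this sample, but it does not depend on the headline being long.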

Line regex replace

This is where we can remove items, replace words and really clean up information

We can use regex to match lines and remove content

The match is case sensitive

I was able to remove the first line entirely

To do a replacement, I add => followed by the replacement text

Because this is regex, I can do special things like matching and replacing only numbers

I match numbers and replace them with ‘xxx’
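As a rough illustration, the sketch below mimics a `pattern=>replacement` rule with Python's `re` module. Several things are my assumptions rather than confirmed behaviour: that a rule without `=>` replaces matches with nothing, that lines left empty are dropped, and that the scraper's regex flavour behaves like Python's for simple patterns. The name `apply_rule` is hypothetical.

```python
import re


def apply_rule(text: str, rule: str) -> str:
    """Apply one 'pattern=>replacement' rule to every line (illustrative sketch)."""
    # Split on the first '=>'; with no '=>' the replacement is an empty string.
    pattern, _sep, replacement = rule.partition("=>")
    regex = re.compile(pattern)  # case sensitive by default
    result = []
    for line in text.splitlines():
        new_line = regex.sub(replacement, line)
        # Assumption: a line that ends up empty is removed from the output.
        if new_line:
            result.append(new_line)
    return "\n".join(result)


if __name__ == "__main__":
    sample = "News\nPosted 12 hours ago\nPrice 300"
    print(apply_rule(sample, r"^News$"))    # removes the first line
    print(apply_rule(sample, r"\d+=>xxx"))  # replaces every number with xxx
```

Because the match is case sensitive, a rule of `^News$` leaves a line reading `news` untouched.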

For more help with regex, there is a link on the info tooltip

The link takes you to https://regex101.com/

A brilliant site for testing regex

Global filter

The filtering section at the bottom of the task runs for all selections

The order is:

  1. Run rule filters
  2. Run global filters

This is important: it means you can select your elements, filter them on an individual level, then clean them up with a global filter.

For example, select items and then unwrap tags

Lots of selected items are inside HTML tags

I can unwrap div tags, and remove img tags…
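The ordering can be sketched in Python like this. It is only a model of the two-stage order described above, not the scraper's implementation. The names `run_task`, `unwrap_div` and `remove_img` are hypothetical, and the regex tag handling is deliberately simplistic (it only matches lowercase tags).

```python
import re
from typing import Callable, Dict, List

Filter = Callable[[str], str]


def unwrap_div(text: str) -> str:
    """Remove opening and closing div tags but keep what is inside them."""
    return re.sub(r"</?div\b[^>]*>", "", text)


def remove_img(text: str) -> str:
    """Remove img tags entirely."""
    return re.sub(r"<img\b[^>]*>", "", text)


def run_task(
    selections: Dict[str, str],
    rule_filters: Dict[str, List[Filter]],
    global_filters: List[Filter],
) -> Dict[str, str]:
    """Run each selection's own filters first, then the global filters."""
    results = {}
    for name, text in selections.items():
        for rule_filter in rule_filters.get(name, []):  # 1. rule filters
            text = rule_filter(text)
        for global_filter in global_filters:            # 2. global filters
            text = global_filter(text)
        results[name] = text
    return results


if __name__ == "__main__":
    selections = {
        "title": "<div>Breaking: Big story</div>",
        "body": '<div><img src="a.png">Text</div>',
    }
    rules = {"title": [lambda t: t.replace("Breaking: ", "")]}
    print(run_task(selections, rules, [unwrap_div, remove_img]))
```

Only the title gets its own rule filter here, but both selections pass through the same global clean-up at the end.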

Summary

  • Filtering is done on a per-line basis
  • You can filter by line length, words and regex
  • A global filter runs at the end and can be used to clean up data

Amazing updates! :rocket:
