How to scrape images or videos from a website
Overview
The only tool you need is a browser. We will be using the Firefox browser in this guide.
Simple Steps
1. Find a website you want to scrape
Ideally, find one that has a lot of images, like Scrolller.
2. Scroll down to the bottom of the page
You want to scroll down to the bottom of the page to make sure the website has loaded all the content.
If the website uses infinite scroll, just stop scrolling when you feel you have enough content. Otherwise you will basically be scrolling to infinity.
3. Copy the inner HTML of the body tag
You want to right-click anywhere on the page and select Inspect. From there, scroll to the top until you see the opening <body> tag. Right-click on it, navigate to Copy and then Inner HTML.
In case you cannot right-click on the webpage, you can also open the developer tools by navigating to the top right and clicking on the hamburger icon > More tools > Web Developer Tools, or with the hotkeys Ctrl+Shift+I or F12, and then navigating to the Inspector tab.
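Alternatively, if your browser's Console exposes the copy() helper (Firefox does), you can grab the same markup with a one-liner in the Console tab instead of right-clicking in the Inspector:
// Run in the Console tab: puts the inner HTML of <body> on the clipboard.
copy(document.body.innerHTML)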
4. Paste the code into textcompare
Paste the copied HTML into the Input field of the textcompare regex matcher.
5. Extract the URLs
Next, set the regular expression to URL and click on Show Output to get the results.
6. Filter out the unwanted links
Go through the list you have just generated and delete the links that don't point to media files.
For example anything that doesn't end with:
.png
.jpg
.mp4
.gif
etc.
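If this manual pruning feels tedious, the same filtering can also be done in the browser Console. A minimal sketch, assuming the generated list sits in a string variable called list (a hypothetical name):
// Keep only lines ending in a known media extension.
const media = list.split('\n').filter(line => /\.(png|jpe?g|mp4|gif)$/i.test(line));
console.log(media.join('\n'));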
Intermediate Steps
Personally, the 6th step is too much grunt work to get the final list. Thus we will improve upon the 5th step.
5.1 Use the full power of RegEx
After generating the list in the 6th step, you hopefully noticed that the media file URLs follow a pattern. We will use this pattern to our advantage.
Right before the /gmi, we will add the file type extension.
By default, when we select the URL regular expression, we get
/(https?:\/\/)?(www\.)?[-a-zA-Z0-9@:%._\+~#=]{2,256}\.[a-z]{2,6}\b([-a-zA-Z0-9@:%_\+.~#?&//=]*)/gmi
If we want to filter for JPGs, we will add jpg to get
/(https?:\/\/)?(www\.)?[-a-zA-Z0-9@:%._\+~#=]{2,256}\.[a-z]{2,6}\b([-a-zA-Z0-9@:%_\+.~#?&//=]*)jpg/gmi
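To check the modified expression without leaving the browser, the same pattern can be run in the Console against the copied markup. A sketch, assuming the inner HTML from step 3 is stored in a string called html (a hypothetical name):
// Apply the URL regex, extended with the jpg suffix, to the raw HTML.
const jpgUrls = html.match(/(https?:\/\/)?(www\.)?[-a-zA-Z0-9@:%._\+~#=]{2,256}\.[a-z]{2,6}\b([-a-zA-Z0-9@:%_\+.~#?&//=]*)jpg/gmi);
console.log(jpgUrls);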
5.2 Swap out words that almost follow the pattern
We notice that some URLs are quite similar. To remove this redundant data, we want to use the Find and Replace tool (make sure you have clicked Copy to Clipboard before continuing).
This might take a couple of iterations to turn the near-duplicate pairs into identical duplicates.
So we will first find the part that differs and replace it with something that brings the entries closer to being exact duplicates.
Make sure to click on Show Output and copy-paste this output as your new input.
Example: Scrolller
The first thing you hopefully noticed is that lots of the image URLs contain 1080 once or twice. Thus, we will remove this redundancy.
In the Search field we will add
/-(\d*x1080|1080x\d*)\.jpg/gmi
and make sure the Find using Regular Expression box is checked. This will look for all URLs that contain one of the patterns
1. -<number>x1080.jpg
2. -1080x<number>.jpg
We will keep the Replace with field blank, since we want to create a pattern that is NOT equivalent to that one mentioned in step 5.1.
Click Show Output to get the result.
In this case we will run step 5.1 again before continuing to step 5.3.
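The same normalization can also be done in the Console with String.prototype.replace. A sketch, again assuming the current list is in a hypothetical string variable called list:
// Strip the "-<number>x1080.jpg" / "-1080x<number>.jpg" suffixes, mirroring the Search field above.
const normalized = list.replace(/-(\d*x1080|1080x\d*)\.jpg/gmi, '');
console.log(normalized);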
5.3 Remove duplicates
Now if we go through the list, we have a lot of duplicates. Next we will use the Remove Duplicate Lines Tool.
Click on the Show Output button to get your final result.
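If you prefer the Console over the website for this step too, deduplication is a one-liner with a Set. A sketch, assuming the list is again in a hypothetical string variable called list:
// Remove duplicate lines and copy the final result to the clipboard.
const unique = [...new Set(list.split('\n'))].join('\n');
copy(unique); // devtools Console helper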
Advanced Steps
There are a lot of things we can improve even further. First off the 2nd step is just too much work.
2.1 Use a function to do the work
Instead of manually scrolling down to the bottom of the page every couple of seconds, we will use a function to do this for us. Though before we get to that, one should approximately count how many seconds it takes for the loading to complete.
Go to the Console tab in the Developer Tools window. Here we will use the window.scrollTo function to do the work for us:
var stop = setInterval(() => window.scrollTo(0, document.body.scrollHeight), timeInMs)
To stop the function just use
clearInterval(stop)
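If you do not want to watch for the right moment to stop, one possible refinement (an assumption on my part, and it may stop early on sites that pause between chunks) is to stop automatically once the page height no longer grows:
// Scroll every 3 seconds (adjust to how long loading takes) and stop
// once the page height no longer changes between two scrolls.
var lastHeight = 0;
var stop = setInterval(() => {
  window.scrollTo(0, document.body.scrollHeight);
  if (document.body.scrollHeight === lastHeight) clearInterval(stop);
  lastHeight = document.body.scrollHeight;
}, 3000);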
2.0.0 Using the network tab
Sometimes the website shares the response in a human readable format like JSON. We will use this to our advantage to skip the whole scrolling step.
2.0.1 Navigate to the network tab
In the Developer Tools window, navigate to the Network tab. You might be greeted with a message saying the page needs to be reloaded. Do so if needed.
2.0.2 Find the JSON file containing the data
Once the page has finished loading, we will need to find the correct JSON file.
If you click on XHR, this will ease the search somewhat.
Make sure to view each JSON file to check it is the correct one. You are looking for a URL that points to the media file location. Usually it will be the largest one of the bunch.
2.0.3 Continue or simplify even more
With the correct JSON file, one can continue to step 4 and copy-paste it into the Input field.
Though as you may have noticed, your JSON only holds a limited amount of data. The data comes in chunks to save on transfer size.
The issue with this is that to get the next chunk, you will have to scroll down until it comes through again.
Ideally you want to modify the request in such a fashion that you do not have to do this extra step.
2.0.4 How to modify the request
Here we will need a secondary tool to help create the HTTP request. We will be using Postman (no account is needed to use it).
Right-click the request and choose Copy > Copy as cURL. Next we want to Import this into Postman. It will then fill out the form for us.
In the request body, we want to see if we can get more data returned. Though first we want to check whether we get any data at all.
Next we notice that the limit is set to 50. We are going to change this to 100 and then test if we get more data. (Just because you change the limit, doesn't mean the server allows this change.)
It looks like it allows the limit to be 100. What if we change it to the maximum amount, which you hopefully noticed while looking for the correct JSON file? It seems like this is not working as expected.
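If you would rather stay in the browser than use Postman, the same request can be replayed from the Console with fetch. A minimal sketch; the URL, headers and the limit field below are hypothetical placeholders for whatever the Network tab actually shows:
// Hypothetical endpoint and body; copy the real values from the request in the Network tab.
fetch('https://example.com/api/media', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ limit: 100 }), // bump the limit and see whether the server accepts the change
})
  .then(r => r.json())
  .then(data => console.log(data));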
2.0.5 Finding the trigger to load more data
We won't be sifting through the source code to find the trigger and then tracking it to see which URL is used, etc., although this is a viable option if all other methods fail.
Instead we will see if the JSON file gives us any clues about which chunk to request next, and indeed there is a key-value pair that holds this info.
2.0.6 Using this new trigger point
Now that we know how the request references each further call, we can just copy-paste the response body, change the request body, and repeat until we have all the data we want.
2.0.7 Optional: Make a script
I won't go too much into detail, but one could make a script, especially if you are dealing with a lot of data chunks.
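As a rough sketch of such a script, runnable from the Console; the endpoint, the limit field and the nextPage / items / url keys are hypothetical stand-ins for whatever your JSON actually contains:
// Hypothetical paginated fetch loop; adapt the URL, the body fields and the key names
// to what you found in the Network tab and in the JSON response.
async function scrapeAll() {
  const urls = [];
  let page = null;
  do {
    const res = await fetch('https://example.com/api/media', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ limit: 100, nextPage: page }),
    });
    const data = await res.json();
    data.items.forEach(item => urls.push(item.url)); // assumed field names
    page = data.nextPage; // the trigger value found in step 2.0.5
  } while (page);
  copy(urls.join('\n')); // devtools Console helper: puts the list on the clipboard
}
scrapeAll();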
Considerations
Why not just download the website?
The issue is that most modern websites build the page on the user's side. If you check the source code (by right-clicking on the page and choosing View Page Source), you will notice that it doesn't align with what you saw in the Inspector tab of the Developer Tools window.
Why scrape the site in the first place?
Other than the obvious reasons of having a personal copy of the data or keeping it for historic purposes:
Lots of media sharing sites don't have a slideshow feature. So in the best case you can at least click a next button, or in the worst case, each file will redirect you to its own unique URL.
There is the option of creating your own scripts to run directly in the browser, though this only works in a pinch.
What to do with the scraped data?
First of all, if you are on Linux with wget, you can download all the files by just reading the scraped file:
wget -i scraped.txt
Be careful: this might give you faulty file names.
Another option is to watch a slideshow with VLC.
How to use VLC as a slideshow?
The simplest option is to just open the scraped.txt file.
Another option is to open multiple media files.
Remark: VLC won't play GIFs. To get around this issue, just convert the GIFs to e.g. MP4 with FFmpeg. You will have to download the files first with wget before converting them.