Help Center

Simple Chaining: getting data from list and detail pages

Last Updated: Mar 09, 2017 03:44PM PST
 

"Set up a daily extraction of all products from a single category so I can monitor price trends over time."

 

Introduction

 

In this tutorial we will show you how to get data from Discogs.com.

 

One of the biggest challenges when extracting data from websites gets phrased in many different ways:

 
  • How do I get data from a click away?
  • How do I drill down to get the detail from another page?
  • I want the data from the list page AND data from each of the detail pages
  • My data is on two levels
  • ... and so on...
 

Whilst there is no single way of describing the issue, there is now one way to solve it.

 

We call it “chaining” and in this tutorial we will walk through a simple case of chaining two extractors to get the results we need from a well-known retailer.  



 

Step 1: Getting a list of products

 

Let’s build a List Extractor.

 

Here is a page that has a list of products I am interested in:

 

 

The URL for this page is:

 

https://www.discogs.com/search/?style_exact=House&style_exact=Disco

 

Let’s create an Extractor for this page:

   

Tip: as I am going to get most of the data I want from the next “level”, I am going to configure the Extractor to return only 1 column of data: the link to the page containing the detail.

 

Here is my Discogs List Extractor https://dash.import.io/3a1cde1e-3c11-4052-bbca-076ff05d32ac/settings

 



Step 2: get a full list of products

 

There are many many pages of products available... we want more than just the first!


Let’s look at how the URL changes when we go to page 2:
 

https://www.discogs.com/search/?style_exact=House&style_exact=Disco&page=2

 

There is a page=n parameter in the URL which means we will be able to generate a list of URLs that will return us as many pages of results as we need using the URL Generator.

 

Here’s how:

 

You now have an editable parameter in the URL!

 

 

I’m going to show you how to get 1,000 items, but you can apply the same principle to get more than that.

 

Let’s do it!

 
  • Next to “Range of numbers” change the numbers to go from 2 to 20 (leave step = 1)
  • Now click “Add to List”
 


You should now have 20 URLs in your list: the original page 1 URL plus the 19 generated URLs for pages 2 to 20.
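If you want to check the list by hand, here is a minimal Python sketch of what the number range produces. It is not part of Import.io: the function name build_page_urls is made up for illustration, and I am assuming (from the two URLs shown above) that page 1 carries no page parameter.

```python
from typing import List
from urllib.parse import urlencode

BASE_URL = "https://www.discogs.com/search/"


def build_page_urls(first_page: int, last_page: int) -> List[str]:
    """Return one search URL per results page (hypothetical helper)."""
    urls = []
    for page in range(first_page, last_page + 1):
        # The style filter is repeated, exactly as in the tutorial's URL.
        params = [("style_exact", "House"), ("style_exact", "Disco")]
        # Assumption: page 1 is the plain URL, later pages add page=n.
        if page > 1:
            params.append(("page", page))
        urls.append(f"{BASE_URL}?{urlencode(params)}")
    return urls


urls = build_page_urls(1, 20)
print(len(urls))
print(urls[0])
print(urls[-1])
```

The first URL printed should match the page 1 URL from Step 1, and the last should end in page=20.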


 

Click “Save”

 
 

Step 3: Run the List Extractor

 

Now click “Run URLs” and let Import.io run the Extractor for all 20 URLs.

 


 

Step 4: Create a Detail Extractor

 
So we have a list of 1,000 products. Or rather, we have a list of links that take us to pages like this:
 

 

...where the data we want is shown.

 

We will call this the “Detail page”

 

Let’s build an Extractor to get the data from the Detail pages like this one:

 

https://www.discogs.com/Lazydisco-More-Tigers-/release/8763230

 

Here’s how:


 

 
  • Add columns and train them using point and click. I am adding the following:
    • Artist
    • Album
    • Label
    • Format
    • Country
    • Genre
    • Style

 

Tip: you can use the “Add URLs” feature to test this Extractor on other URLs.

 
  • Click Done
 

Here’s my Discogs Detail Extractor should you want to use it.

https://dash.import.io/46b86a5e-57ee-4aeb-839a-a701ba56b426

 



Step 5: Configure the Detail Extractor

 

Now we are going to link the List and Detail Extractors together so that the output of the List Extractor (1,000 links to products) becomes the input of the Detail Extractor.

 

To configure this follow these steps:

 
  • Where it says “Extract from multiple URLs”, choose the option “URLs from another Extractor”




 

  • Search for your List Extractor




 

  • Select the List Extractor
  • Select the column that contains the URLs to use as the inputs



 





Step 6: Check the List Extractor has finished running

 

This is what happens when you run the Detail Extractor:

  • Import.io will go to the List Extractor and pull the latest run of data
  • We will then strip out the column specified (in this case the “searchresults link”)
  • We will add those as the input URLs for the Detail Extractor
  • The Detail Extractor will start its run
 

What this means is that you need to allow the List Extractor to finish its run before you start the Detail Extractor run.
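To make the hand-off concrete, here is a small Python sketch of the same idea. It only illustrates the logic described above and is not Import.io's implementation or API: the names chain and fake_detail_extractor are hypothetical, the detail extractor is a stand-in that does not fetch any page, and skipping empty cells is my own assumption.

```python
from typing import Callable, Dict, List

Row = Dict[str, str]


def chain(list_rows: List[Row], url_column: str,
          detail_extractor: Callable[[str], Row]) -> List[Row]:
    """Feed one column of the latest list run into a detail extractor."""
    # Pull out the chosen column; empty cells are skipped (assumption).
    input_urls = [row.get(url_column, "").strip() for row in list_rows]
    input_urls = [url for url in input_urls if url]
    # Run the detail extractor once per input URL.
    return [detail_extractor(url) for url in input_urls]


def fake_detail_extractor(url: str) -> Row:
    # Stand-in for a real extraction: it only parses the URL itself.
    return {"url": url, "release_id": url.rsplit("/", 1)[-1]}


# Stand-in for the latest finished run of the List Extractor.
latest_list_run = [
    {"searchresults link": "https://www.discogs.com/Lazydisco-More-Tigers-/release/8763230"},
    {"searchresults link": ""},
]

detail_rows = chain(latest_list_run, "searchresults link", fake_detail_extractor)
print(detail_rows)
```

Note that chain only sees whatever rows are in latest_list_run when it is called, which is the reason the List Extractor has to finish first.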

 

Check now that the List Extractor has completed its run.

 

Mine completed in just under 20 seconds and returned 1,000 rows.

 

 

We can now move to the final step...

 
 

Step 7: Run the Detail Extractor

 

Return now to your Detail Extractor and click “Run URLs”.

 

There are 1,000 inputs to process so this may take a few minutes.

 

Once you are done you can download the data in CSV or JSON using the “Download” button.
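Once you have the CSV, a few lines of Python are enough for a first look at it. The sketch below assumes the file's header row uses the column names from Step 4; the rows in SAMPLE_CSV are made-up placeholders, not real Discogs data, and count_by_column is a hypothetical helper.

```python
import csv
import io
from collections import Counter

# Placeholder rows standing in for a downloaded CSV (not real data).
SAMPLE_CSV = """Artist,Album,Label,Format,Country,Genre,Style
Artist A,Album One,Label X,Vinyl,UK,Electronic,House
Artist B,Album Two,Label X,Vinyl,US,Electronic,Disco
Artist C,Album Three,Label Y,CD,UK,Electronic,House
"""


def count_by_column(csv_file, column: str) -> Counter:
    """Count how many rows share each value in the given column."""
    reader = csv.DictReader(csv_file)
    return Counter(row[column] for row in reader)


label_counts = count_by_column(io.StringIO(SAMPLE_CSV), "Label")
print(label_counts.most_common())
```

For a real download, replace io.StringIO(SAMPLE_CSV) with open("your_file.csv", newline="", encoding="utf-8"), using whatever name you saved the file under.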

 

 

Enjoy!

 

    





 