How To Get Crawl Run URLs

Last Updated: Mar 01, 2017 12:32PM PST
This article describes how to get the URLs from a crawl run (extractor run).

The examples below use the curl and jq commands. The same requests can be readily adapted to other programming languages such as Java, Python, or Perl.

To get the current information about an extractor, run the command below. EXTRACTOR_ID and IMPORT_IO_API_KEY are your extractor ID (from the URL in the desktop view) and the API key from your account information:

import.io $ curl -X GET -s "https://store.import.io/store/extractor/$EXTRACTOR_ID?_apikey=$IMPORT_IO_API_KEY" | jq .
{
  "_meta": {
    "timestamp": 1488398705971,
    "lastEditorGuid": "d1100850-863b-4e0f-9fa0-5fbcd44db427",
    "ownerGuid": "00a451ae-c38d-4752-a329-389b37cfc0aa",
    "creatorGuid": "00a451ae-c38d-4752-a329-389b37cfc0aa",
    "creationTimestamp": 1487031762391
  },
  "guid": "8560e178-e21d-4fea-b0e4-b65ea4320714",
  "name": "yelp.com-details",
  "fields": [
    {
      "id": "5b476a28-0056-4fcb-93fc-2068487b6700",
      "name": "name",
      "captureLink": false,
      "type": "TEXT"
    },
    {
      "id": "9c017ad7-ac85-453d-971e-c865ad386f16",
      "name": "reviews",
      "captureLink": false,
      "type": "AUTO"
    },
    {
      "id": "0010bd3a-5e02-463d-ab08-8f0bda4e92c3",
      "name": "cuisine",
      "captureLink": false,
      "type": "AUTO"
    }
  ],
  "latestConfigId": "9f56c8ae-5768-49b1-8fcf-697dc63db379",
  "training": "61edd0ba-6e80-4b34-b6fc-45c1730afce0",
  "urlList": "0c2e6446-923e-452f-9356-f68edc8347ff"
}
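
As noted above, the same request carries over to other languages. The following is a minimal Python sketch of the extractor lookup, using only the standard library. The endpoint and the urlList field come from the curl example; the helper names (extractor_url, url_list_id, fetch_text) are illustrative assumptions and not part of any Import.io client library.

```python
import json
import urllib.parse
import urllib.request

# Base URL taken from the curl example above.
STORE = "https://store.import.io/store"


def extractor_url(extractor_id, api_key):
    """Build the extractor metadata URL used by the curl command."""
    query = urllib.parse.urlencode({"_apikey": api_key})
    return f"{STORE}/extractor/{extractor_id}?{query}"


def url_list_id(extractor_json):
    """Return the urlList attachment ID from the extractor JSON text."""
    return json.loads(extractor_json)["urlList"]


def fetch_text(url):
    """GET a URL and return the body as text (performs a network request)."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8")
```

With real credentials, url_list_id(fetch_text(extractor_url(EXTRACTOR_ID, IMPORT_IO_API_KEY))) would be expected to return the same ID that appears in the urlList field of the output above.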

The URLs themselves are saved as a separate attachment, which is accessed using the ID in the extractor's urlList field as follows:

import.io $ curl -s -X GET -H 'Accept-Encoding: gzip' --compressed "https://store.import.io/store/extractor/$EXTRACTOR_ID/_attachment/urlList/0c2e6446-923e-452f-9356-f68edc8347ff?_apikey=$IMPORT_IO_API_KEY"

https://www.yelp.com/search?find_desc=tacos&find_loc=San+Jose%2C+CA&ns=1
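
The attachment request can be sketched in Python in the same way. This is a sketch under two assumptions: that the attachment body is plain text with one URL per line (as the output above suggests), and that the body may arrive gzip-compressed when the request asks for it, as the curl command does. The function names are hypothetical.

```python
import gzip
import urllib.parse

# Base URL taken from the curl example above.
STORE = "https://store.import.io/store"


def url_list_attachment_url(extractor_id, url_list_id, api_key):
    """Build the URL of a urlList attachment for an extractor."""
    query = urllib.parse.urlencode({"_apikey": api_key})
    return (
        f"{STORE}/extractor/{extractor_id}"
        f"/_attachment/urlList/{url_list_id}?{query}"
    )


def decode_body(raw, content_encoding=None):
    """Decode response bytes, decompressing first if they are gzip-encoded."""
    if content_encoding == "gzip":
        raw = gzip.decompress(raw)
    return raw.decode("utf-8")


def parse_url_list(body):
    """Split an attachment body into URLs, one per line, skipping blank lines."""
    return [line.strip() for line in body.splitlines() if line.strip()]
```

curl's --compressed flag handles decompression automatically; in this sketch decode_body would be given the response bytes and the value of the Content-Encoding response header.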

A list of the crawl runs (as shown in the Run History tab of the Import.io Dashboard) is returned by the following API call:

import.io $ curl -s "https://store.import.io/store/crawlrun/_search?_sort=_meta.creationTimestamp&_page=1&_perPage=30&extractorId=$EXTRACTOR_ID&_apikey=$IMPORT_IO_API_KEY" | jq .
{
  "took": 2,
  "timed_out": false,
  "hits": {
    "total": 3,
    "hits": [
      {
        "_type": "CrawlRun",
        "_id": "ab9bb66b-ab40-421b-a083-fc075ff9f24f",
        "_score": 0,
        "fields": {
          "_meta": {
            "timestamp": 1488398705946,
            "lastEditorGuid": "d1100850-863b-4e0f-9fa0-5fbcd44db427",
            "ownerGuid": "00a451ae-c38d-4752-a329-389b37cfc0aa",
            "creatorGuid": "00a451ae-c38d-4752-a329-389b37cfc0aa",
            "creationTimestamp": 1488398699884
          },
          "guid": "ab9bb66b-ab40-421b-a083-fc075ff9f24f",
          "runtimeConfigId": "9f56c8ae-5768-49b1-8fcf-697dc63db379",
          "extractorId": "8560e178-e21d-4fea-b0e4-b65ea4320714",
          "startedAt": 1488398701100,
          "stoppedAt": 1488398705945,
          "totalUrlCount": 1,
          "successUrlCount": 1,
          "failedUrlCount": 0,
          "rowCount": 10,
          "state": "FINISHED",
          "urlListId": "0c2e6446-923e-452f-9356-f68edc8347ff",
          "json": "470d9c46-7b71-4636-8dd0-50ac83539b16",
          "csv": "5529f964-deff-4257-9b25-db7e258f7465",
          "log": "6877cf09-ad47-44f2-94d9-5c63bbec01cc",
          "sample": "fa8612da-2cb5-4b80-8228-0c044c802407"
        }
      },
      {
        "_type": "CrawlRun",
        "_id": "2b9fb92f-2500-4c03-84b5-4683fe9fbb09",
        "_score": 0,
        "fields": {
          "_meta": {
            "timestamp": 1488397872333,
            "lastEditorGuid": "d1100850-863b-4e0f-9fa0-5fbcd44db427",
            "ownerGuid": "00a451ae-c38d-4752-a329-389b37cfc0aa",
            "creatorGuid": "00a451ae-c38d-4752-a329-389b37cfc0aa",
            "creationTimestamp": 1488397837002
          },
          "guid": "2b9fb92f-2500-4c03-84b5-4683fe9fbb09",
          "runtimeConfigId": "de177ec7-90e6-44af-8bc6-520447657a62",
          "extractorId": "8560e178-e21d-4fea-b0e4-b65ea4320714",
          "startedAt": 1488397837851,
          "stoppedAt": 1488397872332,
          "totalUrlCount": 11,
          "successUrlCount": 11,
          "failedUrlCount": 0,
          "rowCount": 220,
          "state": "FINISHED",
          "urlListId": "10aed23c-a9c4-4351-8095-5727a44a02a3",
          "json": "2de1e68c-4993-4940-87a9-5b76528087fd",
          "csv": "f55bf49c-b2d9-43fe-9bd8-7ec141936e28",
          "log": "6198f94e-c451-4306-b706-eafdba05be5b",
          "sample": "b3ee13b0-b716-432d-a4f8-bc2beca90188"
        }
      },
      {
        "_type": "CrawlRun",
        "_id": "2a6dc3e5-ddec-40c0-a2c4-9fbee86b1a90",
        "_score": 0,
        "fields": {
          "_meta": {
            "timestamp": 1487033608913,
            "lastEditorGuid": "d1100850-863b-4e0f-9fa0-5fbcd44db427",
            "ownerGuid": "00a451ae-c38d-4752-a329-389b37cfc0aa",
            "creatorGuid": "00a451ae-c38d-4752-a329-389b37cfc0aa",
            "creationTimestamp": 1487033441315
          },
          "guid": "2a6dc3e5-ddec-40c0-a2c4-9fbee86b1a90",
          "runtimeConfigId": "de177ec7-90e6-44af-8bc6-520447657a62",
          "extractorId": "8560e178-e21d-4fea-b0e4-b65ea4320714",
          "startedAt": 1487033441838,
          "stoppedAt": 1487033608905,
          "totalUrlCount": 38,
          "successUrlCount": 38,
          "failedUrlCount": 0,
          "rowCount": 748,
          "state": "FINISHED",
          "urlListId": "ce935349-98ff-4ae8-b458-127468df1b41",
          "json": "012b66b8-eb4d-45fc-a3ba-aa96d6c18f97",
          "csv": "f0f5e838-88ed-46b1-9dc0-53a9245d227f",
          "log": "1db579cc-8476-4dae-9226-a1aa06d48a12",
          "sample": "b9c87329-4966-4330-835d-9d31a53bcaec"
        }
      }
    ],
    "max_score": 0
  }
}
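
Each crawl run in this response carries its own urlListId. The following Python sketch builds the same search URL and reduces the response to the fields needed to look up each run's URLs. The query parameters and JSON field names are taken from the example above; the function names and the shape of the returned dictionaries are assumptions made for illustration.

```python
import json
import urllib.parse

# Base URL taken from the curl example above.
STORE = "https://store.import.io/store"


def crawl_run_search_url(extractor_id, api_key, page=1, per_page=30):
    """Build the crawl run search URL with the parameters from the example."""
    query = urllib.parse.urlencode({
        "_sort": "_meta.creationTimestamp",
        "_page": page,
        "_perPage": per_page,
        "extractorId": extractor_id,
        "_apikey": api_key,
    })
    return f"{STORE}/crawlrun/_search?{query}"


def summarize_runs(search_json):
    """Return guid, state, urlListId and totalUrlCount for each crawl run."""
    hits = json.loads(search_json)["hits"]["hits"]
    keys = ("guid", "state", "urlListId", "totalUrlCount")
    # Each hit keeps its crawl run attributes under the "fields" key.
    return [{key: hit["fields"][key] for key in keys} for hit in hits]
```

Applied to the response above, summarize_runs would return three entries in the same order as the hits, the last one carrying the urlListId ce935349-98ff-4ae8-b458-127468df1b41.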

The URLs for a specific crawl run (the last crawl run in the example above) are retrieved with the following command, which uses that run's urlListId (ce935349-98ff-4ae8-b458-127468df1b41) from the listing above:

import.io $ curl -s -X GET -H 'Accept-Encoding: gzip' --compressed "https://store.import.io/store/extractor/$EXTRACTOR_ID/_attachment/urlList/ce935349-98ff-4ae8-b458-127468df1b41?_apikey=$IMPORT_IO_API_KEY"
https://www.yelp.com/biz/hello-robin-seattle?start=0
https://www.yelp.com/biz/hello-robin-seattle?start=10
https://www.yelp.com/biz/hello-robin-seattle?start=20
https://www.yelp.com/biz/hello-robin-seattle?start=30
https://www.yelp.com/biz/hello-robin-seattle?start=40
https://www.yelp.com/biz/hello-robin-seattle?start=50
https://www.yelp.com/biz/hello-robin-seattle?start=60
https://www.yelp.com/biz/hello-robin-seattle?start=70
https://www.yelp.com/biz/hello-robin-seattle?start=80
https://www.yelp.com/biz/hello-robin-seattle?start=90
https://www.yelp.com/biz/hello-robin-seattle?start=100
https://www.yelp.com/biz/hello-robin-seattle?start=110
https://www.yelp.com/biz/hello-robin-seattle?start=120
https://www.yelp.com/biz/hello-robin-seattle?start=130
https://www.yelp.com/biz/hello-robin-seattle?start=140
https://www.yelp.com/biz/hello-robin-seattle?start=150
https://www.yelp.com/biz/hello-robin-seattle?start=160
https://www.yelp.com/biz/hello-robin-seattle?start=170
https://www.yelp.com/biz/hello-robin-seattle?start=180
https://www.yelp.com/biz/hello-robin-seattle?start=190
https://www.yelp.com/biz/hello-robin-seattle?start=200
https://www.yelp.com/biz/hello-robin-seattle?start=210
https://www.yelp.com/biz/hello-robin-seattle?start=220
https://www.yelp.com/biz/hello-robin-seattle?start=230
https://www.yelp.com/biz/hello-robin-seattle?start=240
https://www.yelp.com/biz/hello-robin-seattle?start=250
https://www.yelp.com/biz/hello-robin-seattle?start=260
https://www.yelp.com/biz/hello-robin-seattle?start=270
https://www.yelp.com/biz/hello-robin-seattle?start=280
https://www.yelp.com/biz/hello-robin-seattle?start=290
https://www.yelp.com/biz/hello-robin-seattle?start=300
https://www.yelp.com/biz/hello-robin-seattle?start=310
https://www.yelp.com/biz/hello-robin-seattle?start=320
https://www.yelp.com/biz/hello-robin-seattle?start=330
https://www.yelp.com/biz/hello-robin-seattle?start=340
https://www.yelp.com/biz/hello-robin-seattle?start=350
https://www.yelp.com/biz/hello-robin-seattle?start=360
https://www.yelp.com/biz/hello-robin-seattle?start=370 
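
The last step can be sketched in Python as well. The function below takes the fields of one crawl run, requests its URL list through a caller-supplied fetch function, and compares the number of URLs with the run's totalUrlCount. That the two numbers should agree is an assumption based on the examples here (38 URLs for a run with totalUrlCount 38), not documented behaviour. The names urls_for_run and fetch are hypothetical.

```python
# Base URL taken from the curl example above.
STORE = "https://store.import.io/store"


def urls_for_run(run_fields, extractor_id, api_key, fetch):
    """Return (urls, count_matches) for one crawl run.

    run_fields is the "fields" object of a crawl run from the search response.
    fetch is any callable that takes a URL and returns the body as text.
    """
    list_id = run_fields["urlListId"]
    url = (
        f"{STORE}/extractor/{extractor_id}"
        f"/_attachment/urlList/{list_id}?_apikey={api_key}"
    )
    # Assumes one URL per line, as in the attachment output shown above.
    urls = [line.strip() for line in fetch(url).splitlines() if line.strip()]
    return urls, len(urls) == run_fields["totalUrlCount"]
```

Passing fetch as a parameter keeps the sketch independent of any particular HTTP library, so it can be tried with a stub before it is pointed at the live API.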

 



 



 