Amazon web crawler


Getting started with Amazon Kendra web crawler (Console)

You can use the Amazon Kendra console to get started using the Amazon Kendra web crawler. When you use the console, you specify the connection information you need to index the contents of the webpages crawled using the web crawler. For more information, see Using a web crawler data source.

If you want to use a web proxy server to connect to and crawl websites, you need to provide the website host name and port number. Web proxy credentials (stored in a secret in AWS Secrets Manager) are optional and you can use them to connect to a web proxy server that requires basic authentication.

If you want to use basic authentication of user name and password to access and crawl websites, you need to provide the website host name, port number, and your secret in AWS Secrets Manager that stores your authentication credentials.

The following procedure assumes that you created an index following step 1 of Getting started with an S3 bucket (Console).

To create the Amazon Kendra web crawler as a data source connector (console)

  1. Sign into the AWS Management Console and then open the Amazon Kendra console at https://console.aws.amazon.com/kendra/home.

  2. From the list of indexes, choose the index that you want to add the data source to.

  3. Choose Add data sources.

  4. From the list of data source connectors, choose WebCrawler.

  5. On the Specify data source details page, do the following:

    1. In the Name data source section, give your data source a name and optionally a description.

    2. (Optional) In the Tags section, add tags to categorize your data source.

    3. Choose Next.

  6. On the Define access and security page, do the following:

    1. In the Source section, do one of the following:

      • Choose Source URLs to enter the seed URLs of the website domains you want to crawl. Enter the seed or starting point URL, and then choose Add new URL. You can also add website subdomains. You can add up to ten seed URLs.

      • Choose Source Sitemaps to enter the website sitemap URLs you want to crawl. A sitemap includes all relevant webpages or website domains you want to crawl. Enter the sitemap URL, and then choose Add new URL. You can add up to three sitemap URLs.

      Note

      You can only crawl websites that use the secure communication protocol, Hypertext Transfer Protocol Secure (HTTPS). If you receive a validation exception error when trying to crawl a website, it could be due to the website being blocked from crawling.

    2. (Optional) In the Web proxy section, do the following:

      1. In Host name and Port number, enter the website host name and port number. For example, the host name of https://a.example.com/page1.html is "a.example.com" and the port is 443, the standard port for HTTPS.

      2. Under Web proxy credentials, to use web proxy credentials to connect to a web proxy server that requires basic authentication, choose AWS Secrets Manager secret. For more information, see AWS Secrets Manager.

    3. (Optional) In the Hosts with authentication section, to connect to websites that require user authentication, choose Add additional host with authentication.

    4. In the IAM role section, in IAM role, choose an existing role that grants Amazon Kendra permission to access your web crawler resources such as your index. For more information about the required permissions, see IAM access roles for Amazon Kendra.

    5. Choose Next.

  7. On the Configure sync settings page, do the following:

    1. If you selected Source URLs in step 6a, in Crawl range, do one of the following:

      • Keep Crawl host domains only.

      • To include subdomains, choose Crawl host domains and their subdomains only.
      • To include subdomains and other domains the webpages link to, choose Crawl everything.

    2. In Crawl depth, set the depth to the number of levels in a website from the seed level that you want to crawl. For example, if a website has three levels – index level or seed level in this example, sections level, and subsections level – and you are only interested in crawling information up to the sections level (levels 0 to 1), set your depth to 1.

    3. Choose Advanced crawl settings to set the maximum size (in MB) of a webpage to crawl, the maximum number of URLs on a single webpage to also crawl, and the maximum number of URLs crawled per website host per minute.

    4. Choose Additional configuration to use regular expression patterns to include or exclude certain URLs to crawl.

    5. In the Sync schedule section, for Frequency, choose the frequency to sync your index with your web crawler data source. You can sync hourly, daily, weekly, monthly, run on demand, or you can choose your own custom sync schedule.

    6. Choose Next.

  8. On the Review and Create page, review the details of your web crawler data source. To make changes, choose the Edit button next to the item that you want to change. When you are done, choose Add data source to add your web crawler data source.

After you choose Add data source, Amazon Kendra starts web crawling. It can take several minutes to a few hours for the web crawling to complete, depending on the number and size of the webpages to crawl. When it is finished, the status changes from Creating to Active.

Amazon Kendra syncs the index with web crawler in accordance with the sync schedule you set. If you choose Sync now to start the sync process immediately, it can take several minutes to a few hours to synchronize, depending on the number and size of the documents.

When selecting websites to index, you must adhere to the Amazon Acceptable Use Policy and all other Amazon terms. Remember that you must only use the Amazon Kendra web crawler to index your own webpages, or webpages that you have authorization to index. To learn how to stop the Amazon Kendra web crawler from indexing your website(s), please see Stopping Amazon Kendra web crawler from indexing your website.


Source: https://docs.aws.amazon.com/kendra/latest/dg/getting-started-webcrawler.html

Scaling up a Serverless Web Crawler and Search Engine

Introduction

Building a search engine can be a daunting undertaking. You must continually scrape the web and index its content so it can be retrieved quickly in response to a user’s query. The goal is to implement this in a way that avoids infrastructure complexity while remaining elastic. However, the architecture that achieves this is not necessarily obvious. In this blog post, we will describe a serverless search engine that can scale to crawl and index large web pages.

A simple search engine is composed of two main components:

  • A web crawler (or web scraper) to extract and store content from the web
  • An index to answer search queries

Web Crawler

You may have already read “Serverless Architecture for a Web Scraping Solution.” In that post, Dzidas reviews two different serverless architectures for a web scraper on AWS. Using AWS Lambda provides a simple and cost-effective option for crawling a website. However, it comes with a caveat: the Lambda timeout caps crawling time at 15 minutes. This post shows how you can tackle that limitation and build a serverless web crawler that can scale to crawl larger portions of the web.

A typical web crawler algorithm uses a queue of URLs to visit. It performs the following steps (a minimal code sketch follows the list):

  • It takes a URL off the queue
  • It visits the page at that URL
  • It scrapes any URLs it can find on the page
  • It pushes the ones that it hasn’t visited yet onto the queue
  • It repeats the preceding steps until the URL queue is empty
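As a rough illustration, here is that loop as a minimal, single-threaded Python sketch. The seed URL is a placeholder and the link extraction is deliberately simplistic; it only exists to make the algorithm above concrete.

```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def crawl(seed_url):
    queue = [seed_url]   # URLs waiting to be visited
    visited = set()      # URLs we have already fetched

    while queue:                         # repeat until the URL queue is empty
        url = queue.pop(0)               # take a URL off the queue
        if url in visited:
            continue
        visited.add(url)

        response = requests.get(url, timeout=10)           # visit the page
        soup = BeautifulSoup(response.text, "html.parser")

        # scrape any URLs on the page and queue the ones not visited yet
        for link in soup.find_all("a", href=True):
            absolute_url = urljoin(url, link["href"])
            if absolute_url not in visited:
                queue.append(absolute_url)

crawl("https://www.example.com/")  # placeholder seed URL
```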

Even if we parallelize visiting URLs, we may still exceed the 15-minute limit for larger websites.

Breaking Down the Web Crawler Algorithm

AWS Step Functions is a serverless function orchestrator. It enables you to sequence one or more AWS Lambda functions to create a longer running workflow. It’s possible to break down this web crawler algorithm into steps that can be run in individual Lambda functions. The individual steps can then be composed into a state machine, orchestrated by AWS Step Functions.

Here is a possible state machine you can use to implement this web crawler algorithm:

Figure 1: Basic State Machine

Figure 1: Basic State Machine

1. ReadQueuedUrls – reads any non-visited URLs from our queue
2. QueueContainsUrls? – checks whether there are non-visited URLs remaining
3. CrawlPageAndQueueUrls – takes one URL off the queue, visits it, and writes any newly discovered URLs to the queue
4. CompleteCrawl – when there are no URLs in the queue, we’re done!

Each part of the algorithm can now be implemented as a separate Lambda function. Instead of the entire process being bound by the 15-minute timeout, this limit will now only apply to each individual step.

Where you might have previously used an in-memory queue, you now need a URL queue that will persist between steps. One option is to pass the queue around as an input and output of each step. However, you may be bound by the maximum I/O sizes for Step Functions. Instead, you can represent the queue as an Amazon DynamoDB table, which each Lambda function may read from or write to. The queue is only required for the duration of the crawl. So you can create the DynamoDB table at the start of the execution, and delete it once the crawler has finished.
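As a sketch of that idea, each Lambda function could read and write the queue table with the AWS SDK for Python (Boto3). The table name and attribute names below are illustrative and are not taken from the sample project.

```python
import boto3
from boto3.dynamodb.conditions import Attr

dynamodb = boto3.resource("dynamodb")
queue_table = dynamodb.Table("crawl-url-queue")  # hypothetical table name

def queue_urls(urls):
    """CrawlPageAndQueueUrls: write newly discovered URLs to the queue table."""
    with queue_table.batch_writer() as batch:
        for url in urls:
            batch.put_item(Item={"url": url, "visited": False})

def read_queued_urls(limit=10):
    """ReadQueuedUrls: fetch a batch of non-visited URLs.

    A scan keeps this sketch simple; a real table would use a key design
    that avoids scanning the whole queue."""
    response = queue_table.scan(FilterExpression=Attr("visited").eq(False))
    return [item["url"] for item in response["Items"]][:limit]
```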

Scaling up

Crawling one page at a time is going to be a bit slow. You can use the Step Functions “Map state” to run the CrawlPageAndQueueUrls step on multiple URLs at once. You should be careful not to bombard a website with thousands of parallel requests. Instead, you can take a fixed-size batch of URLs from the queue in the ReadQueuedUrls step.

An important limit to consider when working with Step Functions is the maximum execution history size. You can protect against hitting this limit by following the recommended approach of splitting work across multiple workflow executions. You can do this by checking the total number of URLs visited on each iteration. If this exceeds a threshold, you can spawn a new Step Functions execution to continue crawling.
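A hedged sketch of that check follows, assuming the step receives the state machine ARN and the remaining crawl state as input; the threshold value is arbitrary.

```python
import json
import boto3

sfn = boto3.client("stepfunctions")

URL_THRESHOLD = 10000  # illustrative limit, chosen well below the history cap

def continue_in_new_execution_if_needed(state_machine_arn, urls_visited, crawl_state):
    """If this execution has visited too many URLs, hand the remaining work
    to a fresh Step Functions execution so this one can finish cleanly."""
    if urls_visited < URL_THRESHOLD:
        return False
    sfn.start_execution(
        stateMachineArn=state_machine_arn,
        input=json.dumps(crawl_state),  # e.g. the queue table name and counters
    )
    return True
```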

Step Functions has native support for error handling and retries. You can take advantage of this to make the web crawler more robust to failures.

With these scaling improvements, here’s our final state machine:

Figure 2: Final State Machine

Figure 2: Final State Machine

This includes the same steps as before (1-4), but also two additional steps (5 and 6) responsible for breaking the workflow into multiple state machine executions.

Search Index

Deploying a scalable, efficient, and full-text search engine that provides relevant results can be complex and involve operational overheads. Amazon Kendra is a fully managed service, so there are no servers to provision. This makes it an ideal choice for our use case. Amazon Kendra supports HTML documents. This means you can store the raw HTML from the crawled web pages in Amazon Simple Storage Service (S3). Amazon Kendra will provide a machine learning powered search capability on top, which gives users fast and relevant results for their search queries.

Amazon Kendra does have limits on the number of documents stored and daily queries. However, additional capacity can be added to meet demand through query or document storage bundles.

The CrawlPageAndQueueUrls step writes the content of the web page it visits to S3. It also writes some metadata to help Amazon Kendra rank or present results. After crawling is complete, it can then trigger a data source sync job to ensure that the index stays up to date.
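A rough sketch of those two actions with Boto3 is shown below. The bucket name, index ID, and data source ID are placeholders, and the metadata file layout follows the Amazon Kendra S3 metadata convention, so verify it against the current documentation.

```python
import json
import boto3

s3 = boto3.client("s3")
kendra = boto3.client("kendra")

BUCKET = "my-crawl-bucket"            # placeholder bucket name
INDEX_ID = "my-index-id"              # placeholder Kendra index ID
DATA_SOURCE_ID = "my-data-source-id"  # placeholder Kendra S3 data source ID

def store_page(url, html, title):
    """Write the crawled HTML plus a metadata sidecar Kendra can use
    when ranking and presenting results."""
    key = url.replace("https://", "").replace("/", "_") + ".html"
    s3.put_object(Bucket=BUCKET, Key=key,
                  Body=html.encode("utf-8"), ContentType="text/html")
    metadata = {"Title": title, "Attributes": {"_source_uri": url}}
    s3.put_object(Bucket=BUCKET, Key=f"metadata/{key}.metadata.json",
                  Body=json.dumps(metadata).encode("utf-8"))

def trigger_sync():
    """After crawling completes, ask Kendra to re-sync the S3 data source."""
    kendra.start_data_source_sync_job(Id=DATA_SOURCE_ID, IndexId=INDEX_ID)
```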

One aspect to be mindful of when employing Amazon Kendra in your solution is its cost model. It is priced per index/hour, which is more favorable for large-scale enterprise usage than for smaller personal projects. We recommend you take note of the free tier of Amazon Kendra’s Developer Edition before getting started.

Overall Architecture

You can add in one more DynamoDB table to monitor your web crawl history. Here is the architecture for our solution:

Figure 3: Overall Architecture

Figure 3: Overall Architecture

A sample Node.js implementation of this architecture can be found on GitHub.

In this sample, a Lambda layer provides a Chromium binary (via chrome-aws-lambda). It uses Puppeteer to extract content and URLs from visited web pages. Infrastructure is defined using the AWS Cloud Development Kit (CDK), which automates the provisioning of cloud applications through AWS CloudFormation.

The Amazon Kendra component of the example is optional. You can deploy just the serverless web crawler if preferred.

Conclusion

If you use fully managed AWS services, then building a serverless web crawler and search engine isn’t as daunting as it might first seem. We’ve explored ways to run crawler jobs in parallel and scale a web crawler using AWS Step Functions. We’ve utilized Amazon Kendra to return meaningful results for queries of our unstructured crawled content. We achieve all this without the operational overheads of building a search index from scratch. Review the sample code for a deeper dive into how to implement this architecture.

Source: https://aws.amazon.com/blogs/architecture/scaling-up-a-serverless-web-crawler-and-search-engine/

Posted On: Jul 7, 2021

Amazon Kendra is an intelligent search service powered by machine learning, enabling organizations to provide relevant information to customers and employees, when they need it. Starting today, AWS customers can use the Amazon Kendra web crawler to index and search webpages.

Critical information can be scattered across multiple data sources in an enterprise, including internal and external websites. Amazon Kendra customers can now use the Kendra web crawler to index documents made available on websites (HTML, PDF, MS Word, MS PowerPoint, and Plain Text) and search for information across this content using Kendra Intelligent Search. Organizations can provide relevant search results to users seeking answers to their questions, for example, product specification detail that resides on a support website or company travel policy information that’s listed on an intranet webpage.

Note: The Kendra web crawler honors access rules in robots.txt, and customers using the Kendra web crawler will need to ensure they are authorized to index those webpages in order to return search results for end users.

Source: https://aws.amazon.com/about-aws/whats-new/2021/07/amazon-kendra-releases-web-crawler-to-enable-web-site-search/

7 Most useful tools to scrape data from Amazon

This article gives you an idea of what web scraping tool you should use for scraping data from amazon.

The list includes small-scale browser extension tools as well as multi-functional web scraping software, compared across three dimensions: the degree of automation, how friendly the user interface is, and how much can be used for free.

TOP 7 Amazon Scraping Tools:

Browser extensions: Data Miner, Web Scraper, Scraper Parsers, Amazon Scraper (Trial Version)

Scraping software: Octoparse, ScrapeStorm, ParseHub

Browser Extensions

The key advantage of an extension is that it is easy to get started with, so you can pick up the idea of web scraping quickly. With rather basic functions, these options fit casual scraping or small businesses that need small amounts of simply structured information.

browser extensions for web scraping

Data Miner

Data Miner is an extension tool that works on Google Chrome and Microsoft Edge. It helps you scrape data from web pages into a CSV file or Excel spreadsheet. A number of custom recipes are available for scraping Amazon data. If the recipes on offer are exactly what you need, this can be a handy way to scrape Amazon within a few clicks.

Data miner scraping amazon

Data scraped by Data Miner

Data Miner has a friendly, step-by-step interface and basic web scraping functions. It is better suited to small businesses or casual use.

There is a page limit (500/month) for the free plan with Data Miner. If you need to scrape more, professional and other paid plans are available.

Web Scraper 

Web Scraper is an extension tool with a point-and-click interface integrated into the browser's developer tools. Because it has no ready-made templates for e-commerce or Amazon scraping, you have to build your own crawler by selecting the listing information you want on the web page.

web scraper scraping amazon

UI integrated in the developer tool

Web Scraper offers features (available on paid plans) such as cloud extraction, scheduled scraping, IP rotation, and API access, so it is capable of more frequent scraping and of scraping larger volumes of information.

Scraper Parsers

Scraper Parsers is a browser extension tool for extracting unstructured data and visualizing it without code. Extracted data can be viewed on the site or downloaded in various formats (XLSX, XLS, XML, CSV). The extracted numbers can also be displayed in charts.

scraper parsers gets amazon data

Small draggable Panel

The UI of Parsers is a small panel that you can drag around the browser to make selections by clicking, and it also supports scheduled scraping. However, it is not especially stable and easily gets stuck. As a visitor you are limited to 600 pages per site; you get 590 more if you sign up.

Amazon Scraper - Trial Version

Amazon Scraper is available in the Chrome Web Store. It can scrape the price, shipping cost, product header, product information, product images, and ASIN from the Amazon search page.

amazon scraper

Right-click and scrape

Go to the Amazon website and search. When you are on the search results page you want to scrape from, right-click and choose the "Scrap Asin From This Page" option. The information will be extracted and saved as a CSV file.

This trial version can only download 2 pages of any search query. You need to buy the full version to download unlimited pages and get 1-year free support.

Scraping Software

To get past the limits of browser extensions, you need a more powerful tool.

web scraping softwares

Octoparse 

Octoparse is a free-for-life web scraping tool. It helps users quickly scrape web data without coding. Compared with the others, the highlight of this product is its graphical, intuitive UI design. Notably, its auto-detection function saves you from clicking around aimlessly and ending up with messy data.

Besides auto-detection, the Amazon templates are even more convenient. Using templates, you can obtain product list information as well as detail page information on Amazon. You can also create a more customized crawler yourself in advanced mode.

octoparse templates

Plenty of templates available for use on Octoparse

The free plan has no limit on the amount of data scraped, as long as you keep each task within 10,000 rows.

octoparse scraped amazon data

Amazon data scraped using Octoparse

Powerful functions such as cloud service, scheduled automatic scraping, and IP rotation (to prevent IP bans) are offered in the paid plans. If you want to monitor stock numbers, prices, and other information about an array of shops/products on a regular basis, they are definitely helpful.

Related Tutorial: 

Scrape product details from Amazon

Scrape reviews from Amazon

ScrapeStorm

ScrapeStorm is an AI-powered visual web scraping tool. Its smart mode works similarly to Octoparse's auto-detection, intelligently identifying the data with little manual operation required. You just need to click and enter the URL of the Amazon page you want to scrape.

Its Pre Login function helps you scrape URLs that require login to view content. Generally speaking, the UI design of the app is like a browser and comfortable to use.

Data scraped using ScrapeStorm   

ScrapeStorm offers a free quota of 100 rows of data per day, with one concurrent run allowed. Data is only valuable once you have enough of it to analyze, so if you choose this tool you should consider upgrading; the Professional plan gives you 10,000 rows per day.

ParseHub

ParseHub is another free web scraper available for direct download. Like most of the scraping tools above, it supports building crawlers in a click-and-select way and exporting data into structured spreadsheets.

For Amazon scraping, ParseHub doesn't support auto-detection or offer any Amazon templates. However, if you have prior experience building customized crawlers with a scraping tool, you can give it a shot.

Build your crawler on Parsehub

Starting from the Standard plan, you can save images and files to Dropbox and run with IP rotation and scheduling. Free plan users get 200 pages per run. Don't forget to back up your data (14-day data retention).

Something More than Tools

Tools are created for convenience. They make complicated operations possible with a few clicks.

However, it is also common for users to run into unexpected errors, because the situation is ever-changing on different sites. You can go a little deeper to rescue yourself from such a dilemma: learn a bit about HTML and XPath. Not far enough to become a coder, just a few steps toward knowing the tool better.

If tools are not your thing and you're looking for a data service for your project, Octoparse data service is a good choice. We work closely with you to understand your data requirements and make sure we deliver what you need. Talk to an Octoparse data expert now to discuss how web scraping services can help you maximize your efforts.

octoparse data service

Author: Cici 


Source: https://www.octoparse.com/blog/most-useful-tools-to-scrape-data-from-amazon

Web crawler amazon

Features

This actor will crawl items for specified keywords on Amazon and will automatically extract all pages for those keywords. The scraper then extracts all seller offers for each given keyword, so if there is pagination on the seller offers page, note that you will get all offers.

Find out more about why you should use this scraper for your business and suggestions on how to use the data in this YouTube Video.

Sample result

Proxy

The actor needs proxies to function correctly. We don't recommend running it on a free account for more than a sample of results. If you plan to run it for more than a few results, subscribing to the Apify platform will give you access to a large pool of proxies.

Asin crawling

One of the features of the scraper is that it can get price offers for a list of ASINs. If this is what you need, you can specify the ASINs in the input along with the combination of countries to get results for.

With this setup, the scraper will check whether that ASIN is available for all countries and get all seller offers for it.

Direct URLs crawling

If you already have your ASINs and don't want to crawl them manually, you can enqueue the requests from the input.

Here is a sample object to get itemDetail info:

Here is a sample object to get seller info:

Additional options

maxResults - If you want to limit the number of results extracted, set this value to that number; otherwise leave it blank or 0. It doesn't work with 100% precision: if you specify five results, it may create more records because of concurrency.

Compute unit consumption

  • Using raw requests - 0.0884 CU when extracting 20 results from a keyword search
  • Using a browser - 0.6025 CU when extracting 20 results from a keyword search

Supported countries

You can specify the country where you want to scrape items. We currently support these countries:

If you want us to add another country, please email [email protected]

Changelog

Changes related to new versions are listed in the CHANGELOG file.

Source: https://apify.com/vaclavrut/amazon-crawler

Using a web crawler data source

You can use the Amazon Kendra web crawler to crawl and index webpages. For a walk-through of how to use the web crawler in the console, see Getting started with Amazon Kendra web crawler (Console).

You can use the web crawler to crawl webpages and index them as your documents. For the websites you want to crawl and index, you provide either the seed or starting point URLs or the sitemap URLs. You can only crawl websites that use the secure communication protocol, Hypertext Transfer Protocol Secure (HTTPS). If you receive an error when crawling a website, it could be that the website is blocked from crawling.

You can configure the following crawl settings:

  • The range of websites to crawl: website host names only, websites including subdomains, or websites including subdomains and other domains that the webpages link to.

  • The depth or number of levels in a website from the seed level to crawl. For example, if a website has 3 levels – index level or the seed level in this example, sections level, and subsections level – and you are only interested in crawling information from the index level to the sections level (levels 0 to 1), you can set your depth to 1.

  • The maximum number of URLs on a single webpage that are crawled.

  • The maximum size in MB of a webpage to crawl.

  • The maximum number of URLs crawled per website host per minute.

  • Regular expression patterns to include or exclude certain URLs to crawl.

  • The web proxy information to connect to and crawl internal websites.

  • The authentication information to access and crawl websites that require user authentication.

You can configure the web crawler using the CreateDataSource operation. You provide the web crawler configuration information in the WebCrawlerConfiguration structure.

You use the SeedUrlConfiguration structure to provide a list of seed URLs and choose whether to crawl only website host names, or include subdomains, or include subdomains and other domains the webpages link to. You also use the SiteMapsConfiguration structure to provide a list of sitemap URLs.
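For example, a call might look like the following sketch using the AWS SDK for Python (Boto3). The index ID, role ARN, and URLs are placeholders, and only a subset of the optional crawl settings is shown; check the API reference for the full list of fields.

```python
import boto3

kendra = boto3.client("kendra")

response = kendra.create_data_source(
    IndexId="your-index-id",  # placeholder
    Name="my-webcrawler-data-source",
    Type="WEBCRAWLER",
    RoleArn="arn:aws:iam::111122223333:role/KendraWebCrawlerRole",  # placeholder
    Configuration={
        "WebCrawlerConfiguration": {
            "Urls": {
                "SeedUrlConfiguration": {
                    "SeedUrls": ["https://www.example.com"],
                    # HOST_ONLY, SUBDOMAINS, or EVERYTHING
                    "WebCrawlerMode": "SUBDOMAINS",
                }
            },
            "CrawlDepth": 1,        # optional crawl settings
            "MaxLinksPerPage": 100,
        }
    },
)
print(response["Id"])  # the ID of the new data source connector
```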

When selecting websites to index, you must adhere to the Amazon Acceptable Use Policy and all other Amazon terms. Remember that you must only use the Amazon Kendra web crawler to index your own webpages, or webpages that you have authorization to index. To learn how to stop the Amazon Kendra web crawler from indexing your website(s), please see Stopping the Amazon Kendra web crawler from indexing your website.

Website user authentication

Before connecting to the web crawler, you need to check whether the websites you want to crawl require authentication. If a website requires basic authentication, you provide the web crawler with the host name of the website, the port number, and a secret in AWS Secrets Manager that stores your basic authentication credentials (your user name and password).

If you use the Amazon Kendra console, you can choose an existing secret. If you use the Amazon Kendra API, you must provide the Amazon Resource Name (ARN) of an existing secret that contains your user name and password. You can create a secret in AWS Secrets Manager.

The secret must contain the user name and password of the website that you want to crawl. The following is the minimum JSON structure that must be stored in the secret.
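The sketch below creates such a secret with the AWS SDK for Python (Boto3). The key names shown ("userName" and "password") are what the basic authentication format expects, but treat them as an assumption and confirm against the current guide; the secret name is a placeholder.

```python
import json
import boto3

secrets = boto3.client("secretsmanager")

# Minimum structure: the user name and password for the website to crawl.
secret_value = {
    "userName": "your-user-name",
    "password": "your-password",
}

secrets.create_secret(
    Name="AmazonKendra-webcrawler-credentials",  # placeholder secret name
    SecretString=json.dumps(secret_value),
)
```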

The secret can contain other information, but Amazon Kendra ignores it.

You use the AuthenticationConfiguration structure to provide the website host name, website port number, and the secret that stores your authentication credentials.

IAM role for web crawler

When you use the web crawler, you specify an IAM role that grants Amazon Kendra permission to access web crawler resources such as your index and secret. The secret stores your credentials for websites or web proxy servers that require basic authentication. The IAM role for the web crawler must have permission to access the secret and to use the AWS Key Management Service (AWS KMS) key to decrypt it. The IAM role also needs access to your index so that it can add and update crawled webpages to the index. For more information, see IAM access roles for Amazon Kendra.

Web proxy

You can use a web proxy to connect to internal websites you want to crawl. Amazon Kendra supports connecting to web proxy servers that are backed by basic authentication or you can connect with no authentication. You provide the host name of the website and the port number. You can also provide web proxy credentials using a secret in AWS Secrets Manager.

You use the ProxyConfiguration structure to provide the website host name and port number. You can also provide the secret that stores your web proxy credentials.


Source: https://docs.aws.amazon.com/kendra/latest/dg/data-source-web-crawler.html


Hartley Brody

In its simplest form, web scraping is about making requests and extracting data from the response. For a small web scraping project, your code can be simple. You just need to find a few patterns in the URLs and in the HTML response and you’re in business.

But everything changes when you’re trying to pull over 1,000,000 products from the largest ecommerce website on the planet.

Amazon Crawling

When crawling a sufficiently large website, the actual web scraping (making requests and parsing HTML) becomes a very minor part of your program. Instead, you spend a lot of time figuring out how to keep the entire crawl running smoothly and efficiently.

This was my first time doing a scrape of this magnitude. I made some mistakes along the way, and learned a lot in the process. It took several days (and quite a few false starts) to finally crawl the millionth product. If I had to do it again, knowing what I now know, it would take just a few hours.

In this article, I’ll walk you through the high-level challenges of pulling off a crawl like this, and then run through all of the lessons I learned. At the end, I’ll show you the code I used to successfully pull 1MM+ items from amazon.com.

I’ve broken it up as follows:

  1. High-Level Challenges I Ran Into
  2. Crawling At Scale Lessons Learned
  3. Site-Specific Lessons I Learned About Amazon.com
  4. How My Finished, Final Code Works

High-Level Challenges I Ran Into

There were a few challenges I ran into that you’ll see on any large-scale crawl of more than a few hundred pages. These apply to crawling any site or running a sufficiently large crawling operation across multiple sites.

High-Performance is a Must

Now that is high-throughput

In a simple web scraping program, you make requests in a loop – one after the other. If a site takes 2-3 seconds to respond, then you’re looking at making 20-30 requests a minute. At this rate, your crawler would have to run for a month, non-stop before you made your millionth request.

Not only is this very slow, it’s also wasteful. The crawling machine is sitting there idly for those 2-3 seconds, waiting for the network to return before it can really do anything or start processing the next request. That’s a lot of dead time and wasted resources.

When thinking about crawling anything more than a few hundred pages, you really have to think about putting the pedal to the metal and pushing your program until it hits the bottleneck of some resources – most likely network or disk IO.

I didn’t need to do this for my purposes (more on that later), but you can also think about ways to scale a single crawl across multiple machines, so that you can start to push past single-machine limits.

Avoiding Bot Detection

Battling with Bots

Any site that has a vested interest in protecting its data will usually have some basic anti-scraping measures in place. Amazon.com is certainly no exception.

You have to have a few strategies up your sleeve to make sure that individual HTTP requests – as well as the larger pattern of requests in general – don’t appear to be coming from one centralized bot.

For this crawl, I made sure to:

  1. Spoof headers to make requests seem to be coming from a browser, not a script
  2. Rotate IPs using a list of over 500 proxy servers I had access to
  3. Strip "tracking" query params from the URLs to remove identifiers linking requests together

More on all of these in a bit.

The Crawler Needed to be Resilient

Just Keep Swimming

The crawler needs to be able to operate smoothly, even when faced with common issues like network errors or unexpected responses.

You also need to be able to pause and continue the crawl, updating code along the way, without going back to “square one”. This allows you to update parsing or crawling logic to fix small bugs, without needing to rescrape everything you did in the past few hours.

I didn’t have this functionality initially and I regretted it, wasting tons of hours hitting the same URLs again and again whenever I need to make updates to fix small bugs affecting only a few pages.


Crawling At Scale Lessons Learned

From the simple beginnings to the hundreds of lines of python I ended up with, I learned a lot in the process of running this project. All of these mistakes cost me time in some fashion, and learning the lessons I present here will make your amazon.com crawl much faster from start to finish.

1. Do the Back of the Napkin Math

When I did a sample crawl to test my parsing logic, I used a simple loop and made requests one at a time. After 30 minutes, I had pulled down about 1000 items.

Initially, I was pretty stoked. “Yay, my crawler works!” But when I turned it loose on the full data set, I quickly realized it wasn’t feasible to run the crawl like this at full scale.

Doing the back of the napkin math, I realized I needed to be doing dozens of requests every second for the crawl to complete in a reasonable time (my goal was 4 hours).

This required me to go back to the drawing board.

2. Performance is Key, Need to be Multi-Threaded

In order to speed things up and not wait for each request, you’ll need to make your crawler multi-threaded. This allows the CPU to stay busy working on one response or another, even when each request is taking several seconds to complete.

You can’t rely on single-threaded, network blocking operations if you’re trying to do things quickly. I was able to get 200 threads running concurrently on my crawling machine, giving me a 200x speed improvement without hitting any resource bottlenecks.
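For illustration, here’s roughly what that pattern looks like with Python’s standard library. This isn’t the code from my crawler – the URLs and the worker function are placeholders – but it shows how 200 threads can keep requests in flight while each one waits on the network.

```python
import concurrent.futures
import requests

def fetch_and_parse(url):
    """Placeholder worker: fetch one listing page and return its size."""
    response = requests.get(url, timeout=10)
    return url, len(response.content)

urls = [f"https://www.example.com/page/{i}" for i in range(1000)]  # placeholders

# 200 worker threads keep the machine busy while individual requests
# spend 2-3 seconds waiting on the network.
with concurrent.futures.ThreadPoolExecutor(max_workers=200) as executor:
    for url, size in executor.map(fetch_and_parse, urls):
        print(url, size)
```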

3. Know Your Bottlenecks

You need to keep an eye on the four key resources of your crawling machine (CPU, memory, disk IO and network IO) and make sure you know which one you’re bumping up against.

What is keeping your program from making 1MM requests all at once?

The most likely resource you’ll use up is your network IO – the machine simply won’t be capable of writing to the network (making HTTP requests) or reading from the network (getting responses) fast enough, and this is what your program will be limited by.

Note that it’ll likely take hundreds of simultaneous requests before you get to this point. You should look at performance metrics before you assume your program is being limited by the network.

Depending on the size of your average requests and how complex your parsing logic, you also could run into CPU, memory or disk IO as a bottleneck.

You also might find bottlenecks before you hit any resource limits, like if your crawler gets blocked or throttled for making requests too quickly.

This can be avoided by properly disguising your request patterns, as I discuss below.

4. Use the Cloud

I used a single beefy EC2 cloud server from Amazon to run the crawl. This allowed me to spin up a very high-performance machine that I could use for a few hours at a time, without spending a ton of money.

It also meant that the crawl wasn’t running from my computer, burning my laptop’s resources and my local ISP’s network pipes.

5. Don’t Forget About Your Instances

The day after I completed the crawl, I woke up and realized I had left an m4.10xlarge running idly overnight. My reaction:

I probably wasted an extra $50 in EC2 fees for no reason. Make sure you stop your instances when you’re done with them!

6. Use a Proxy Service

This one is a bit of a no-brainer, since 1MM requests all coming from the same IP will definitely look suspicious to a site like Amazon that can track crawlers.

I’ve found that it’s much easier (and cheaper) to let someone else orchestrate all of the proxy server setup and maintenance for hundreds of machines, instead of doing it yourself.

This allowed me to use one high-performance EC2 server for orchestration, and then rent bandwidth on hundreds of other machines for proxying out the requests.

I used ProxyBonanza and found it to be quick and simple to get access to hundreds of machines.

7. Don’t Keep Much in Runtime Memory

If you keep big lists or dictionaries in memory, you’re asking for trouble. What happens when you accidentally hit Ctrl-C three hours into the scrape (as I did at one point)? Back to the beginning for you!

Make sure that the important progress information is stored somewhere more permanent.

8. Use a Database for Storing Product Information

Store each product that you crawl as a row in a database table. Definitely don’t keep them floating in memory or try to write them to a file yourself.

Databases will let you perform basic querying, exporting and deduping, and they also have lots of other great features. Just get in a good habit of using them for storing your crawl’s data.

9. Use Redis for Storing a Queue of URLs to Scrape

Store the “frontier” of URLs that you’re waiting to crawl in an in-memory cache like redis. This allows you to pause and continue your crawl without losing your place.

If the cache is accessible over the network, it also allows you to spin up multiple crawling machines and have them all pulling from the same backlog of URLs to crawl.
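For example, with the redis-py client, a Redis list plus a set works as a simple shared URL frontier. The key names are illustrative, not the ones from my crawler.

```python
import redis

r = redis.Redis(host="localhost", port=6379, db=0)

def push_url(url):
    """Add a URL to the shared frontier if we haven't seen it before."""
    if r.sadd("seen_urls", url):       # returns 1 only the first time we see it
        r.rpush("url_queue", url)

def pop_url():
    """Take the next URL to crawl, or None if the queue is empty."""
    url = r.lpop("url_queue")
    return url.decode("utf-8") if url else None
```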

10. Log to a File, Not stdout

While it’s temptingly easy to simply send all of your output to the console via stdout, it’s much better to pipe everything into a log file. You can still view the log lines coming in, in real time, by running tail -f on the logfile.

Having the logs stored in a file makes it much easier to go back and look for issues. You can log things like network errors, missing data or other exceptional conditions.

I also found it helpful to log the current URL that was being crawled, so I could easily hop in, grab the current URL that was being crawled and see how deep it was in any category. I could also watch the logs fly by to get a sense of how fast requests were being made.
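A minimal logging setup along those lines is below; the file name and messages are just examples.

```python
import logging

logging.basicConfig(
    filename="crawl.log",  # watch progress with: tail -f crawl.log
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)

logging.info("crawling %s", "https://www.example.com/some/category?page=12")
logging.warning("missing price on %s", "https://www.example.com/some/product")
```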

11. Use screen to Manage the Crawl Process instead of your SSH Client

If you SSH into the server and start your crawler directly from that session, what happens if the SSH connection closes? Maybe you close your laptop or the wifi connection drops. You don’t want that process to get orphaned and potentially die.

Using the built-in Unix screen command allows you to disconnect from your crawling process without worrying that it’ll go away. You can close your laptop and simply SSH back in later, reconnect to the screen session, and you’ll see your crawling process still humming along.

12. Handle Exceptions Gracefully

You don’t want to start your crawler, go work on other stuff for 3 hours and then come back, only to find that it crashed 5 minutes after you started it.

Any time you run into an exceptional condition, simply log that it happened and continue. It makes sense to add exception handling around any code that interacts with the network or the HTML response.

Be especially aware of non-ascii characters breaking your logging.


Site-Specific Lessons I Learned About Amazon.com

Every site presents its own web scraping challenges. Part of any project is getting to know which patterns you can leverage, and which ones to avoid.

Here’s what I found.

13. Spoof Headers

Besides using proxies, the other classic obfuscation technique in web scraping is to spoof the headers of each request. For this crawl, I just grabbed the User Agent that my browser was sending as I visited the site.

If you don’t spoof the User Agent, you’ll get a generic anti-crawling response for every request to Amazon.

In my experience, there was no need to spoof other headers or keep track of session cookies. Just make a GET request to the right URL – through a proxy server – and spoof the User Agent and that’s it – you’re past their defenses.
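Putting those two ideas together, a single request looks roughly like this. The proxy address and User Agent string are placeholders – use your own browser’s UA and your proxy provider’s details.

```python
import requests

headers = {
    # Copied from a real browser session; any current browser UA string works.
    "User-Agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                   "AppleWebKit/537.36 (KHTML, like Gecko) "
                   "Chrome/90.0.4430.93 Safari/537.36"),
}

proxies = {
    "http": "http://user:pass@proxy.example.com:8080",   # placeholder proxy
    "https": "http://user:pass@proxy.example.com:8080",
}

response = requests.get(
    "https://www.amazon.com/s?k=headphones",  # example listing URL
    headers=headers,
    proxies=proxies,
    timeout=10,
)
print(response.status_code)
```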

14. Strip Unnecessary Query Parameters from the URL

One thing I did out of an abundance of caution was to strip out unnecessary tracking parameters from the URL. I noticed that clicking around the site seemed to append random IDs to the URL that weren’t necessary to load the product page.

I was a bit worried that they could be used to tie requests to each other, even if they were coming from different machines, so I made sure my program stripped down URLs to only their core parts before making the request.
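One way to do that stripping with the standard library, keeping only the scheme, host, and path (the example URL and params are made up):

```python
from urllib.parse import urlsplit, urlunsplit

def strip_tracking_params(url):
    """Drop the query string and fragment, keeping only the core URL."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))

print(strip_tracking_params(
    "https://www.amazon.com/dp/B000000000?ref=sr_1_3&qid=1625673891"
))
# -> https://www.amazon.com/dp/B000000000
```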

15. Amazon’s Pagination Doesn’t Go Very Deep

While some categories of products claim to contain tens of thousands of items, Amazon will only let you page through about 400 pages per category.

This is a common limit on many big sites, including Google search results. Humans don’t usually click past the first few pages of results, so the sites don’t bother to support that much pagination. It also means that going too deep into results can start to look a bit fishy.

If you want to pull in more than a few thousand products per category, you need to start with a list of lots of smaller subcategories and paginate through each of those. But keep in mind that many products are listed in multiple subcategories, so there may be a lot of duplication to watch out for.

16. Products Don’t Have Unique URLs

The same product can live at many different URLs, even after you strip off tracking URL query params. To dedupe products, you’ll have to use something more specific than the product URL.

How to dedupe depends on your application. It’s entirely possible for the exact same product to be sold by multiple sellers. You might look for ISBN or SKU for some kinds of products, or something like the primary product image URL or a hash of the primary image.
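For example, hashing the primary image URL gives a compact key to dedupe on – assuming, as I did, that the same product listing reuses the same primary image URL.

```python
import hashlib

seen_keys = set()

def product_key(primary_image_url):
    """Derive a deduplication key from the product's primary image URL."""
    return hashlib.sha1(primary_image_url.encode("utf-8")).hexdigest()

def is_duplicate(primary_image_url):
    """Track which products we've already stored, keyed by image hash."""
    key = product_key(primary_image_url)
    if key in seen_keys:
        return True
    seen_keys.add(key)
    return False
```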

17. Avoid Loading Detail Pages

This realization helped me make the crawler 10-12x faster, and much simpler. I realized that I could grab all of the product information I needed from the subcategory listing view, and didn’t need to load the full URL to each product’s detail page.

I was able to grab 10-12 products with one request, including each of their titles, URLs, prices, ratings, categories and images – instead of needing to make a request to load each product’s detail page separately.

Whether you need to load the detail page to find more information like the description or related products will depend on your application. But if you can get by without it, you’ll get a pretty nice performance improvement.

18. Cloudfront has no Rate Limiting for Amazon.com Product Images

While I was using a list of 500 proxy servers to request the product listing URLs, I wanted to avoid downloading the product images through the proxies since it would chew up all my bandwidth allocation.

Fortunately, the product images are served using Amazon’s CloudFront CDN, which doesn’t appear to have any rate limiting. I was able to download over 100,000 images with no problems – until my EC2 instance ran out of disk space.

Then I broke out the image downloading into its own little python script and simply had the crawler store the URL to the product’s primary image, for later retrieval.

19. Store Placeholder Values

There are lots of different types of product pages on Amazon. Even within one category, there can be several different styles of HTML markup on individual product pages, and it might take you a while to discover them all.

If you’re not able to find a piece of information in the page with the extractors you built, store a placeholder value like “<No Image Detected>” in your database.

This allows you to periodically query for products with missing data, visit their product URLs in your browser and find the new patterns. Then you can pause your crawler, update the code and then start it back up again, recognizing the new pattern that you had initially missed.


How My Finished, Final Code Works

TL;DR: Here’s a link to my code on github. It has a readme for getting you setup and started on your own amazon.com crawler.

Once you get the code downloaded, the libraries installed and the connection information stored in the settings file, you’re ready to start running the crawler!

If you run it with the “start” command, it looks at the list of category URLs you’re interested in, and then goes through each of those to find all of the subcategory URLs that are listed on those pages, since pagination within each category is limited (see lesson #15, above).

It puts all of those subcategory URLs into a redis queue, and then spins up a number of threads (based on a value in the settings file) to process the subcategory URLs. Each thread pops a subcategory URL off the queue, visits it, pulls in the information about the 10-12 products on the page, and then puts the “next page” URL back into the queue.

The process continues until the queue is empty or the configured maximum number of results has been reached.

Note that the crawler does not currently visit each individual product page since I didn’t need anything that wasn’t visible on the subcategory listing pages, but you could easily add another queue for those URLs and a new function for processing those pages.


Hope that helps you get a better sense of how you can conduct a large scrape of amazon.com or a similar ecommerce website.

If you’re interested in learning more about web scraping, I have an online course that covers the basics and teaches you how to get your own web scrapers running in 15 minutes.

Source: https://blog.hartleybrody.com/scrape-amazon/

