Web scraping solution for tackling unfair competition

See how GroupBWT assisted a leading law firm in tackling online sales violations and unfair competition by providing high-end web scraping services.


The Client Story

Two years ago, GroupBWT began a long-term collaborative project with a major legal player in the US market. The project’s aim was to collect data from Walmart and Amazon for extensive lists of selected keywords and products.

Technology used: Laravel, Scrapy Python, Puppeteer, MySQL, mSQL, RabbitMQ

Industry: Retail and e-commerce
Cooperation: Since 2018
Location: USA

We streamlined the process, allowing data scraping to be executed in 100-150 streams simultaneously. We ended up extracting data for up to 4 million products from Walmart.

We set up synchronization with the external Azure SQL database. Overall, 20 million reviews were successfully collected from Amazon.

Introduction

Brands strive to protect themselves online by keeping their sales under control.

Intense competition has led to the adoption of numerous data-driven technological solutions. Over the past decade, brands have undergone a digital transformation, going omnichannel, closing brick-and-mortar stores, and switching their sales to online channels. Without an online presence, a brand or company is at a disadvantage. However, the growth of online sales has also brought violations of fair competition. The client partnered with us for a solution to combat unauthorized sellers, control and grow online sales, achieve MAP compliance, eliminate channel conflicts, and protect brand value and customer experience.

The Solution

We devised multiple scrapers and built an admin panel to interact with the client.

This allowed us to exchange data more efficiently. Scraping was triggered by the keywords the client uploaded into the admin panel; our job was to scrape the sellers and products related to those keywords. That allowed us to scrape 4 million products from Walmart and 20 million reviews from Amazon.

Scraping giant platforms like Walmart and Amazon is a tough nut to crack, not only because of the sheer number of products and pages, but also because such websites adopt strict measures to limit scraping. It is not always obvious whether a run is succeeding, as product and catalogue pages differ in structure and can confuse the scraper logic.
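The keyword-to-spider handoff can be sketched roughly as follows. This is a minimal illustration using Python's standard-library queue; the function names and job shape are hypothetical, and in a production setup like the one described a message broker such as RabbitMQ would carry the jobs between the admin panel and the workers.

```python
from queue import Queue

def dispatch_keywords(keywords, job_queue):
    """Admin-panel side: push each uploaded keyword as one scraping job."""
    for kw in keywords:
        job_queue.put({"keyword": kw, "targets": ["walmart", "amazon"]})

def consume_jobs(job_queue, run_spider):
    """Worker side: pull jobs and trigger one spider run per keyword."""
    results = []
    while not job_queue.empty():
        job = job_queue.get()
        results.append(run_spider(job["keyword"], job["targets"]))
    return results
```

Decoupling the upload from the scraping this way lets either side scale independently: the panel only enqueues work, while any number of workers drain the queue.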

The challenge was to build not just a crawler, but a crawler that would run smoothly despite the vast amount and variety of input data it would be exposed to. The crawler needed to be highly resilient; we achieved this by combining request scheduling techniques with IP rotation to avoid identifiable bot behavior patterns. Listed below are some precautionary measures we followed throughout the process:

  • IP randomization

  • IP addresses within reasonable proximity of the store

  • Keeping the chosen IP for the whole scraping session

  • Changing the proxy pool every 24 hours
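The rotation policy above can be sketched in a few lines of Python. This is an illustrative model, not the production middleware: the class and method names are our own, and the `now` parameter exists only so the 24-hour window can be exercised deterministically.

```python
import random
import time

class ProxyPool:
    """Sketch of the rotation scheme: pick one IP at random, keep it for
    the whole scraping session, and swap the entire pool every 24 hours."""

    ROTATION_PERIOD = 24 * 60 * 60  # seconds

    def __init__(self, proxies, now=time.time):
        self._now = now
        self._proxies = list(proxies)
        self._loaded_at = now()
        self._session_ip = None

    def pool_expired(self):
        """True once the 24-hour rotation window has elapsed."""
        return self._now() - self._loaded_at >= self.ROTATION_PERIOD

    def refresh(self, new_proxies):
        """Swap in a fresh pool and force a new pick for the next session."""
        self._proxies = list(new_proxies)
        self._loaded_at = self._now()
        self._session_ip = None

    def session_proxy(self):
        """Pick an IP once, then reuse it for the duration of the session."""
        if self._session_ip is None:
            self._session_ip = random.choice(self._proxies)
        return self._session_ip

    def end_session(self):
        self._session_ip = None
```

Keeping the same IP within a session while rotating the pool daily mimics a returning human visitor rather than a fleet of bots, which is the point of the measures listed above.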

Walmart applies the AJAX technique to its pagination button, so we made the algorithm treat the completion of the loading process as the cue to proceed to the next page.
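The "loading as a cue" idea boils down to polling until the AJAX request settles. A minimal, framework-agnostic sketch is below; `is_loading` is a caller-supplied check (in practice it would query the DOM via a headless browser such as Puppeteer for a spinner or for the new page of results), and the `sleep` parameter is injectable purely so the helper can be tested without real delays.

```python
import time

def wait_for_load(is_loading, timeout=30.0, poll_interval=0.5, sleep=time.sleep):
    """Poll a loading indicator after clicking the pagination button and
    return once the AJAX-driven page update has finished."""
    waited = 0.0
    while is_loading():
        if waited >= timeout:
            raise TimeoutError("page did not finish loading")
        sleep(poll_interval)
        waited += poll_interval
    return True
```

The timeout matters: on a site with inconsistent page structures, a page that never finishes loading must fail fast rather than stall one of the parallel streams indefinitely.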


We streamlined the process, allowing data scraping to be executed in 100-150 streams simultaneously. This allowed us to collect 20 million customer reviews from Amazon within the duration of the project. For Walmart, pagination was repeated over 1,000 times for each provided keyword, and we ended up extracting data for up to 4 million products from the website.
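Capping the fan-out at 100-150 parallel streams can be expressed with a semaphore. The sketch below uses `asyncio` for illustration (the actual pipeline was built on Scrapy and RabbitMQ); `fetch` is a hypothetical caller-supplied coroutine that scrapes one keyword.

```python
import asyncio

MAX_STREAMS = 150  # the project ran 100-150 parallel streams

async def scrape_all(keywords, fetch, max_streams=MAX_STREAMS):
    """Run one scraping job per keyword concurrently, with at most
    `max_streams` jobs in flight at any moment."""
    sem = asyncio.Semaphore(max_streams)

    async def bounded(kw):
        async with sem:           # blocks while max_streams jobs are running
            return await fetch(kw)

    return await asyncio.gather(*(bounded(kw) for kw in keywords))
```

Bounding concurrency serves two purposes at once: it keeps throughput high enough to cover millions of products, while staying below the request rates that trigger a platform's anti-bot defenses.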

Alex Yudin
Web Scraping Team Lead
The Result

Our client has successfully launched an eControl service for its clients, and it is currently helping dozens of US brands stay protected online.

The data pipeline enables the company’s legal investigation of unfair retail sales practices. As a result, the client has been using the collected data to counter unfair competition on behalf of big brands and to prevent brand erosion caused by price dumping. Working with us ensured they received a stable flow of fresh, quality data on the provided keywords, products, and suppliers. Our cooperation is still ongoing.
