Extracting Product Info from Generic Web Sites with No Additional Configuration

This model shows how NLP-based information extraction can be added to your processing pipeline with no additional configuration. Run it on a distributor's website, and the output is a structured document containing the relevant product information.

Problem

The client wanted structured product information, delivered as an Excel file, from the websites of different distributors, each with its own site structure. The required data fields included product name, product image, ASIN, category, seller name, product cost, shipping cost, quantity (QTY), star rating, reviews, and many more. Standard website crawling techniques were not practical: the websites differed so much that engineers would have had to write extraction rules for each one separately. A single website could have from 100 to 50,000 products.
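To make the target output concrete, the data fields listed above can be sketched as a flat record with one optional slot per field. This is only an illustration: the class name `ProductRecord`, the snake_case field names, and the helper `to_rows` are hypothetical, and the list covers only the fields named in the text. All values are kept as strings, since the source does not specify types.

```python
from dataclasses import dataclass, asdict, fields
from typing import List, Optional


@dataclass
class ProductRecord:
    """Hypothetical record holding the fields named in the problem statement."""
    product_name: Optional[str] = None
    product_image: Optional[str] = None
    asin: Optional[str] = None
    category: Optional[str] = None
    seller_name: Optional[str] = None
    product_cost: Optional[str] = None
    shipping_cost: Optional[str] = None
    qty: Optional[str] = None
    star: Optional[str] = None
    reviews: Optional[str] = None


def to_rows(records: List[ProductRecord]) -> List[List[str]]:
    """Return a header row followed by one row per record.

    Missing fields become empty cells, so every row has the same width
    and can be written to a spreadsheet as-is.
    """
    header = [f.name for f in fields(ProductRecord)]
    rows = [header]
    for record in records:
        values = asdict(record)
        rows.append(["" if values[name] is None else values[name] for name in header])
    return rows
```

A record with only some fields filled still yields a full-width row, which keeps products from differently structured sites aligned in the same columns.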

Solution

We trained a named-entity recognition (NER) model that takes the HTML of a product webpage as input and returns a JSON object with all the required data, without any per-site configuration. We were able to train this model on an existing price list from different distributors. The training data covered 10 different websites plus 1,000 additional examples generated with our named-entity text generation library. We tested the model on 20 more websites, where it achieved 94% accuracy per field.
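The HTML-to-JSON flow described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual implementation: it assumes the model labels whitespace-separated tokens with BIO tags (`B-field`, `I-field`, `O`), which the text does not confirm, and the names `TextExtractor`, `html_to_tokens`, and `decode_bio` are hypothetical. The trained model itself is not shown; in the usage example, a hand-written tag list stands in for its predictions.

```python
import json
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text tokens, skipping script and style content."""

    def __init__(self):
        super().__init__()
        self.tokens = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.tokens.extend(data.split())


def html_to_tokens(html):
    """Turn raw page HTML into a flat list of whitespace-separated tokens."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.tokens


def decode_bio(tokens, tags):
    """Group BIO-tagged tokens into {field: text}; the first span per field wins."""
    record = {}
    current_field, current_tokens = None, []

    def flush():
        # Store the span collected so far, unless the field is already filled.
        if current_field is not None and current_field not in record:
            record[current_field] = " ".join(current_tokens)

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            flush()
            current_field, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_field == tag[2:]:
            current_tokens.append(token)
        else:
            flush()
            current_field, current_tokens = None, []
    flush()
    return record


if __name__ == "__main__":
    page = ("<html><body><h1>Acme Widget</h1>"
            "<script>var x = 1;</script><span>$ 9.99</span></body></html>")
    tokens = html_to_tokens(page)
    # Assumption: these tags are written by hand in place of model predictions.
    tags = ["B-product_name", "I-product_name", "O", "B-product_cost"]
    print(json.dumps(decode_bio(tokens, tags)))
```

Because nothing in this flow depends on a site's tag names or CSS classes, the same code path applies to every distributor; only the tagger has to generalize across page layouts.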