How Artificial Intelligence Can Be Applied to Web Data Extraction

Artificial intelligence is not a new topic. A lot has been written about it, and it has been a popular theme of sci-fi movies for decades. However, only recently have we started seeing AI in action. Thanks to ever-increasing computing power, our machines are now much faster and more powerful, which gives a huge boost to AI. Artificial intelligence needs a great deal of computing power to approach truly intelligent behavior and mimic the human brain.

AI is finding its way into many of the everyday objects we use. The voice assistant apps on your smartphone are a great example of this. Facebook’s face recognition algorithm is another example of intelligent pattern recognition technology in action. We believe that extracting data from the web is something humans shouldn’t be burdened with. Artificial intelligence could be the right solution for aggregating huge data sets from the web with minimal manual intervention.

Artificial Intelligence vs. Machine Learning

Machine learning is generally considered a branch of artificial intelligence, but there is a stark difference between narrowly trained systems and ones that learn more autonomously. In machine learning as discussed here, you teach the machine to do something within narrowly defined rules, along with some training examples. The training and rules are necessary for the machine learning system to achieve some level of success in the process it’s being taught. In artificial intelligence, by contrast, the system teaches itself with a minimal number of rules and loose training. It can then go on to make rules for itself from the exposure it gets, which contributes to the continued learning process. This is made possible by artificial neural networks. Artificial neural networks and deep learning are used in artificial intelligence for speech and object recognition, image segmentation, language modeling and the modeling of human motion.

Artificial intelligence in web data extraction

The web is a giant repository where data is vast and abundant. The possibilities that come with this amount of data can be groundbreaking. The challenge is to navigate this unstructured pile of information on the web and extract what is needed. It takes a lot of time and effort to scrape data from the web, even with advanced web scraping technologies. But things are about to change. Researchers from the Massachusetts Institute of Technology recently released a paper on an artificial intelligence system that can extract information from sources on the web and learn how to do it on its own.

The research paper introduces an information extraction system that can automatically extract structured data from unstructured documents. To put it simply, the system approaches a document the way a human would. When humans cannot find a particular piece of information in a document, we look for alternative sources to fill the gap. This adds to our knowledge of the topic in question. The AI system works in much the same way.

The AI system works on rewards and penalties

The AI-based data extraction system classifies the data it extracts and attaches a ‘confidence score’ to each classification. This score estimates the probability that the classification is correct and is derived from patterns in the training data. If the confidence score doesn’t meet the set threshold, the system automatically searches the web for more relevant data. Once extracting new data from the web and integrating it with the current document yields an adequate confidence score, the system deems the task successful. If the threshold is still not met, the process continues until the most relevant data has been pulled out.
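
The threshold loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the researchers' implementation: the names Extraction, toy_extract and extract_with_threshold are hypothetical, the 0.8 threshold is arbitrary, the "model" is a regular expression with made-up confidence values, and "integrating" new data is simplified to keeping the most confident value seen so far.

```python
import re
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class Extraction:
    value: str
    confidence: float  # assumed to be a probability in [0, 1]


def toy_extract(text: str) -> Extraction:
    """Hypothetical stand-in for a trained extractor: finds '<number> injured'."""
    match = re.search(r"(\d+) (?:people )?injured", text)
    if match is None:
        return Extraction(value="", confidence=0.0)
    # Made-up rule: hedged wording lowers the confidence of the extraction.
    hedged = re.search(r"\b(about|at least|reportedly)\b", text, re.IGNORECASE)
    return Extraction(value=match.group(1), confidence=0.5 if hedged else 0.9)


def extract_with_threshold(
    document: str,
    extra_sources: Iterable[str],
    extract: Callable[[str], Extraction],
    threshold: float = 0.8,
) -> Extraction:
    """Consult additional sources until the confidence threshold is met."""
    best = extract(document)
    for source in extra_sources:
        if best.confidence >= threshold:
            break  # confident enough, stop looking for more data
        candidate = extract(source)
        if candidate.confidence > best.confidence:
            best = candidate  # simplified "integration": keep the better value
    return best


result = extract_with_threshold(
    "Police said about 10 people injured.",
    ["No casualty figures yet.", "Officials confirmed 12 injured."],
    toy_extract,
)
print(result)
```

In this toy run the original document only yields a hedged figure, so the loop moves on, skips the source that contains no figure, and settles on the unhedged report. A real system would also have to decide which sources to query and how to reconcile conflicting values, which is where the learning comes in.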

This type of learning mechanism is called ‘reinforcement learning’ and works on the notion of learning by reward. It’s very similar to how humans learn. Because there can be a lot of uncertainty in the data being merged, especially where sources contradict each other, rewards are given based on the accuracy of the information. With the training provided, the AI learns how to optimally merge different pieces of data so that the answers we get from the system are as accurate as possible.
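
To make the idea of learning by reward concrete, here is a deliberately tiny sketch. It is an assumption-laden toy, not the system from the paper: the agent only chooses between keeping the current value and replacing it with a newly found one, the environment is simulated (the more confident value is correct 90 percent of the time), and the learner is a simple epsilon-greedy value table rather than a neural network. All names are hypothetical.

```python
import random

ACTIONS = ("keep", "replace")
STATES = ("new_more_confident", "new_less_confident")


def simulate_reward(state, action, rng):
    """Toy environment: +1 if the chosen value is correct, -1 otherwise."""
    # Assumption: the more confident of the two values is right 90% of the time.
    confident_is_correct = rng.random() < 0.9
    # "replace" picks the new value; it is the confident one only in the first state.
    chose_confident = (action == "replace") == (state == "new_more_confident")
    return 1.0 if chose_confident == confident_is_correct else -1.0


def train(episodes=2000, epsilon=0.1, seed=0):
    """Learn the average reward of each (state, action) pair by trial and error."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
    counts = {key: 0 for key in q}
    for _ in range(episodes):
        state = rng.choice(STATES)
        if rng.random() < epsilon:
            action = rng.choice(ACTIONS)  # explore
        else:
            action = max(ACTIONS, key=lambda a: q[(state, a)])  # exploit
        reward = simulate_reward(state, action, rng)
        counts[(state, action)] += 1
        # Incremental running average of the rewards seen for this pair.
        q[(state, action)] += (reward - q[(state, action)]) / counts[(state, action)]
    return q


def best_action(q, state):
    """Return the action with the highest learned value in the given state."""
    return max(ACTIONS, key=lambda a: q[(state, a)])


q_values = train()
for state in STATES:
    print(state, "->", best_action(q_values, state))
```

Nothing in the code tells the agent to trust the more confident source. It arrives at that merge policy only because accurate choices are rewarded and inaccurate ones are penalised, which is the core of the mechanism described above.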

AI in action

To test how well the artificial intelligence system can extract data from the web, the researchers gave it a test task. The system was to analyse various data sources on mass shootings in the USA and extract the name of the shooter, the number of injured, the number of fatalities and the location. The performance was impressive: it pulled out accurate data in the form required and beat conventionally trained data extraction mechanisms by more than 10 percent.

The future of data extraction

With the ever-increasing need for data and the challenges associated with acquiring it, AI could be the missing piece of the equation. The research is promising and hints at a future where intelligent bots can read and crawl web documents the way humans do and tell us the bits we need to know.

The AI system could be a game changer for research tasks that currently require a lot of manual work. A system like this will not only save time but also enable us to make use of the abundance of information out there on the web. Looking at the bigger picture, this new research is only a step towards creating a truly intelligent web spider that can master a variety of tasks, just like humans, rather than being focused on just one process.


