Russian National Corpus Web Scraping Project 2019-2020
Russian-Verb - Russian National Corpus- Mining
  • Perry B. Koob
  • Irina V. Ivliyeva, Missouri University of Science and Technology
Abstract

Many websites, in particular those that serve content from a content management system or database, deliver their content as HTML with an underlying computer-generated structure that is then visually formatted and styled using Cascading Style Sheets (CSS) and JavaScript.

Additionally, when a website uses a web form to query and return results, the web server or application server reads the web address, parses it for parameters, and passes those parameters to the database behind the website, where they control the results returned.
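
As a minimal sketch of the parameter handling described above, the following Python snippet builds a form-style web address and then recovers its parameters the way a server-side parser might. The endpoint `https://example.org/search` and the parameter names `lemma` and `page` are hypothetical, chosen only for illustration; they are not taken from the Russian National Corpus or from the report.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical search endpoint, used for illustration only.
BASE_URL = "https://example.org/search"


def build_query_url(base_url, params):
    """Append form parameters to the base address as a query string."""
    return f"{base_url}?{urlencode(params)}"


def read_parameters(url):
    """Recover the parameters from the address, keeping the first value of each."""
    query = urlsplit(url).query
    return {name: values[0] for name, values in parse_qs(query).items()}


# Hypothetical parameter names: a search term and a results page number.
url = build_query_url(BASE_URL, {"lemma": "читать", "page": 2})
print(url)
print(read_parameters(url))
```

Note that every value comes back from the address as a string, so a server would have to convert `page` to a number before using it in a database query.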

Several techniques exploit these facts to extract large amounts of data from the backend database behind a website through a series of crafted web page requests.

Collectively, these techniques are called web scraping.

Two of the key techniques of web scraping are URL hacking and HTML parsing.
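
The two techniques can be sketched together in Python, under stated assumptions. URL hacking is shown here as generating a series of addresses by varying one parameter, and HTML parsing as pulling the text out of result elements with the standard-library `html.parser` module. The endpoint, the parameter names `lemma` and `page`, the `result` class name, and the sample page are all hypothetical; the actual requests and markup used in the project are not described in this record.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode

# Hypothetical search endpoint, used for illustration only.
BASE_URL = "https://example.org/search"


def page_urls(base_url, lemma, pages):
    """URL hacking: craft one address per results page by varying 'page'."""
    return [
        f"{base_url}?{urlencode({'lemma': lemma, 'page': number})}"
        for number in range(1, pages + 1)
    ]


class ResultParser(HTMLParser):
    """HTML parsing: collect the text of every <li class="result"> element."""

    def __init__(self):
        super().__init__()
        self.results = []
        self._in_result = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs.
        if tag == "li" and ("class", "result") in attrs:
            self._in_result = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "li" and self._in_result:
            self.results.append("".join(self._buffer).strip())
            self._in_result = False

    def handle_data(self, data):
        # Text inside nested tags such as <b> is also captured.
        if self._in_result:
            self._buffer.append(data)


def extract_results(html):
    """Return the text of each result element found in the page."""
    parser = ResultParser()
    parser.feed(html)
    parser.close()
    return parser.results


# Hypothetical markup standing in for a downloaded results page.
SAMPLE_PAGE = """
<ul>
  <li class="result">Он любил <b>читать</b> книги.</li>
  <li class="nav">Next page</li>
  <li class="result">Я буду <b>читать</b> вслух.</li>
</ul>
"""

for address in page_urls(BASE_URL, "читать", 3):
    print(address)
print(extract_results(SAMPLE_PAGE))
```

In a real scraper each generated address would be fetched and its response passed to `extract_results`; the fetch step is left out here so the sketch runs without network access.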

Department(s)
Arts, Languages, and Philosophy
Document Type
Technical Report
Document Version
Final Version
File Type
text
Language(s)
English
Language 2
Russian
Rights
© 2020 The Authors, All rights reserved.
Publication Date
01 Feb 2020
Citation Information
Koob, P., Ivliyeva, I. Russian National Corpus Web Scraping Project 2019-2020. Missouri S&T, IT and ALP departments. [Electronic resource].