Wednesday, August 23, 2017

Web crawler

   Do you know what they are? Are they hardware or software?
   Well, a Web crawler, sometimes called a spider, is an Internet bot (a piece of software) that
systematically browses the World Wide Web, typically for the purpose
of Web indexing (web spidering).
Web search engines and some other sites use Web crawling or spidering
software to update their own web content or their indices of other sites' web
content. Web crawlers can copy all the pages they visit for later
processing by a search engine, which indexes the downloaded pages so
that users can search much more efficiently.
   The web is like an ever-growing library with billions of books and
no central filing system. Software known as web crawlers is used to
discover publicly available webpages. Crawlers look at webpages and
follow links on those pages, much like you would if you were browsing
content on the web. They go from link to link and bring data about those
webpages back to servers.
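The fetch-and-follow loop described above can be sketched in a few lines of Python. This is a minimal, hypothetical sketch, not how any real search engine's crawler is built: it takes a `fetch` callable (so it works equally on the live web or on test data), visits pages breadth-first, and collects the links it finds.

```python
from collections import deque
from html.parser import HTMLParser

class LinkParser(HTMLParser):
    """Collects href values from <a> tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url, fetch, max_pages=10):
    """Breadth-first crawl: fetch a page, record its content,
    then queue every link on it that has not been seen yet.
    `fetch` is any callable mapping a URL to HTML text."""
    seen = {start_url}
    queue = deque([start_url])
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        pages[url] = html
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

A real crawler would add the details the sketch leaves out: resolving relative URLs, respecting robots.txt, rate-limiting requests, and retrying failures.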
  When the spider looks at an HTML page, it takes note of two things:
  • The words within the page
  • Where the words were found
  Words occurring in the title, subtitles, meta tags and other positions
of relative importance are noted for special consideration during a
subsequent user search. The spider is built to index every significant
word on a page, leaving out the articles "a," "an" and "the."
