[LeetCode] Web Crawler

1236. Web Crawler

Given a url startUrl and an interface HtmlParser, implement a web crawler to crawl all links that are under the same hostname as startUrl.

Return all urls obtained by your web crawler in any order.

Your crawler should:

  • Start from the page: startUrl
  • Call HtmlParser.getUrls(url) to get all urls from the webpage at the given url.
  • Do not crawl the same link twice.
  • Explore only the links that are under the same hostname as startUrl.

The hostname is the part of a url between the protocol and the path; for example, the hostname of http://example.org/test is example.org. For simplicity's sake, you may assume all urls use the http protocol without any port specified. For example, the urls http://leetcode.com/problems and http://leetcode.com/contest are under the same hostname, while the urls http://example.org/test and http://example.com/abc are not.
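The hostname rule can be sketched as a small helper. This is only an illustration under the assumptions stated in the problem (http protocol only); the names hostnameOf and sameHostname are hypothetical and do not appear in the original solution.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: returns the hostname of an "http://" url.
// Assumes the url starts with "http://", as the problem statement allows.
std::string hostnameOf(const std::string& url) {
    const std::string scheme = "http://";
    assert(url.compare(0, scheme.size(), scheme) == 0 && "expects an http:// url");
    // The hostname ends at the first '/' or ':' after the scheme,
    // or at the end of the string if neither is present.
    std::size_t end = url.find_first_of("/:", scheme.size());
    if (end == std::string::npos) end = url.size();
    return url.substr(scheme.size(), end - scheme.size());
}

// Hypothetical helper: true when both urls share the same hostname.
bool sameHostname(const std::string& a, const std::string& b) {
    return hostnameOf(a) == hostnameOf(b);
}
```

With this sketch, sameHostname("http://leetcode.com/problems", "http://leetcode.com/contest") is true, while sameHostname("http://example.org/test", "http://example.com/abc") is false.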

The HtmlParser interface is defined as such:

interface HtmlParser {
    // Return a list of all urls from a webpage of given url.
    public List<String> getUrls(String url);
}

The problem statement gives two examples explaining the expected behavior. For custom testing purposes, you'll have three variables: urls, edges and startUrl. Note that you will only have access to startUrl in your code; urls and edges are not directly accessible to you.
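For local experiments, a stand-in parser can be built from urls and edges. The sketch below is hypothetical: the class name MockHtmlParser is invented here, and it assumes each edge is a pair (from, to) of indices into urls meaning that page urls[from] links to urls[to]. The real HtmlParser used by the judge is opaque.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-in for local testing only.
// Assumption: edge (from, to) means urls[from] contains a link to urls[to].
class MockHtmlParser {
    std::unordered_map<std::string, std::vector<std::string>> links_;
public:
    MockHtmlParser(const std::vector<std::string>& urls,
                   const std::vector<std::pair<int, int>>& edges) {
        const int n = static_cast<int>(urls.size());
        for (const auto& e : edges) {
            // Guard against indices that do not refer to a known url.
            assert(e.first >= 0 && e.first < n);
            assert(e.second >= 0 && e.second < n);
            links_[urls[e.first]].push_back(urls[e.second]);
        }
    }

    // Returns the outgoing links of url, or an empty list for unknown pages.
    std::vector<std::string> getUrls(const std::string& url) const {
        auto it = links_.find(url);
        if (it == links_.end()) return {};
        return it->second;
    }
};
```

A crawler written against getUrls(url) can then be exercised with a small hand-made graph before being submitted.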

/**
 * // This is the HtmlParser's API interface.
 * // You should not implement it, or speculate about its implementation
 * class HtmlParser {
 *   public:
 *     vector<string> getUrls(string url);
 * };
 */

class Solution {
    // Returns "http://" followed by the hostname of url,
    // e.g. "http://example.org" for "http://example.org/test".
    string getHost(const string& url) {
        size_t i = 7; // skip the leading "http://"
        string host = "";
        // The hostname ends at the first ':' or '/', or at the end of the url.
        while(i < url.length() and url[i] != ':' and url[i] != '/') {
            host.push_back(url[i++]);
        }
        return "http://" + host;
    }
public:
    vector<string> crawl(string startUrl, HtmlParser htmlParser) {
        vector<string> res{startUrl};
        queue<string> q;
        unordered_set<string> vis{startUrl}; // urls already queued
        q.push(startUrl);
        string host = getHost(startUrl);
        // Breadth-first search over the link graph.
        while(!q.empty()) {
            auto url = q.front(); q.pop();
            for(auto& next : htmlParser.getUrls(url)) {
                // Follow only unvisited links under the same hostname.
                if(!vis.count(next) and getHost(next) == host) {
                    vis.insert(next);
                    q.push(next);
                    res.push_back(next);
                }
            }
        }
        return res;
    }
};

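
The same crawl can also be written as a depth-first search instead of the breadth-first search above. The sketch below is an alternative, not the author's method: the names GetUrls, hostOf, dfs and crawlDfs are hypothetical, and the parser is passed as a std::function so the block compiles without the judge's HtmlParser class.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical callback type standing in for HtmlParser::getUrls.
using GetUrls = std::function<std::vector<std::string>(const std::string&)>;

// Hostname of an "http://" url (assumes the scheme is present).
std::string hostOf(const std::string& url) {
    const std::size_t start = 7; // length of "http://"
    const std::size_t end = url.find_first_of("/:", start);
    if (end == std::string::npos) return url.substr(start);
    return url.substr(start, end - start);
}

// Visits url, then recurses into unseen links under the same hostname.
void dfs(const std::string& url, const std::string& host, const GetUrls& getUrls,
         std::unordered_set<std::string>& seen, std::vector<std::string>& out) {
    out.push_back(url);
    for (const std::string& next : getUrls(url)) {
        // insert(...).second is true only the first time a url is seen.
        if (hostOf(next) == host && seen.insert(next).second) {
            dfs(next, host, getUrls, seen, out);
        }
    }
}

std::vector<std::string> crawlDfs(const std::string& startUrl, const GetUrls& getUrls) {
    assert(startUrl.compare(0, 7, "http://") == 0 && "expects an http:// url");
    std::unordered_set<std::string> seen{startUrl};
    std::vector<std::string> out;
    dfs(startUrl, hostOf(startUrl), getUrls, seen, out);
    return out;
}
```

Both versions visit each reachable same-host url once. The recursive form is shorter, but its call depth grows with the longest chain of links, so the queue-based version is the safer choice for very deep link graphs.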
Author: Song Hayoung
Link: https://songhayoung.github.io/2022/06/13/PS/LeetCode/web-crawler/
Copyright Notice: All articles in this blog are licensed under CC BY-NC-SA 4.0 unless stating additionally.