1. RSS feeds;
2. Websites’ APIs;
3. Web scraping: with PHP, you can use curl; with Java, you can use HttpClient.
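As a rough illustration of the third option, here is a minimal Java sketch that downloads a page as text. It assumes Java 11 or later and uses the JDK's built-in java.net.http.HttpClient, which may not be the same HttpClient library meant above (Apache HttpClient is another common choice). The class name PageFetcher, the User-Agent string, and the example URL are placeholders made up for this sketch.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

class PageFetcher {

    // Build a plain GET request. The User-Agent value is a placeholder;
    // use something that identifies your own scraper.
    static HttpRequest buildRequest(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .timeout(Duration.ofSeconds(10))
                .header("User-Agent", "example-scraper/0.1")
                .GET()
                .build();
    }

    // Send the request and return the response body as a string.
    static String fetch(String url) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();
        HttpResponse<String> response =
                client.send(buildRequest(url), HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IOException("Unexpected HTTP status: " + response.statusCode());
        }
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        // Placeholder URL; pass the page you want to scrape as the first argument.
        String url = args.length > 0 ? args[0] : "https://example.com/";
        String html = fetch(url);
        System.out.println(html.length() + " characters fetched from " + url);
    }
}
```

The returned HTML still has to be parsed to extract the data you want; this sketch only covers the download step.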
I ran into a problem when I ran this command in Nutch 1.3:
bin/nutch crawl urls -dir crawl -depth 3 -topN 5
The error was:
no agents listed in 'http.agent.name' property
I added a value for http.agent.name to nutch-site.xml and nutch-default.xml, but the problem remained.
I searched the Internet for help and tried several of the suggestions I found. In the end, I solved the problem.
The value has to be added to the files “nutch-site.xml” and “nutch-default.xml” under the folder runtime/local, not to the files under the root folder.
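For reference, the property entry looks roughly like the fragment below. This is only a sketch: the agent name MyNutchSpider is a made-up placeholder, and I am assuming the config files sit in the conf directory under runtime/local, as in a standard Nutch 1.3 layout.

```xml
<?xml version="1.0"?>
<!-- Assumed location: runtime/local/conf/nutch-site.xml -->
<configuration>
  <property>
    <name>http.agent.name</name>
    <!-- Placeholder value; choose a name that identifies your own crawler -->
    <value>MyNutchSpider</value>
  </property>
</configuration>
```

After saving the file, run the same bin/nutch crawl command again from the runtime/local folder.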