Hemenway K., Calishain T. - Spidering Hacks :: Electronic Library of the Board of Trustees of the MSU Faculty of Mechanics and Mathematics


Hemenway K., Calishain T. — Spidering Hacks


Title: Spidering Hacks

Authors: Hemenway K., Calishain T.

Abstract:

The Internet, with its profusion of information, has made us hungry for ever more, ever better data. Out of necessity, many of us have become pretty adept with search engine queries, but there are times when even the most powerful search engines aren't enough. If you've ever wanted your data in a different form than it's presented, or wanted to collect data from several sites and see it side-by-side without the constraints of a browser, then Spidering Hacks is for you.

Spidering Hacks takes you to the next level in Internet data retrieval, beyond search engines, by showing you how to create spiders and bots to retrieve information from your favorite sites and data sources. You'll no longer feel constrained by the way host sites think you want to see their data presented; you'll learn how to scrape and repurpose raw data so you can view it in a way that's meaningful to you.

Written for developers, researchers, technical assistants, librarians, and power users, Spidering Hacks provides expert tips on spidering and scraping methodologies. You'll begin with a crash course in spidering concepts, tools (Perl, LWP, out-of-the-box utilities), and ethics (how to know when you've gone too far: what's acceptable and unacceptable). Next, you'll collect media files and data from databases. Then you'll learn how to interpret and understand the data, repurpose it for use in other applications, and even build authorized interfaces to integrate the data into your own content. By the time you finish Spidering Hacks, you'll be able to:

* Aggregate and associate data from disparate locations, then store and manipulate the data as you like
* Gain a competitive edge in business by knowing when competitors' products are on sale, and comparing sales ranks and product placement on e-commerce sites
* Integrate third-party data into your own applications or web sites
* Make your own site easier to scrape and more usable to others
* Keep up-to-date with your favorite comic strips, news stories, stock tips, and more without visiting the site every day

Language:

Category: Technology/

Subject index status: index with page numbers is ready


Year of publication: 2003

Number of pages: 424

Added to catalog: 15.06.2007


Subject index

rsync, mirroring web sites with
SafeSearch filtering mechanism
sales statistics, publishing Amazon.com Associates
Sample
saving daily horoscopes on
scattersearching
scheduling tasks without
scheduling tasks without cron
scrapers and spiders, difference between
Scraping
scraping competitive data
scraping with
Script Schedule
scripts, adding progress bars to
Search engine robots web site
search form
search request program
search request to
search results
Searching
searching code
searching for authors
searching instead of ISBN
searching LOC call numbers instead of
Seattle's King County database of restaurant inspections
secured access and browser attributes
sending
Shared RRD module
shell scripts
Sifry, Dave
signatures, software
simulating a POST
Six Degrees of Kevin Bacon
Slashcode
Slashdot
sleep statement (Perl)
SMIL (Synchronized Multimedia Integration Language) files
SOAP-based Google Web Services API
SOAP::Lite package
software for
software packages
Sort::Array module
spaces
specific information, locating and gathering
Spider-Man theme song
spidering
Spiders
sprintf
stock prices, collecting
structure of
Synchronized Multimedia Integration Language (SMIL) files
Syndic8
syndicated news feeds
syndication
Tang, Autrijus
Technorati
Technorati and
Template Toolkit
Template::Extract
Template::Extract module
Template::Generate
Template::Generate module
Templates
Term::ProgressBar
Term::ProgressBar module
Terms of Service (TOS)
Terms of Use (TOU)
Testing
text      [See content]
Text::Diff
Text::Diff module
Text::Template
Text::Template module
that track legitimate spiders
Thesaurus
Thesaurus.com
Time::JulianDay
Time::JulianDay module
titles
to find data including friends or recommendations
to get book metadata and weblog mentions
Toftum, Mads (contributor)
tools, using correct
Top 20 searching
Top 20 searching on Google
TOS (Terms of Service)
TOU (Terms of Use)
Tracking      [See Link Cosmos]
tracking additions to
tracking packages with FedEx
tracking search results
traffic statistics, aggregating
train connections, finding faster
TREE      [See Favorites tree]
Trees
trendspotting with geotargeting
Truskett, Iain (contributor)
turning into positions

TV Guide Online
TV listings
TV listings, scraping
tvlisting
U.S. Census web site
Udell, Jon
Unicode.org
United States Post Office
UNIX
Unix and Mac OS X installation
Unix and Mac OS X installation of Perl
URI
URI module      2nd
URI::Escape
URI::Escape module
URLs
Usenet, downloading from
User Agent Database web site
using established universal taxonomy
using existing programs
using good
using regular expressions
using specific
using to automate tasks
using to find related web sites
using to repurpose data
using to scrape across multiple domains
utilities
utilities, running multiple
VersionTracker
virtual browsers
visual indicators
visual indicators when downloading
Vitiello, Eric (contributor)
VoodooPad
watching printers
Weather Underground
weather, identifying visitor's
Weather::Underground
Weather::Underground module
Web Robots Database web site
Web Scraping Proxy
Web Services
Web Services API, SOAP-based
Web Services ASIN query
Web sites
web sites having problems with use of
web sites that track legitimate
webcams, archiving
weblog
weblog-free Google results
weblog-free results
weblogs      [See also blogs]
Webmaster World      2nd
Webshots
wget utility
what is indexed
What's New page
who may not want to be spidered
why use this technology
Win32::Sound
Win32::Sound module
Windows
Windows and Perl
Wired Bots
with cron
with LWP::Simple
with REST interface
with XML-RPC
word lookup
WWW::Mechanize      [See WWW::Mechanize module]
WWW::Mechanize module      2nd
WWW::Yahoo::Groups
WWW::Yahoo::Groups module
XBox games
Xerces for Java
XHTML (Extensible Hypertext Markup Language) files
XML
XML (Extensible Markup Language) files
XML and
XML-RPC
XML::LibXML
XML::LibXML module
XML::RSS      [See XML::RSS module]
XML::RSS module
XMLRPC::Lite      [See XMLRPC::Lite module]
XMLRPC::Lite module      2nd
XPath
Yahoo!
Yahoo! Buzz
Yahoo! Catalog, spidering
Yahoo! Groups
Yahoo!'s news photo archive
yahoo2mbox
Zeitgeist page
Zeitlin, Vadim
Zip codes
Zope


© Electronic Library of the Board of Trustees of the MSU Faculty of Mechanics and Mathematics, 2004-2026
