Latest commit: ba5a520c21 by JustAnotherArchivist, 3 weeks ago
  Merge pull request #544 from JustAnotherArchivist/ignore-fc2-blog-2nt

Name                      Last commit (age)
bot                       Add !reason alias for !explain and --reason for --explain (1 year ago)
cogs                      Load igsets and UAs into CouchDB on cogs start (2 years ago)
config                    Introduce Cucumber for integration testing. (9 years ago)
dashboard                 dashboard: mention the Chromium timer throttling issue (7 months ago)
db                        Add 2NT's blogging platform to fc2-blog (same backend software as FC2) (3 weeks ago)
doc                       Add !reason alias for !explain and --reason for --explain (1 year ago)
lib                       Notify IRC channel on pipeline changes (3 years ago)
ops                       Add a hacky script for clearing a job's cookie jar (2 years ago)
pipeline                  Fix compatibility with PyYAML 6.0 (mandatory `Loader`) (7 months ago)
plumbing                  Add some error handling (3 years ago)
spec                      Remove PhantomJS support (3 years ago)
test                      Remove RSYNC_URL environment variable for pipeline (3 years ago)
uploader                  Fix syntax warnings in uploader (2 years ago)
viewer                    Add link to archivelab WARC viewer (7 years ago)
.gitignore                tests+travis: Add db/ JSON validation (5 years ago)
.gitmodules               redis-lua is no longer required. (9 years ago)
.travis.yml               Fix get-pip.py URL for Python 3.5 tests (2 years ago)
Gemfile                   Fix webmachine-sprockets dependency (2 years ago)
Gemfile.lock              Fix webmachine-sprockets dependency (2 years ago)
INSTALL.backend           New dashboard WebSocket server (3 years ago)
INSTALL.pipeline          Support weaker SSL/TLS connections for a broader compatibility with outdated web servers (3 years ago)
LICENSE                   Relicense as MIT. (9 years ago)
README                    Updated grab-site repo URL in README (4 years ago)
README.pipeline-recovery  Add readme file for how to save off as much data as possible if a (2 years ago)
Rakefile                  Remove pointless features. (8 years ago)

README

1. ArchiveBot

<SketchCow> Coders, I have a question.
<SketchCow> Or, a request, etc.
<SketchCow> I spent some time with xmc discussing something we could
do to make things easier around here.
<SketchCow> What we came up with is a trigger for a bot, which can
be triggered by people with ops.
<SketchCow> You tell it a website. It crawls it. WARC. Uploads it to
archive.org. Boom.
<SketchCow> I can supply machine as needed.
<SketchCow> Obviously there's some sanitation issues, and it is root
all the way down or nothing.
<SketchCow> I think that would help a lot for smaller sites
<SketchCow> Sites where it's 100 pages or 1000 pages even, pretty
simple.
<SketchCow> And just being able to go "bot, get a sanity dump"

2. More info

ArchiveBot has two major backend components: the control node, which
runs the IRC interface and bookkeeping programs, and the crawlers, which
do all the Web crawling. ArchiveBot users communicate with ArchiveBot
by issuing commands in an IRC channel.

User's guide: http://archivebot.readthedocs.org/en/latest/
Control node installation guide: INSTALL.backend
Crawler installation guide: INSTALL.pipeline
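
The IRC command interface described above can be illustrated with a
small sketch. This is not ArchiveBot's actual code: the function name
parse_command and both alias tables are hypothetical, and the only
aliases shown are the ones named in this repository's commit history
(!reason for !explain, --reason for --explain). It shows one plausible
way a control node might turn a channel message into a command name
and an argument list before doing any bookkeeping.

```python
# Hypothetical alias tables. Only these two aliases are mentioned in
# the repository history; everything else here is an assumption.
COMMAND_ALIASES = {"reason": "explain"}
OPTION_ALIASES = {"--reason": "--explain"}


def parse_command(message):
    """Split a channel message into (command, args).

    Returns None when the message is not a bot command, i.e. when it
    does not start with '!' or has no command name after the '!'.
    """
    text = message.strip()
    if not text.startswith("!"):
        return None

    parts = text[1:].split()
    if not parts:
        return None

    # Command names are matched case-insensitively, then aliases are
    # resolved to their canonical names.
    name = parts[0].lower()
    command = COMMAND_ALIASES.get(name, name)

    # Option aliases are rewritten; all other arguments pass through.
    args = [OPTION_ALIASES.get(arg, arg) for arg in parts[1:]]
    return command, args
```

Ordinary chat lines yield None, so a caller can ignore them cheaply;
anything else comes back in canonical form, which keeps alias handling
in one place. See the user's guide for the real command set.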

3. Local use

ArchiveBot was originally written as a set of separate programs for
deployment on a server, which makes it awkward to package and install.
However, Ivan Kozik (@ivan) has taken the ArchiveBot pipeline,
dashboard, ignores, and control system and created a package intended
for personal use. You can find it at
https://github.com/ArchiveTeam/grab-site.

4. License

Copyright 2013 David Yip; made available under the MIT license. See
LICENSE for details.

5. Acknowledgments

Thanks to Alard (@alard), who added WARC generation and Lua scripting to
GNU Wget. Wget+lua was the first web crawler used by ArchiveBot.

Thanks to Christopher Foo (@chfoo) for wpull, ArchiveBot's current web
crawler.

Thanks to Ivan Kozik (@ivan) for maintaining ignore patterns and
tracking down performance problems at scale.

Other thanks go to the following projects:

* Celluloid <http://celluloid.io/>
* Cinch <https://github.com/cinchrb/cinch/>
* CouchDB <http://couchdb.apache.org/>
* Ember.js <http://emberjs.com/>
* Redis <http://redis.io/>
* Seesaw <https://github.com/ArchiveTeam/seesaw-kit>

6. Special thanks

Dragonette, Barnaby Bright, Vienna Teng, NONONO.

The memory hole of the Web has gone too far.
Don't look down, never look away; ArchiveBot's like the wind.

vim:ts=2:sw=2:tw=72:et