IRC log of dig on 2009-05-22
Timestamps are in UTC.
- 11:46:47 [DIGlogger]
- DIGlogger (n=dig-logg@groups.csail.mit.edu) has joined #dig
- 11:46:47 [pratchett.freenode.net]
- topic is: Decentralized Information Group @ MIT http://dig.csail.mit.edu/
- 11:46:47 [pratchett.freenode.net]
- Users on #dig: DIGlogger RalphS jsoltren drrho pheny Tristan ericP kennyluck sandro djbclark
- 11:47:38 [drrho]
- drrho has quit (Remote closed the connection)
- 11:49:10 [drrho]
- drrho (n=rho@chello213047112079.11.11.vie.surfer.at) has joined #dig
- 13:08:28 [fuming]
- fuming (n=fuming@30-16-242.dynamic.csail.mit.edu) has joined #dig
- 13:23:32 [lkagal]
- lkagal (n=lkagal@33.68.171.66.subscriber.vzavenue.net) has joined #dig
- 13:36:24 [danbri]
- danbri (n=danbri@85.144.208.21) has joined #dig
- 13:59:47 [danbri_]
- danbri_ (n=danbri@s5590d015.adsl.wanadoo.nl) has joined #dig
- 14:00:10 [danbri]
- danbri has quit (Read error: 110 (Connection timed out))
- 14:03:16 [drrho]
- drrho has quit (Remote closed the connection)
- 14:04:39 [drrho]
- drrho (n=rho@chello213047112079.11.11.vie.surfer.at) has joined #dig
- 14:35:44 [fuming]
- fuming has quit ()
- 14:36:18 [fuming]
- fuming (n=fuming@30-16-242.dynamic.csail.mit.edu) has joined #dig
- 14:38:06 [fuming]
- fuming has quit (Client Quit)
- 14:38:44 [fuming]
- fuming (n=fuming@30-16-242.dynamic.csail.mit.edu) has joined #dig
- 14:39:21 [fuming_]
- fuming_ (n=fuming@30-16-242.dynamic.csail.mit.edu) has joined #dig
- 14:55:13 [fuming]
- fuming has quit (Read error: 110 (Connection timed out))
- 14:55:23 [fuming_]
- fuming_ has quit (Read error: 110 (Connection timed out))
- 15:05:11 [danbri_]
- danbri_ is now known as danbri
- 15:21:07 [lkagal]
- lkagal has quit ()
- 16:01:39 [lkagal]
- lkagal (n=lkagal@30-6-179.wireless.csail.mit.edu) has joined #dig
- 16:14:00 [jsoltren]
- jsoltren has quit ("Leaving.")
- 16:31:16 [jsoltren]
- jsoltren (n=jsoltren@w3cdhcp13.w3.org) has joined #dig
- 16:32:18 [fuming]
- fuming (n=fuming@30-16-242.dynamic.csail.mit.edu) has joined #dig
- 16:40:14 [danbri]
- danbri has quit ("going back to danbri.org")
- 16:57:06 [lkagal]
- lkagal has quit ()
- 16:57:45 [jsoltren1]
- jsoltren1 (n=jsoltren@30-9-250.wireless.csail.mit.edu) has joined #dig
- 16:59:27 [lkagal]
- lkagal (n=lkagal@30-6-179.wireless.csail.mit.edu) has joined #dig
- 17:01:24 [lkagal]
- David's presentation starting in 32-G631 now ...
- 17:05:18 [sOpen]
- sOpen (n=ds@30-5-221.wireless.csail.mit.edu) has joined #dig
- 17:05:47 [jnpato]
- jnpato (n=jnp-irc@pool-173-76-205-53.bstnma.fios.verizon.net) has joined #dig
- 17:05:55 [sOpen]
- http://web.mit.edu/dsheets/Public/erachnid-slides.pdf
- 17:07:00 [fuming_]
- fuming_ (n=fuming@30-5-228.wireless.csail.mit.edu) has joined #dig
- 17:08:04 [jsoltren1]
- dsheets: Why do we run crawlers? Google - indexing services. Research services (YouTomb), archivers (Internet Archive), and Semantic Web data generators.
- 17:09:51 [jsoltren1]
- ... what sorts of challenges? Overlapping data sets, refreshing changed data sets, dynamic content from arguments to URLs, managing size and throttling, prioritizing important data (don't download what you can't index), system design issues.
- 17:12:21 [jsoltren1]
- ... Why Erlang? A learning example (chuckle). Some unique features: lightweight processes that are lighter than system threads or processes, near zero startup time. Shared-nothing concurrency using a process queue. Easily distributable since everything is based on message passing encapsulated in TCP (Design choice). Plus, there are lots of packages: OTP, mnesia [sic] database, ACID transactions across a network, the "perfect" database.
- 17:12:38 [jsoltren1]
- ... only about 600 lines of Erlang code for this crawler!
- 17:13:06 [jsoltren1]
- ... three basic components: persistent fetchers and queue servers, and extractors (only spawned when needed).
- 17:13:39 [jsoltren1]
- ... fetcher fetches data, spawns processes. Launched to pull down individual Web resources, which go into mnesia table.
- 17:14:17 [jsoltren1]
- ... queue server handles throttling, one domain per node. Extractor reads through binary data. Doesn't parse (malformed HTML?), just regexp searches.
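A minimal sketch of the regexp-only extraction described above, written in Python rather than Erlang. The pattern and the name `extract_links` are assumptions for illustration and are not taken from the crawler's code. It searches the raw bytes directly, so malformed markup does not stop it.

```python
import re

# Hypothetical pattern: a quoted href attribute anywhere in the raw bytes.
HREF_RE = re.compile(rb'href\s*=\s*["\']([^"\'>\s]+)["\']', re.IGNORECASE)

def extract_links(body: bytes) -> list[bytes]:
    """Return href targets in document order, without parsing the HTML."""
    return HREF_RE.findall(body)

# Deliberately malformed input: unclosed tags, mixed case, mixed quotes.
sample = b'<p><a href="http://dig.csail.mit.edu/">DIG</a><A HREF=\'/about\'>broken<b>'
links = extract_links(sample)
print(links)
```

A regexp pass like this will miss unquoted attributes and will also match hrefs inside comments or scripts; that is the usual trade-off for not parsing.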
- 17:14:55 [jsoltren]
- jsoltren has quit (Read error: 110 (Connection timed out))
- 17:15:45 [jsoltren1]
- ... for Semantic Web data: parse SW formats, add more metadata to mnesia database.
- 17:16:04 [jsoltren1]
- ... only need to modify extractor.
- 17:16:36 [jsoltren1]
- ... queue server: nested domains, more queues.
- 17:17:10 [jsoltren1]
- lkagal: could each queue be in a different node?
- 17:18:22 [jsoltren1]
- dsheets: fetcher only pops from local node, every node contains a different set of domain queues. I.e. mit.edu on node A. yahoo.com on node B. Throttling and everything for local representation of domain queues is handled by queue servers.
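A rough sketch of the domain-to-node assignment in the answer above, in Python. The table `DOMAIN_TO_NODE`, the node names, and the function `owning_node` are hypothetical; the notes do not say how the crawler actually stores or looks up this mapping.

```python
from typing import Optional
from urllib.parse import urlsplit

# Hypothetical static assignment, mirroring the example given in the notes.
DOMAIN_TO_NODE = {"mit.edu": "node_a", "yahoo.com": "node_b"}

def owning_node(url: str) -> Optional[str]:
    """Return the node whose queue server holds this URL's domain queue."""
    host = urlsplit(url).hostname or ""
    labels = host.split(".")
    # Try the full host first, then each parent domain: a.b.c, b.c, c.
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in DOMAIN_TO_NODE:
            return DOMAIN_TO_NODE[candidate]
    return None  # unassigned domain; how the crawler handles this is not stated

print(owning_node("http://dig.csail.mit.edu/"))
print(owning_node("http://www.yahoo.com/news"))
```

Because each fetcher only pops from its local queue server, a lookup like this would only be needed when a newly extracted URL has to be pushed to the node that owns its domain.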
- 17:18:39 [jsoltren1]
- ... how to make system faster? look at queue server.
- 17:19:47 [fuming]
- fuming has quit (Read error: 110 (Connection timed out))
- 17:20:20 [jsoltren1]
- ... preliminary testing with Linux boxes on MIT network. These speeds aren't amazing. Fairly steady rate but a third of what you can achieve on 1999 technology.
- 17:20:34 [jsoltren1]
- ... some things can run in C and run faster.
- 17:20:44 [jsoltren1]
- ... near linear speedup as you add more nodes.
- 17:20:55 [jsoltren1]
- Where was the bottleneck?
- 17:21:03 [jsoltren1]
- dsheets: Don't know. didn't look closely.
- 17:22:25 [jsoltren1]
- ... Mnesia really helped us, by fragmenting the tables. Very little crossover. If only going to a depth of 8, there is not that much between the different nodes.
- 17:22:38 [jsoltren1]
- ... embarrassingly parallel.
- 17:22:51 [fuming_]
- fuming_ has quit ()
- 17:23:43 [jsoltren1]
- ... for scaling, we need a better algorithm. No balancing at all. Can't add nodes on the fly and have them be re-assigned, though you could add new seed URLs to new nodes on the fly. You can split a node. You'd like this to happen automatically.
- 17:24:36 [jsoltren1]
- ... how would you extend Erlang? C, Python, Ruby, and the JVM can all emulate the Erlang node protocol and can send messages.
- 17:24:53 [jsoltren1]
- ... algorithms and strategies could be added. (First nontrivial Erlang program I wrote.)
- 17:25:12 [jsoltren1]
- ... data processing: pulls down lots of data but doesn't do anything with it. Need indexing to build a search engine, for example.
- 17:25:50 [jsoltren1]
- ... future work: I did things sanely at the time, but didn't go back. Made all functions tail recursive. We could do profiling and optimization, clean the code, and create a library of commonly used crawling features.
- 17:26:05 [jsoltren1]
- ... right now, basically compares URLs directly instead of comparing data.
- 17:26:18 [fuming]
- fuming (n=fuming@30-5-228.wireless.csail.mit.edu) has joined #dig
- 17:26:57 [jsoltren1]
- ... this is only the beginning. A production system would need far more attention. Lots of power here hidden behind cruft. Based on Prolog so nonstandard line terminators. I think you could build a real webcrawler with these tools.
- 17:27:25 [jsoltren1]
- ... queue server has three API functions: push, pop, split (for queue)
- 17:27:59 [jsoltren1]
- ... when starved for new URLs, extractor processes pull resources from database, pull URLs from queue server which waits for fetcher to push URLs out. Every 5s fetchers pop something new from queue server.
- 17:28:16 [jsoltren1]
- ... once the pipeline is going everything works fairly well.
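A minimal Python sketch of a queue server with the three API functions named above (push, pop, split) plus per-domain throttling. Only the three names come from the talk. The class layout, the `min_delay` parameter, the injectable `clock`, and the meaning given to `split` (hand a whole domain queue to another node) are assumptions.

```python
import time
from collections import deque

class QueueServer:
    """Holds one FIFO queue per domain for a single node (hypothetical sketch)."""

    def __init__(self, min_delay=1.0, clock=time.monotonic):
        self.min_delay = min_delay  # assumed seconds between fetches per domain
        self.clock = clock          # injectable so the throttle can be tested
        self.queues = {}            # domain -> deque of URLs
        self.next_ok = {}           # domain -> earliest time of the next pop

    def push(self, domain, url):
        self.queues.setdefault(domain, deque()).append(url)

    def pop(self):
        """Return a URL from any domain that is not throttled, else None."""
        now = self.clock()
        for domain, q in self.queues.items():
            if q and self.next_ok.get(domain, float("-inf")) <= now:
                self.next_ok[domain] = now + self.min_delay
                return q.popleft()
        return None

    def split(self, domain):
        """Remove a domain's queue and return its URLs for another node."""
        self.next_ok.pop(domain, None)
        return list(self.queues.pop(domain, ()))

fake_now = [0.0]
qs = QueueServer(min_delay=2.0, clock=lambda: fake_now[0])
qs.push("mit.edu", "http://mit.edu/a")
qs.push("mit.edu", "http://mit.edu/b")
qs.push("yahoo.com", "http://yahoo.com/")
first = qs.pop()   # mit.edu is now throttled until t = 2.0
second = qs.pop()  # so the next pop comes from yahoo.com
```

In this sketch a fetcher that gets `None` back would simply wait and poll again, which matches the periodic pop described in the notes.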
- 17:28:28 [jsoltren1]
- If I'm an extractor and see a new link, do I ask about it or just pull it?
- 17:28:40 [jsoltren1]
- s/extractor/fetcher/
- 17:28:53 [jsoltren1]
- ... fetcher just pulls down whatever.
- 17:29:12 [jsoltren1]
- Does the same thread facilitate de-duping requests and URLs?
- 17:29:41 [jsoltren1]
- dsheets: extractor does de-duping. you could put a trigger in there to increment something in matrix, but right now we just toss them.
- 17:29:59 [jsoltren1]
- ... extractor queries database. queues live in memory, resources live in database.
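A small Python sketch of de-duping by direct URL comparison, as described in the answers above. An in-memory set stands in for the database lookup the extractor performs; the function name `dedupe` and the sample URLs are hypothetical.

```python
def dedupe(candidates, seen):
    """Return URLs not seen before (exact string match) and record them.

    Duplicates are simply dropped, as in the talk; nothing is counted.
    """
    fresh = []
    for url in candidates:
        if url not in seen:
            seen.add(url)
            fresh.append(url)
    return fresh

seen = {"http://mit.edu/"}
found = ["http://mit.edu/", "http://mit.edu/a", "http://mit.edu/a", "http://MIT.edu/"]
fresh = dedupe(found, seen)
print(fresh)
```

The last sample URL survives even though it names the same host as the first, which shows the limit of comparing URL strings rather than normalized URLs or fetched data.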
- 17:30:27 [jsoltren1]
- ... can configure: RAM, disk, and disk-only copies. Disk copies guarantee normal persistence and transactional semantics.
- 17:31:28 [jsoltren1]
- When you say there is no shared data, what do you mean exactly?
- 17:32:13 [jsoltren1]
- dsheets: everything is message passing. Can ship across network or memory. Erlang/OTP provides: gen_server, gen_event, gen_fsm, which abstract away all message passing.
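A loose Python analogy for the message-passing style described above, using a thread with a mailbox queue. This is not how Erlang processes are implemented, and Python threads do share memory; the point is only that the counter's state is private to the worker and is reached through messages. The message names and `counter_process` are hypothetical.

```python
import queue
import threading

def counter_process(mailbox: queue.Queue) -> None:
    """Owns its own state; the only way to reach it is to send a message."""
    count = 0
    while True:
        msg, reply_to = mailbox.get()
        if msg == "incr":
            count += 1
        elif msg == "get":
            reply_to.put(count)  # answer by sending a message back
        elif msg == "stop":
            return

mailbox = queue.Queue()
worker = threading.Thread(target=counter_process, args=(mailbox,))
worker.start()

for _ in range(3):
    mailbox.put(("incr", None))
reply = queue.Queue()
mailbox.put(("get", reply))
total = reply.get(timeout=5)  # messages are handled in order, so this is 3
mailbox.put(("stop", None))
worker.join(timeout=5)
```

A behaviour such as gen_server wraps this receive loop and the reply plumbing so the caller sees something closer to an ordinary function call.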
- 17:33:34 [jsoltren1]
- ... could implement memory sharing using C but this introduces uptime issues. Erlang is really designed for real-time systems like phone switching, Jabber, email, and not web crawlers! But you can write one.
- 17:33:44 [jsoltren1]
- lkagal: Demo?
- 17:34:00 [jsoltren1]
- dsheets: sure.
- 17:34:11 [fuming]
- fuming has quit ()
- 17:34:38 [jsoltren1]
- ... warning! it's a demo! do not run for real, there are bugs!
- 17:37:09 [jsoltren1]
- Is code compiled or interpreted?
- 17:37:13 [jsoltren1]
- dsheets: either/or.
- 17:38:07 [fuming]
- fuming (n=fuming@30-5-228.wireless.csail.mit.edu) has joined #dig
- 17:38:09 [jnpato]
- jnpato has quit (Read error: 60 (Operation timed out))
- 17:41:11 [jnpato]
- jnpato (n=jnp-irc@30-6-47.wireless.csail.mit.edu) has joined #dig
- 17:42:15 [jsoltren1]
- ... does not follow robots.txt.
- 17:44:43 [jsoltren1]
- lalana: how would you customize to run on different nodes?
- 17:45:06 [jsoltren1]
- dsheets: different nodes, mnesia db for each node, queue server and fetcher, and share secret cookie. then everything is automatic.
- 17:45:15 [jsoltren1]
- ... epmd global name service.
- 17:45:26 [jsoltren1]
- fuming: where can we find all of the nodes?
- 17:46:03 [jsoltren1]
- lalana: a node is any computer, each with a different instance. wanted to see how hard it would be to get more machines and have them collaborate.
- 17:46:08 [fuming]
- fuming has quit ()
- 17:46:30 [jsoltren1]
- fuming: could you use something like this on amazon's ec2 service? (cloud computing)
- 17:46:42 [jsoltren1]
- dsheets: sure, but as is this isn't very efficient.
- 17:47:01 [jsoltren1]
- thank you!
- 17:47:02 [lkagal]
- lkagal has quit ()
- 17:47:09 [jsoltren1]
- lalana: More agenda items.
- 17:47:15 [jsoltren1]
- Is everyone here in the summer?
- 17:47:26 [jsoltren1]
- > Yes, meetings will be at 2:30p on Thursdays.
- 17:47:35 [jsoltren1]
- Meeting adjourned.
- 17:52:41 [fuming]
- fuming (n=fuming@dhcp-18-111-6-113.dyn.mit.edu) has joined #dig
- 17:54:33 [lkagal]
- lkagal (n=lkagal@30-6-179.wireless.csail.mit.edu) has joined #dig
- 17:58:15 [jsoltren]
- jsoltren (n=jsoltren@w3cdhcp13.w3.org) has joined #dig
- 18:03:11 [jnpato_]
- jnpato_ (n=jnp-irc@pool-173-76-205-53.bstnma.fios.verizon.net) has joined #dig
- 18:03:49 [sOpen]
- sOpen has quit (Read error: 110 (Connection timed out))
- 18:13:10 [jnpato]
- jnpato has quit (Read error: 110 (Connection timed out))
- 18:16:24 [jsoltren1]
- jsoltren1 has quit (Read error: 110 (Connection timed out))
- 18:28:39 [fuming]
- fuming has quit ()
- 18:38:09 [drrho]
- drrho has quit (Remote closed the connection)
- 19:11:11 [oshani]
- oshani (n=oshani@w3cdhcp4.w3.org) has joined #dig
- 19:11:40 [oshani]
- oshani has quit (Client Quit)
- 19:13:48 [jnpato_]
- jnpato_ has quit ()
- 19:46:57 [sOpen]
- sOpen (n=ds@SENIOR-FOUR-SIXTY-SEVEN.MIT.EDU) has joined #dig
- 20:07:05 [lkagal]
- lkagal has quit ()
- 20:11:07 [oshani]
- oshani (n=oshani@w3cdhcp4.w3.org) has joined #dig
- 20:21:07 [RalphS]
- RalphS has quit ("bye for today")
- 21:07:12 [fuming]
- fuming (n=fuming@dhcp-18-111-6-113.dyn.mit.edu) has joined #dig
- 21:10:01 [sOpen]
- sOpen has quit (Read error: 110 (Connection timed out))
- 21:13:05 [fuming]
- fuming has quit ()
- 21:28:02 [lkagal]
- lkagal (n=lkagal@33.68.171.66.subscriber.vzavenue.net) has joined #dig
- 21:35:45 [lkagal]
- lkagal has quit ()
- 22:18:18 [lkagal]
- lkagal (n=lkagal@33.68.171.66.subscriber.vzavenue.net) has joined #dig
- 22:22:27 [lkagal]
- lkagal has quit (Client Quit)
- 22:28:34 [Valdemarick]
- Valdemarick (n=valdemar@h66.15.102.166.static.ip.windstream.net) has joined #dig
- 22:28:39 [Valdemarick]
- Valdemarick has left #dig
- 22:30:30 [jsoltren]
- jsoltren has quit ("Leaving.")
- 22:41:25 [jsoltren]
- jsoltren (n=jsoltren@WILG-ONE-SIXTY.MIT.EDU) has joined #dig
- 23:07:22 [lkagal]
- lkagal (n=lkagal@33.68.171.66.subscriber.vzavenue.net) has joined #dig
- 23:09:27 [lkagal]
- lkagal has quit (Client Quit)