Search engine research

From Liebel-lab

Bioinformatic Harvester IV integrating knowledge (Kindler / Liebel)


"Bioinformatic Harvester IV" is a one-stop portal for major protein/gene resources.
Currently more than 36 databases are integrated and cross-linked.
A convenient "Google-like" search interface allows "real-time" data queries.

New Harvester IV features

  • tabbed navigation allows convenient browsing
    • new categories: overview, BLAST, expression, gene-view, global, literature, networks, products, protein domains, sequence, special, uniprot
  • new ultra-fast search engine
  • a web-based maintenance system allows new databases to be integrated in under 5 minutes
  • allows gene/protein/sequence information to be cross-linked

Existing Harvester III features

  • Integrates human, mouse, rat, zebrafish, arabidopsis and drosophila information
  • Harvested information is always up to date (via "iframe" technology)
  • Harvester serves ~10,000 pages to the scientific community every day
  • Modular design: Novel databases are continuously integrated
  • Static HTML pages allow easy search engine integration
  • Static HTML pages allow easy project integration and collaborations
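
The combination of static HTML pages and "iframe" embedding described above can be illustrated with a minimal sketch. It assumes one static page per identifier, with each source database loaded live in its own iframe. The source names, URL templates and the function `render_static_page` are hypothetical placeholders, not the databases or code Harvester actually uses.

```python
from html import escape
from urllib.parse import quote

# Hypothetical source list: names and URL templates are placeholders,
# not the resources Harvester integrates.
SOURCES = [
    ("Example protein DB", "https://proteins.example.org/entry/{id}"),
    ("Example literature DB", "https://literature.example.org/search?q={id}"),
]

def render_static_page(identifier, sources=SOURCES):
    """Return a static HTML page that embeds each source's live page in an iframe."""
    frames = []
    for name, template in sources:
        # The page itself is static; the iframe content is fetched when viewed,
        # so the embedded information is as current as the source database.
        url = template.format(id=quote(identifier))
        frames.append(
            f"<h2>{escape(name)}</h2>\n"
            f'<iframe src="{escape(url, quote=True)}" width="100%" height="400"></iframe>'
        )
    title = escape(identifier)
    return (
        f"<html><head><title>{title}</title></head>\n"
        f"<body><h1>{title}</h1>\n" + "\n".join(frames) + "\n</body></html>"
    )

page = render_static_page("P12345")
```

Because the generated file is plain HTML, it can be crawled by external search engines or linked from other projects without any server-side logic.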

Harvester Publication: Liebel U, Kindler B, Pepperkok R. 'Harvester': a fast meta search engine of human protein resources. Bioinformatics. 2004 Aug 12;20(12):1962-3. Epub 2004 Feb 26.


YaCy Sciencenet - Search ~300,000,000 scientific webpages/documents (Christen / Luetjohann / Liebel)


  • YaCy-Sciencenet is a distributed search engine prototype based on peer-to-peer technology (Michael Christen et al.).

Instead of copying the internet to a data center (the Google approach),
YaCy peers (= PC + YaCy software) share data among the data-serving machines themselves.
Each YaCy PC communicates with the network of distributed peers only when data is queried.
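
The idea can be sketched as a toy model: every peer keeps an index of its own documents, and peers are contacted only when a query arrives. This is a deliberately simplified illustration, not YaCy's actual implementation; the classes and function names (`Peer`, `network_search`) are made up for this example.

```python
from collections import defaultdict

class Peer:
    """A toy peer holding an inverted index of its own documents only."""

    def __init__(self, name):
        self.name = name
        self.index = defaultdict(set)  # term -> set of document URLs

    def add_document(self, url, text):
        # Indexing is purely local: no other peer is contacted here.
        for term in set(text.lower().split()):
            self.index[term].add(url)

    def local_search(self, terms):
        # Return the documents on this peer that contain every query term.
        if not terms:
            return set()
        hits = [self.index.get(term, set()) for term in terms]
        return set.intersection(*hits)

def network_search(peers, query):
    """Ask each peer at query time and merge the answers."""
    terms = query.lower().split()
    results = set()
    for peer in peers:
        results |= peer.local_search(terms)
    return results

peer_a, peer_b = Peer("peer-a"), Peer("peer-b")
peer_a.add_document("http://example.org/1", "zebrafish protein expression")
peer_b.add_document("http://example.org/2", "mouse protein domains")
results = network_search([peer_a, peer_b], "protein")
```

In this sketch no central copy of the data exists; the merged result set is assembled from the peers' answers for each individual query.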

  • Features of YaCy search engines:
    • Allows indexing/searching anything from small websites to large scale data repositories
    • Indexes text, images, DOC, PDF, PPT, etc. (200 document types)
    • Holds between 1 and 100 million web pages or documents per peer
    • Tested from a single-PC installation up to clusters of ~100 low-cost PCs
    • Requires no cooling racks (standard PCs)
    • Requires less power (compared to 1U rack server units)
  • Challenge/goal: An independent and open search engine for scientific content.
  • Status: "sciencenet-network" ~300 million documents; "freeworld-network" ~10^9 documents

Plexus: BioInterfaces large-scale data integration platform (Luetjohann, Trunov, Wezel, Christen, Liebel)

The BioInterfaces programme generates a vast amount of data in different fields. Hundreds of TBytes (and soon PBytes) and millions of files need to be stored and processed. Flexible and transparent access to all data sets is key for most BioInterfaces projects.
Instead of using a single "super-base", we are developing a network of flexible data nodes.
All data nodes (image-based, sequence-based, NMR-based, etc.) will be indexed via the distributed search engine YaCy.
This architecture allows us to easily integrate new data projects, test new software versions or simply compare different data sources WITHOUT changing a database model.
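
A minimal sketch of this idea, under stated assumptions: each data node keeps records in its own format and only supplies a function that turns a record into searchable text, so a shared index can be built without a common database model. The `DataNode` class, the `build_index` function and the sample records are hypothetical and do not describe the real Plexus or YaCy code.

```python
from collections import defaultdict

class DataNode:
    """A toy data node: node-specific records plus a text-description function."""

    def __init__(self, name, records, describe):
        self.name = name
        self.records = records    # dict: record id -> record in any format
        self.describe = describe  # function: record -> searchable text

    def documents(self):
        for record_id, record in self.records.items():
            yield (self.name, record_id), self.describe(record)

def build_index(nodes):
    """Build one term index over all nodes; no shared record schema is needed."""
    index = defaultdict(set)
    for node in nodes:
        for key, text in node.documents():
            for term in text.lower().split():
                index[term].add(key)
    return index

# Two nodes with unrelated record layouts (a dict and a tuple).
images = DataNode(
    "images",
    {"img1": {"stain": "GFP", "organism": "zebrafish"}},
    lambda r: f"{r['stain']} {r['organism']}",
)
sequences = DataNode(
    "sequences",
    {"seq9": ("zebrafish", "ATGGCC")},
    lambda r: f"{r[0]} sequence",
)
index = build_index([images, sequences])
hits = sorted(index["zebrafish"])

# Adding a third node type requires no change to the existing nodes.
nmr = DataNode("nmr", {"n1": {"sample": "zebrafish extract"}}, lambda r: r["sample"])
extended_hits = sorted(build_index([images, sequences, nmr])["zebrafish"])
```

The only contract between a node and the index is the text each record exposes, which is what lets new data projects be plugged in or swapped out independently.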

  • "Plexus" can be searched via a custom YaCy Search engine interface.
    • YaCy is a distributed search engine (http://yacy.net) which uses the advantages of peer-to-peer technology.
    • A simple Google-like search interface will give the user convenient access to all data nodes.
    • Individual data viewers (e.g. OME OMERO for microscopy data) allow the user to browse the individual data sets.
  • Plexus
    • Currently "Plexus" is running in a prototype environment, indexing some 300 million scientific documents on 25 standard Linux PCs
    • Feel free to try the YaCy-Sciencenet prototype: http://sciencenet.kit.edu