Spark powered wikipedia analysis and exploration

Guillaume Pitel
Hi Spark users,

I'm not sure this is the right place to announce it, but Spark has a new publicly visible use case: a demo we have put online here:

It lets you explore the English Wikipedia with a few added benefits from our proprietary semantic and relation analysis method: you can see similar pages (based on text content or on links), the most relevant words for a page, and more.
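For readers unfamiliar with what "similar pages based on text content" can mean in practice, here is a generic illustration only. Our actual method is proprietary and is not what is shown here; this minimal sketch swaps in plain cosine similarity over sparse word-weight vectors, and the function name, the dictionary representation, and the toy weights are all assumptions made for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors stored as {word: weight} dicts.

    Generic textbook measure, used here purely as an illustration;
    it is NOT the proprietary method behind the demo.
    """
    # Iterate over the smaller dict so the dot product touches fewer entries.
    if len(a) > len(b):
        a, b = b, a
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty page is similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical toy pages with made-up word weights.
page_spark = {"spark": 2.0, "cluster": 1.0, "rdd": 1.0}
page_hadoop = {"hadoop": 2.0, "cluster": 1.0, "mapreduce": 1.0}
page_cheese = {"cheese": 3.0, "milk": 1.0}

similarity_related = cosine_similarity(page_spark, page_hadoop)
similarity_unrelated = cosine_similarity(page_spark, page_cheese)
```

With these toy weights, the two pages that share the word "cluster" get a positive score, while the pages with no word in common score zero.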

Spark is used both to process the English Wikipedia and to run the computation. Three iterations of our method over the whole matrix of 4.4M documents × 2.1M words take about 30 minutes on a smallish cluster of 7 nodes with 4 cores and 32 GB of RAM.
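To put those figures in perspective, here is a small back-of-the-envelope calculation. It is a rough sketch under two assumptions that are not stated above: that the 4 cores and 32 GB are per node, and that a dense representation would use 8-byte floating-point values. The variable names are just for illustration.

```python
# Figures quoted in the message above.
NODES = 7
CORES_PER_NODE = 4        # assumption: the "4 cores" figure is per node
RAM_GB_PER_NODE = 32      # assumption: the "32 GB" figure is per node
DOCUMENTS = 4_400_000
WORDS = 2_100_000
ITERATIONS = 3
TOTAL_MINUTES = 30

# Aggregate cluster resources.
total_cores = NODES * CORES_PER_NODE
total_ram_gb = NODES * RAM_GB_PER_NODE

# Average wall-clock time per iteration.
minutes_per_iteration = TOTAL_MINUTES / ITERATIONS

# Size of the documents x words matrix if it were stored densely.
matrix_cells = DOCUMENTS * WORDS
BYTES_PER_VALUE = 8       # assumption: 64-bit floats
dense_size_tb = matrix_cells * BYTES_PER_VALUE / 1e12

print(f"{total_cores} cores, {total_ram_gb} GB RAM in total")
print(f"{minutes_per_iteration:.0f} minutes per iteration")
print(f"{matrix_cells:.3e} cells, {dense_size_tb:.2f} TB if stored densely")
```

Under these assumptions the dense matrix would be far larger than the cluster's total memory, so the matrix can only be handled in a sparse form, where only the words that actually occur in each document are stored.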

Any feedback is welcome (except on the aesthetics; we already know the UI is really bad).

Enjoy exploring Wikipedia in your spare time :)

Guillaume PITEL, Président
+33(0)6 25 48 86 80

eXenSa S.A.S.
41, rue Périer - 92120 Montrouge - FRANCE
Tel +33(0)1 84 16 36 77 / Fax +33(0)9 72 28 37 05