The goal is to extract the cleanest possible text from the beginning of an article, to serve Flipboard/Pulse-style applications that need to show the first snippet of a web article along with an image.
Goose will try to extract the following information:
This video introduces how we filter and extract the content used to build the interest graph. The demo lets you extract the content of an article URL of your choosing.
Goose was open sourced by Gravity.com in 2011 and is available on GitHub.
Developer:
Contributors:
The GitHub wiki (https://github.com/GravityLabs/goose/wiki) has the full details on how to use Goose.
Goose is available for free and released by Gravity.com under the Apache 2.0 license. See the LICENSE file for all the details.
If you find Goose useful or have issues, please drop me a line. I'd love to hear how you're using it or what features should be improved.
To use Goose from the command line:
cd into the goose directory
mvn compile
MAVEN_OPTS="-Xms256m -Xmx2000m" mvn exec:java -Dexec.mainClass=com.gravity.goose.TalkToMeGoose -Dexec.args="http://techcrunch.com/2011/05/13/native-apps-or-web-apps-particle-code-wants-you-to-do-both/" -e -q > ~/Desktop/gooseresult.txt
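If you run this often, the steps above could be wrapped in a small shell function. This is only a sketch: the name `goose_extract` and its default output file are hypothetical and not part of Goose. The Maven invocation itself is the one shown above, and the function assumes you are in the goose directory and have already run `mvn compile`.

```shell
# Hypothetical helper, not shipped with Goose.
# Usage: goose_extract URL [OUTPUT_FILE]
goose_extract() {
  local url="${1:-}"
  local out="${2:-gooseresult.txt}"   # assumed default output file name

  if [ -z "$url" ]; then
    echo "usage: goose_extract URL [OUTPUT_FILE]" >&2
    return 1
  fi

  # Placing the assignment directly before mvn (no semicolon) passes
  # MAVEN_OPTS into the environment of this one invocation.
  MAVEN_OPTS="-Xms256m -Xmx2000m" mvn exec:java \
    -Dexec.mainClass=com.gravity.goose.TalkToMeGoose \
    -Dexec.args="$url" -e -q > "$out"
}
```

For example, `goose_extract "http://techcrunch.com/2011/05/13/native-apps-or-web-apps-particle-code-wants-you-to-do-both/" ~/Desktop/gooseresult.txt` writes the extraction result to the same file as the command above.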
Goose was originally written in Java but was converted to a Scala project in August of 2011.
Here are some of the reasons for the port to Scala:
It was a fairly quick Java-to-Scala port, so many of the niceties of the Scala language aren't in the codebase yet, but those will arrive over the coming months as we rewrite a lot of the internal methods to be more Scala-esque. We made sure it is still easy to use from Java as well, so if you're using Goose from Java you should still be able to use it with a few changes to the method signatures.