Keyword extractor source code: where can I find it?
9 Message(s) by 2 Author(s) originally posted in java machine
| From: giugy |
Date: Thursday, January 11, 2007
|
Hi, sorry for my English, I do not speak it very well.
Does anyone know where I can find the source code of a keyword extractor
written in Java? I mean software that analyzes a text and extracts its
keywords, i.e. the most frequent words in the text (for example, the word
"hello" appears forty times, the word "thanks" appears thirty times).
I need to see the source code, written in Java, in order to understand
how it works.
Thanks, bye
| From: glen herrmannsfeldt |
Date: Thursday, January 11, 2007
|
giugy wrote in message:
> Does anyone know where I can find the source code of a keyword extractor
> written in Java? I mean software that analyzes a text and extracts its
> keywords, i.e. the most frequent words in the text (for example, the word
> "hello" appears forty times, the word "thanks" appears thirty times).
> I need to see the source code, written in Java, in order to understand
> how it works.
It is very easy to write in Java.
First read a line and extract words using StringTokenizer. Then
use a Hashtable to find out if you've seen that word before.
If so, increment a counter. If not, add it to the Hashtable with
a count of 1. I store a long[] in the Hashtable for convenience
in incrementing, but others will do something different.
One trick, though. After you extract words with StringTokenizer and
find they aren't in the table, create a new String to store the
reference in the hash table. If you do not, it will take up too much
memory, as the whole line of characters is stored for each word.
After you finish reading the file, go through the Hashtable,
extract words and counts, and print them out.
It shouldn't take long at all to write.
-- glen
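The steps described above can be sketched roughly as follows. This is only an illustration under some assumptions, not glen's actual code: the class name WordCountSketch and the method countWords are made up for the example, a HashMap stands in for the Hashtable (either would work the same way here), and the input is a list of lines rather than a file so the sketch stays self-contained.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.StringTokenizer;

class WordCountSketch {
    // Counts how often each whitespace-separated word occurs in the lines.
    // The count lives in a one-element long[] so it can be incremented
    // in place without putting a new value into the map each time.
    static Map<String, long[]> countWords(Iterable<String> lines) {
        Map<String, long[]> counts = new HashMap<>();
        for (String line : lines) {
            StringTokenizer tokens = new StringTokenizer(line);
            while (tokens.hasMoreTokens()) {
                String word = tokens.nextToken();
                long[] counter = counts.get(word);
                if (counter == null) {
                    // First occurrence: store a fresh copy of the word,
                    // as suggested in the post, with a count of 1.
                    counts.put(new String(word), new long[] {1});
                } else {
                    counter[0]++;
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, long[]> counts = countWords(
                Arrays.asList("java function java", "library function java"));
        // Print "count word" pairs; HashMap iteration order is unspecified.
        for (Map.Entry<String, long[]> e : counts.entrySet()) {
            System.out.println(e.getValue()[0] + " " + e.getKey());
        }
    }
}
```

The `new String(word)` copy follows the memory advice in the post. As far as I know, it mattered on older JVMs where a token could share the character array of the whole line; on more recent JVMs the copy is harmless but probably unnecessary.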
| From: giugy |
Date: Tuesday, January 16, 2007
|
Yes, I've found some code like this:
import java.io.*;
import java.util.*;
class Counter implements Comparable {
    private String word;
    private int count;
    public Counter(String word) {
        this.word = word;
        count = 1;
    }
    public void increment() { count++; }
    public String toString() {
        return "\n" + word + " [" + count + "]";
    }
    public boolean equals(Object obj) {
        return obj instanceof Counter &&
            ((Counter)obj).word.equals(word);
    }
    public int hashCode() {
        return word.hashCode();
    }
    public int compareTo(Object o) {
        return word.compareTo(((Counter)o).word);
    }
}
class CounterSet extends AbstractSet {
    private Map set = new TreeMap();
    public void addOrIncrement(String s) {
        Counter c = new Counter(s);
        if (set.containsKey(c))
            ((Counter)set.get(c)).increment();
        else
            set.put(c, c);
    }
    public Iterator iterator() {
        return set.keySet().iterator();
    }
    public int size() {
        return set.size();
    }
    public String toString() {
        return set.keySet().toString();
    }
}
class WordCount {
    private FileReader file;
    private StreamTokenizer st;
    private CounterSet counts = new CounterSet();
    WordCount(String filename)
        throws FileNotFoundException {
        try {
            file = new FileReader(filename);
            st = new StreamTokenizer(
                new BufferedReader(file));
            st.ordinaryChar('.');
            st.ordinaryChar('-');
            st.lowerCaseMode(true);
        } catch(FileNotFoundException e) {
            System.err.println(
                "Couldn't open " + filename);
            throw e;
        }
    }
    void cleanup() {
        try {
            file.close();
        } catch(IOException e) {
            System.err.println(
                "file.close() unsuccessful");
        }
    }
    void countWords() {
        try {
            while(st.nextToken() !=
                StreamTokenizer.TT_EOF) {
                // Placeholder; too short to pass the length filter below
                String s = "a";
                switch(st.ttype) {
                    case StreamTokenizer.TT_EOL:
                        s = new String("EOL");
                        break;
                    case StreamTokenizer.TT_NUMBER:
                        // Numbers are skipped: s keeps its placeholder value
                        // s = Double.toString(st.nval);
                        break;
                    case StreamTokenizer.TT_WORD:
                        s = st.sval;
                        break;
                    default: // single character in ttype
                        s = String.valueOf((char)st.ttype);
                }
                // Only count tokens longer than three characters
                if(s.length() > 3)
                    counts.addOrIncrement(s);
            }
        } catch(IOException e) {
            System.err.println(
                "st.nextToken() unsuccessful");
        }
    }
    public Iterator iterator() {
        return counts.iterator();
    }
    public String toString() {
        return counts.toString();
    }
}
public class KeyWordExtractor {
    public static void main(String[] args)
        throws FileNotFoundException {
        for(int i = 0; i < args.length; i++) {
            WordCount wc = new WordCount(args[i]);
            wc.countWords();
            System.out.println("WORD = " + wc);
            wc.cleanup();
        }
    }
}
and it gives me the number of occurrences of every word in the text. For
example, if I give as input a text like (a silly example) "java function
java library function java", the output is WORD = [function[2] ,
java[3] , library[1]], i.e. the occurrences of each word in the text.
My problem is that I don't want all the words of the text in the output,
only the words that appear many times in the text. In this case that is
java, the keyword of the text: WORD = [java]
I know there is only a little code left to write, but I don't know Java
well and I can't manage to write it!
Please help me. Thanks!
| From: glen herrmannsfeldt |
Date: Wednesday, January 17, 2007
|
giugy wrote in message:
> Yes, I've found a code like this....
>
> import java.io.*;
> import java.util.*;
>
> class Counter implements Comparable {
> private String word;
> private int count;
> public Counter(String word) {
> this.word = word;
> count = 1;
> }
> public void increment() { count++; }
> public String toString() {
> return "\n" + word + " [" + count + "]";
Change this to:
return count=" "+word;
Then the output will have a list of count followed by word, and
can be input to the unix command
sort -rn unsortedfile > sortedfile
which will output the list with the most common word first.
(snip)
-- glen
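The same "most common word first" ordering can also be produced inside Java instead of piping the output through sort. The following is a minimal sketch, assuming Java 8 or later and assuming the counts are already available as a Map from word to Integer; the names TopWordsSketch and topWords are hypothetical and do not come from the code posted in this thread.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class TopWordsSketch {
    // Returns up to n words, most frequent first.
    // Words with equal counts are ordered alphabetically.
    static List<String> topWords(Map<String, Integer> counts, int n) {
        List<Map.Entry<String, Integer>> entries =
                new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            // Descending by count, then ascending by word
            int byCount = b.getValue().compareTo(a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> result = new ArrayList<>();
        for (int i = 0; i < Math.min(n, entries.size()); i++) {
            result.add(entries.get(i).getKey());
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<>();
        counts.put("function", 2);
        counts.put("java", 3);
        counts.put("library", 1);
        System.out.println("WORD = " + topWords(counts, 1)); // prints WORD = [java]
    }
}
```

How many words to keep (here n = 1) is a choice left to the caller; a minimum-count threshold would be an equally reasonable way to decide what counts as a keyword.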
| From: giugy |
Date: Wednesday, January 17, 2007
|
Sorry, maybe I'm making a stupid mistake: if I change
return "\n" + word + " [" + count + "]";
to
return count=" "+word;
I get an error like "found: java.lang.String required: int",
because count is an int and word is a String, and the function must
return a String. What can I do?
| From: glen herrmannsfeldt |
Date: Wednesday, January 17, 2007
|
giugy wrote in message:
> Sorry, maybe I'm making a stupid mistake: if I change
> return "\n" + word + " [" + count + "]";
> to
> return count=" "+word;
> I get an error like "found: java.lang.String required: int",
Sorry, it was supposed to say return count+" "+word;
In both the original and this one, the int is converted to String.
By the way, you do not need to post three times for us to read it.
-- glen
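A small sketch of the corrected line may make the difference clearer. The class name ConcatSketch and the method format are invented for this illustration; only the expression count + " " + word comes from the thread.

```java
class ConcatSketch {
    static String format(int count, String word) {
        // With '+', the int operand is converted to its decimal text
        // because the other operand is a String.
        // Writing count = " " + word instead would try to assign a
        // String to the int variable count, which does not compile.
        return count + " " + word;
    }

    public static void main(String[] args) {
        System.out.println(format(3, "java")); // prints: 3 java
    }
}
```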