Feb
01

Story One

A. A man dying of hunger staggered to a farmer’s house. The farmer gave him a steamed bun and some water, and the man survived. Afterwards, he discovered that the farmer was as poor as a church mouse. Moved to tears, he knelt for a long time to express his gratitude. For the rest of his life, he kept helping others in need.

B. A man dying of hunger staggered to a farmer’s house. The farmer gave him a steamed bun and some water, and the man survived. However, the man discovered that the farmer was rich and had plenty of delicious food in the living room. Instead of expressing gratitude, he felt angry because he thought he had been treated badly. He then grabbed a knife and killed the farmer.

Thought 1: Why does the same action lead to different results? If you look at what you received, you will be grateful; if you look at what you did not receive, you will hold a grudge.

Thought 2: Whether one feels grateful or angry depends on how one sees what was done, not on what was done. In principle, the dying man should express gratitude in both case A and case B. In reality, however, only a rational man can do this.

Thought 3: Apart from gratitude and anger, is there a third state of mind?

Two

A. A man dying of hunger staggered to a farmer’s house. The farmer gave him a steamed bun and some water, and the man survived. Later, people found out that the man he had saved was a corrupt official wanted by the law. Many people then hated the farmer because he had helped a bad man.

B. A man dying of hunger staggered to a farmer’s house. The farmer gave him a steamed bun and some water, and the man survived. Later, people found out that the man he had saved was very kind and had helped many poor people. So everybody praised the farmer for his great kindness.

Thought 1: If we show charity according to who a person is rather than whether he is in need, everybody will apply his own standard, and helping the wrong person will bring punishment. Before long, nobody will want to do charity.

Thought 2: If people praise the farmer in both cases, the world will be full of love. Otherwise, we will be surrounded by wariness and snobbery.

Three

A. A man dying of hunger staggered to a farmer’s house. This farmer pointed to another farmer’s home and said: “Look, that family has given you food before; please go to them.” When the starving man staggered to that farmer’s home and begged for food, the farmer said regretfully, “We have no food for you.” Later, everybody in the town criticized the farmer who used to help the starving man, but no one criticized the first farmer.

B. A man dying of hunger staggered to a farmer’s house. This farmer pointed to another farmer’s home and said: “Look, that family has given you food before; please go to them.” When the starving man staggered to that farmer’s home and begged for food, the farmer said this time, “We only have half a steamed bun for you.” Later, everybody in the town laughed at the farmer for his stinginess.

Thought 1: People always expect more of a man who has shown love in the past.

Thought 2: Since we expect little of a selfish man, he does not disappoint us.

Thought 3: This kind of expectation drives out charity.

Four

A: A man dying of hunger staggered to a farmer’s house. The farmer gave him a steamed bun and some water, and the man survived. Afterwards, he discovered that the farmer was as poor as a church mouse. Moved to tears, he knelt for a long time to express his gratitude. For the rest of his life, he kept helping others in need. In addition, he supported every decision the farmer made, whether it was right or wrong. This made the farmer arrogant, and a wrong decision eventually bankrupted him.

B: A man dying of hunger staggered to a farmer’s house. The farmer gave him a steamed bun and some water, and the man survived. Afterwards, he discovered that the farmer was as poor as a church mouse. Moved to tears, he knelt for a long time to express his gratitude. For the rest of his life, he kept helping others in need. However, he also told the farmer the right thing to do, instead of flattering him to repay the favor.

Thought 1: Different ways of repaying gratitude lead to different results.

Thought 2: You should keep thinking in the right way all the time. Telling the truth and offering criticism are always the best means.

Jan
25

1. Interesting highlighting in search results or snippets.

2. Synonym expansion (query expansion)

3. Social Search in Google labs

4. Google Squared

Extracts interesting facts from web pages and presents them in a meaningful way.

Nov
29

vi/vim command summary


The following tables contain all the basic vi commands.
Starting vi

Command               Description
vi file               start at line 1 of file
vi +n file            start at line n of file
vi + file             start at last line of file
vi +/pattern file     start at pattern in file
vi -r file            recover file after a system crash

Saving files and quitting vi

Command      Description
:e file      edit file (save current file with :w first)
:w           save (write out) the file being edited
:w file      save as file
:w! file     save as an existing file
:q           quit vi
:wq          save the file and quit vi
:x           save the file if it has changed and quit vi
:q!          quit vi without saving changes

Moving the cursor

Keys pressed     Effect
h                left one character
l or <Space>     right one character
k                up one line
j or <Enter>     down one line
b                left one word
w                right one word
(                start of sentence
)                end of sentence
{                start of paragraph
}                end of paragraph
1G               top of file
nG               line n
G                end of file
<Ctrl>W          first character of insertion
<Ctrl>U          up ½ screen
<Ctrl>D          down ½ screen
<Ctrl>B          up one screen
<Ctrl>F          down one screen

Inserting text

Keys pressed     Text inserted
a                after the cursor
A                after last character on the line
i                before the cursor
I                before first character on the line
o                open line below current line
O                open line above current line

Changing and replacing text

Keys pressed     Text changed or replaced
cw               word
3cw              three words
cc               current line
5cc              five lines
r                current character only
R                current character and those to its right
s                current character
S                current line
~                switch between lowercase and uppercase

Deleting text

Keys pressed     Text deleted
x                character under cursor
12x              12 characters
X                character to left of cursor
dw               word
3dw              three words
d0               to beginning of line
d$               to end of line
dd               current line
5dd              five lines
d{               to beginning of paragraph
d}               to end of paragraph
:1,.d            to beginning of file
:.,$d            to end of file
:1,$d            whole file

Using markers and buffers

Command     Description
mf          set marker named “f”
`f          go to marker “f”
'f          go to start of line containing marker “f”
"s12yy      copy 12 lines into buffer “s”
"ty}        copy text from cursor to end of paragraph into buffer “t”
"ly1G       copy text from cursor to top of file into buffer “l”
"kd`f       cut text from cursor up to marker “f” into buffer “k”
"kp         paste buffer “k” into text

Searching for text

Search       Finds
/and         next occurrence of “and”, for example, “and”, “stand”, “grand”
?and         previous occurrence of “and”
/^The        next line that starts with “The”, for example, “The”, “Then”, “There”
/^The\>      next line that starts with the word “The”
/end$        next line that ends with “end”
/[bB]ox      next occurrence of “box” or “Box”
n            repeat the most recent search, in the same direction
N            repeat the most recent search, in the opposite direction

Searching for and replacing text

Command                      Description
:s/pear/peach/g              replace all occurrences of “pear” with “peach” on current line
:/orange/s//lemon/g          change all occurrences of “orange” into “lemon” on next line containing “orange”
:.,$s/\<file/directory/g     replace all words starting with “file” by “directory” on every line from current line onward, for example, “filename” becomes “directoryname”
:g/one/s//1/g                replace every occurrence of “one” with 1, for example, “oneself” becomes “1self”, “someone” becomes “some1”
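The effect of these substitutions can be checked outside the editor. The sketch below uses Python's re module as a stand-in for vi, which is an assumption of this example: Python writes the start-of-word anchor as \b where vi uses \<, and the sample strings are made up.

```python
import re

# :s/pear/peach/g  -> every "pear" on the line becomes "peach"
fruit = re.sub(r"pear", "peach", "pear pie and pear jam")

# :.,$s/\<file/directory/g -> only words that START with "file" change;
# vi's \< is approximated here with Python's \b word boundary
words = re.sub(r"\bfile", "directory", "filename profile file")

# :g/one/s//1/g -> "one" is replaced even in the middle of a word
digits = re.sub(r"one", "1", "oneself someone")

print(fruit)   # peach pie and peach jam
print(words)   # directoryname profile directory
print(digits)  # 1self some1
```

Note that “profile” is left alone in the second case because “file” does not start the word there.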

Matching patterns of text

Expression                   Matches
.                            any single character
*                            zero or more of the previous expression
.*                           zero or more arbitrary characters
\<                           beginning of a word
\>                           end of a word
\                            quote a special character
\*                           the character “*”
^                            beginning of a line
$                            end of a line
[set]                        one character from a set of characters
[XYZ]                        one of the characters “X”, “Y”, or “Z”
[[:upper:]][[:lower:]]*      one uppercase character followed by any number of lowercase characters
[^set]                       one character not from a set of characters
[^XYZ[:digit:]]              any character except “X”, “Y”, “Z”, or a numeric digit

Options to the :set command

Option         Effect
all            list settings of all options
ignorecase     ignore case in searches
list           display <Tab> and end-of-line characters
mesg           display messages sent to your terminal
nowrapscan     prevent searches from wrapping round the end or beginning of a file
number         display line numbers
report=5       warn if five or more lines are changed by command
term=ansi      set terminal type to “ansi”
terse          shorten error messages
warn           display “[No write since last change]” on shell escape if file has not been saved

Sep
20

Download Whole Website or Directories by using wget in Linux

You might have googled for software that downloads a whole website or directory on Windows or Linux. Yes, plenty of tools can do this for you. On Linux, though, a single command, wget, is enough. It is highly customizable and works as a capable crawler. You will find it fantastic and really cool. Let me show you how!

wget \
    --recursive \
    --no-clobber \
    --page-requisites \
    --html-extension \
    --convert-links \
    --restrict-file-names=windows \
    --domains techstroke.com \
    --no-parent \
    www.techstroke.com/Windows/

The command above downloads the “Windows” directory of techstroke.com recursively, starting from the URL www.techstroke.com/Windows/.

How do you like it? Hah, really cool?

Finally, let me explain a bit more about the parameters. Of course, you can refer to its documentation.

The options are:

--recursive: download the entire Web site.

--domains techstroke.com: don’t follow links outside techstroke.com.

--no-parent: don’t follow links outside the directory /Windows/.

--page-requisites: get all the elements that compose the page (images, CSS and so on).

--html-extension: save files with the .html extension.

--convert-links: convert links so that they work locally, off-line.

--restrict-file-names=windows: modify filenames so that they will work in Windows as well.

--no-clobber: don’t overwrite any existing files (used in case the download is interrupted and resumed).

Sep
11
 
 


 
 

via LingPipe Blog by lingpipe on 9/9/09

Bayesian Inference is Based on Probability Models

Bayesian models provide full probability distributions over both observable data and unobservable model parameters. Bayesian statistical inference is carried out using standard probability theory.

What’s a Prior?

The full Bayesian probability model includes the unobserved parameters. The marginal distribution over parameters is known as the “prior” parameter distribution, as it may be computed without reference to observable data. The conditional distribution over parameters given observed data is known as the “posterior” parameter distribution.

Non-Bayesian Statistics

Non-Bayesian statisticians eschew probability models of unobservable model parameters. Without such models, non-Bayesians cannot perform probabilistic inferences available to Bayesians, such as defining the probability that a model parameter (such as the mean height of an adult male American) is in a defined range (say 5′6″ to 6′0″).

Instead of modeling the posterior probabilities of parameters, non-Bayesians perform hypothesis testing and compute confidence intervals, the subtleties of interpretation of which have confused introductory statistics students for decades.

Bayesian Technical Apparatus

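The sampling distribution p(y|\theta) models the probability of observable data y given unobservable model parameters \theta.

The prior distribution p(\theta) models the probability of the parameters \theta.

The full joint distribution over parameters and data is computed with the chain rule, p(y,\theta) = p(y|\theta) \, p(\theta).

The posterior distribution p(\theta|y) of the parameters given the observed data is derived from the sampling and prior distributions via Bayes’s rule,

p(\theta|y) = \frac{p(y|\theta) \, p(\theta)}{p(y)} = \frac{p(y|\theta) \, p(\theta)}{\int_{\Theta} p(y|\theta') \, p(\theta') \, d\theta'}

The posterior predictive distribution p(\tilde{y}|y) for new data \tilde{y} given observed data y is the average of the sampling distribution over parameters, weighted by their posterior probability,

p(\tilde{y}|y) = \int_{\Theta} p(\tilde{y}|\theta) \, p(\theta|y) \, d\theta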

The key feature is the incorporation into predictive inference of the uncertainty in the posterior parameter estimate. In particular, the posterior predictive distribution is an overdispersed variant of the sampling distribution. The extra dispersion arises by integrating over the posterior.

Conjugate Priors

Conjugate priors, where the prior and posterior are drawn from the same family of distributions, are convenient but not necessary. For instance, if the sampling distribution is binomial, a beta-distributed prior leads to a beta-distributed posterior. With a beta posterior and binomial sampling distribution, the predictive posterior distribution is beta-binomial, the overdispersed form of the binomial. If the sampling distribution is Poisson, a gamma-distributed prior leads to a gamma-distributed posterior; the predictive posterior distribution is negative-binomial, the overdispersed form of the Poisson.
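The beta-binomial case can be worked through in a few lines. This is a minimal sketch, not code from the LingPipe post: the function names, the uniform Beta(1, 1) prior, and the data (7 successes in 10 trials, then 5 new trials) are all made up for illustration.

```python
from math import comb, exp, lgamma


def log_beta(a, b):
    """Log of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def posterior_params(a, b, successes, trials):
    """Beta(a, b) prior plus binomial data gives a Beta posterior."""
    return a + successes, b + trials - successes


def beta_binomial_pmf(k, m, a, b):
    """P(k successes in m new trials) with theta ~ Beta(a, b) integrated out."""
    return comb(m, k) * exp(log_beta(k + a, m - k + b) - log_beta(a, b))


# Hypothetical data: uniform prior, 7 successes observed in 10 trials.
a_post, b_post = posterior_params(1.0, 1.0, successes=7, trials=10)

# Posterior predictive distribution over the number of successes in 5 new trials.
predictive = [beta_binomial_pmf(k, 5, a_post, b_post) for k in range(6)]
predictive_mean = sum(k * p for k, p in enumerate(predictive))
```

The conjugate update is just addition of counts to the prior parameters; the predictive probabilities come from the beta-binomial formula rather than from plugging a single value of theta into the binomial.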

Point Estimate Approximations

An approximate alternative to full Bayesian inference uses p(\tilde{y}|\theta^*) to predict new data \tilde{y}, where \theta^* is a point estimate.

The maximum of the posterior distribution provides the so-called maximum a posteriori (MAP) estimate,

\theta^* = \arg\max_{\theta} p(\theta|y) = \arg\max_{\theta} p(y|\theta) \, p(\theta)

If the prior is uniform, the MAP estimate coincides with the maximum likelihood estimate (MLE), so called because it maximizes the likelihood of the data p(y|\theta). The MLE is popular among non-Bayesian statisticians because a uniform prior only contributes a constant factor and may therefore be dropped from the optimization.

By definition, the unbiased estimator for the parameter is the expected value of the posterior,

\bar{\theta} = {\mathbb E}_{p(\theta|y)}[\theta] = \int_{\Theta} \theta \, p(\theta|y) \, d\theta
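For a beta posterior the three point estimates have closed forms, so they are easy to compare. The sketch below assumes a binomial sampling distribution with made-up data (7 successes in 10 trials) and two illustrative priors; none of these numbers come from the post.

```python
def map_estimate(a, b):
    """Mode of Beta(a, b); this closed form requires a > 1 and b > 1."""
    return (a - 1) / (a + b - 2)


def posterior_mean(a, b):
    """Expected value of Beta(a, b)."""
    return a / (a + b)


def mle(successes, trials):
    """Maximum likelihood estimate of a binomial success probability."""
    return successes / trials


successes, trials = 7, 10

# Uniform Beta(1, 1) prior: the posterior is Beta(8, 4).
a_u, b_u = 1 + successes, 1 + trials - successes
map_uniform = map_estimate(a_u, b_u)
mean_uniform = posterior_mean(a_u, b_u)

# Informative Beta(5, 5) prior (hypothetical): the posterior is Beta(12, 8).
a_i, b_i = 5 + successes, 5 + trials - successes
map_informative = map_estimate(a_i, b_i)
```

With the uniform prior the MAP estimate equals the MLE, while the posterior mean is pulled slightly toward 1/2; with the informative prior the MAP estimate moves away from the MLE as well.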

Point estimates may be reasonably accurate if the posterior has low variance. If the posterior is diffuse, prediction with point estimates tends to be underdispersed, in the sense of underestimating the variance of the predictive distribution. This is a kind of overfitting which, unlike the usual situation of overfitting due to model complexity, arises from the oversimplification of the variance component of the predictive model.
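The underdispersion can be made concrete in the beta-binomial setting. This sketch compares the variance of a plug-in binomial prediction (using the posterior mean as the point estimate) with the variance of the full beta-binomial predictive distribution; the posterior parameters are invented for illustration.

```python
def plug_in_variance(m, a, b):
    """Variance of Binomial(m, theta) with theta fixed at the Beta(a, b) mean."""
    theta = a / (a + b)
    return m * theta * (1 - theta)


def predictive_variance(m, a, b):
    """Variance of the beta-binomial predictive for m new trials."""
    return m * a * b * (a + b + m) / ((a + b) ** 2 * (a + b + 1))


m = 10  # number of new trials to predict

# Diffuse posterior Beta(2, 2): the point estimate hides most of the spread.
diffuse_plug_in = plug_in_variance(m, 2, 2)
diffuse_full = predictive_variance(m, 2, 2)

# Concentrated posterior Beta(200, 200): the two variances nearly agree.
tight_plug_in = plug_in_variance(m, 200, 200)
tight_full = predictive_variance(m, 200, 200)
```

The ratio of the two variances is (a + b + m) / (a + b + 1), which is large when the posterior is diffuse and approaches 1 as the posterior concentrates.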

 
 


 
 
Sep
03

Reminder, Lucene has many Query types

– TermQuery, BooleanQuery, ConstantScoreQuery, MatchAllDocsQuery, MultiPhraseQuery, FuzzyQuery, WildcardQuery, RangeQuery, PrefixQuery, PhraseQuery, Span*Query, DisjunctionMaxQuery, etc.

Lucene ships with a range of Query implementations, which makes it very powerful for search. However, you should be very careful with queries like RangeQuery, especially when your collection is very large.

As you know, Lucene rewrites the original Query before searching, but some of the rewrite implementations can be inefficient. Let’s look at the code snippet from RangeQuery first.

    public RangeQuery(Term lowerTerm, Term upperTerm, boolean inclusive,
                      Collator collator)
    {
      this(lowerTerm, upperTerm, inclusive);
      this.collator = collator;
    }

    public Query rewrite(IndexReader reader) throws IOException {

      // The range is expanded into one clause per matching term.
      BooleanQuery query = new BooleanQuery(true);
      String testField = getField();
      if (collator != null) {
        // Start enumerating at the first term of the field.
        TermEnum enumerator = reader.terms(new Term(testField, ""));
        String lowerTermText = lowerTerm != null ? lowerTerm.text() : null;
        String upperTermText = upperTerm != null ? upperTerm.text() : null;

        try {
          do {
            Term term = enumerator.term();
            if (term != null && term.field() == testField) { // interned comparison
              if ((lowerTermText == null
                   || (inclusive ? collator.compare(term.text(), lowerTermText) >= 0
                                 : collator.compare(term.text(), lowerTermText) > 0))
                  && (upperTermText == null
                      || (inclusive ? collator.compare(term.text(), upperTermText) <= 0
                                    : collator.compare(term.text(), upperTermText) < 0))) {
                addTermToQuery(term, query);
              }
            }
          }
          while (enumerator.next()); // no early exit: every remaining term is visited
        }
        finally {
          enumerator.close();
        }
      }
      ……………
    }

As we can see from the source code, a RangeQuery may be rewritten into thousands of TermQuery clauses. This makes the search inefficient and can even raise a TooManyClauses exception. In addition, the rewrite method in RangeQuery traverses the entire term dictionary, which is another reason why RangeQuery slows the search down.
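The blow-up is easy to reproduce in a toy model. The sketch below is plain Python, not Lucene's API: the function and exception names are invented, the term dictionary is a made-up sorted list, and the limit of 1024 mirrors the default maximum clause count of Lucene's BooleanQuery.

```python
MAX_CLAUSE_COUNT = 1024  # default clause limit of Lucene's BooleanQuery


class TooManyClauses(Exception):
    """Toy counterpart of the exception raised when the limit is exceeded."""


def rewrite_range(terms, lower, upper, inclusive=True,
                  max_clauses=MAX_CLAUSE_COUNT):
    """Expand a range into one clause per matching term, scanning every term."""
    clauses = []
    for term in terms:  # no early exit, as in the excerpt above
        above = lower is None or (term >= lower if inclusive else term > lower)
        below = upper is None or (term <= upper if inclusive else term < upper)
        if above and below:
            if len(clauses) >= max_clauses:
                raise TooManyClauses(f"more than {max_clauses} clauses")
            clauses.append(term)
    return clauses


# Hypothetical dictionary of 5,000 zero-padded terms.
terms = [f"{i:08d}" for i in range(5000)]

narrow = rewrite_range(terms, "00000010", "00000019")  # 10 clauses

try:
    rewrite_range(terms, "00000000", "00004999")       # 5,000 matching terms
    wide_failed = False
except TooManyClauses:
    wide_failed = True
```

A narrow range stays well under the limit, but a range covering the whole dictionary fails once the 1,025th matching term is reached.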

In contrast to RangeQuery, RangeFilter does this job faster. Although RangeFilter also traverses the term dictionary, it does not run an additional search clause for every matching term as RangeQuery does.

The RangeFilter implementation in Lucene does not consume much memory. It keeps one bit per document, so a collection of 10M documents needs only about 1.25 MB (12.5 MB would cover roughly 100M documents). For these reasons, I recommend using RangeFilter rather than RangeQuery.

In fact, ConstantScoreRangeQuery is a wrapper around RangeFilter that lets us run a range search as a query. ConstantScoreRangeQuery returns a constant score equal to its boost for all documents in the range. It is a better choice than RangeQuery when we only want to restrict the result set to a range, rather than let the range contribute to the ranking score.

Note: FuzzyQuery, WildcardQuery, RangeQuery and PrefixQuery are implemented in much the same way, so be careful when using them too.
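The memory figure follows from simple arithmetic if the filter stores one bit per document. That is an assumption of this sketch, and the code is a toy model in plain Python rather than Lucene's implementation: it walks documents directly, whereas a real filter would walk the matching terms and their postings.

```python
def range_filter_bits(doc_terms, lower, upper):
    """Return a bitset (bytearray) with bit i set when document i is in range.

    doc_terms[i] is the (hypothetical) field value of document i.
    """
    bits = bytearray((len(doc_terms) + 7) // 8)
    for doc_id, term in enumerate(doc_terms):
        if lower <= term <= upper:
            bits[doc_id >> 3] |= 1 << (doc_id & 7)
    return bits


def bitset_megabytes(num_docs):
    """Size of a one-bit-per-document bitset, in megabytes (10**6 bytes)."""
    return num_docs / 8 / 1_000_000


# Small made-up collection: 20 documents with zero-padded field values.
doc_terms = [f"{i:04d}" for i in range(20)]
bits = range_filter_bits(doc_terms, "0005", "0009")
matching_docs = [i for i in range(20) if bits[i >> 3] & (1 << (i & 7))]

size_10m = bitset_megabytes(10_000_000)
size_100m = bitset_megabytes(100_000_000)
```

The size depends only on the number of documents, not on how many terms fall inside the range, which is why the filter avoids the clause explosion of the rewritten query.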

Aug
15
 
 


 
 

via 王晓阳 by 王晓阳 on 8/13/09

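How many Japanese troops did China’s armies eliminate in the war?

    The Sino-Japanese War was the event that decisively changed the course of modern Chinese history. Its influence on Chinese history far exceeds that of the 1911 Revolution, the May Fourth Movement, or the Northern Expedition, and its outcome still weighs heavily on China’s situation today.

    Tomorrow, August 15, is the anniversary of Japan’s surrender and the end of World War II. Many countries will mark it, and China is no exception. The question is: who is entitled to celebrate?

    For years, mainland textbooks claimed that the Nationalist government did not resist Japan and that only the Chinese Communist Party (CCP) did; later they gradually acknowledged that the Nationalist government was the main force of resistance. Now some people say that the CCP did not resist Japan at all. What actually happened?

    The claim that “the CCP did not resist Japan at all” is too sweeping. At the time, the Soviet Union had already instructed the CCP to resist Japan, the CCP had put forward the anti-Japanese slogan “Defend the Soviet Union”, and it did fight some battles.

    Let the numbers speak. If you cannot even say how many invaders you eliminated, how can you claim the credit?

    Historians must have a conscience and do right by the anti-Japanese martyrs who lie buried. Those who claim heaven’s achievements as their own will be struck by lightning.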

 

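1. How many Japanese troops did the government army and the Communist army each eliminate?

    During the War of Resistance, Japanese forces in China numbered nearly 2 million at their peak; this figure is essentially undisputed. What is disputed is how many Japanese soldiers died, and several figures exist. American scholars, calculating from Japan’s wartime statistics, put the number of Japanese soldiers killed on the mainland at a little over 440,000. 张忠义 (Zhang Zhongyi), a specialist in the history of the war who drew extensively on Japanese military sources, arrived at a similar figure, 455,000. 何应钦 (He Yingqin), Chief of the General Staff of the Nationalist army, gave 480,000 in 《八年抗战》, while the Military Museum of the Chinese People’s Revolution uses 550,000, a figure compiled after 1949. Some scholars disagree: Professor 刘大年 (Liu Danian) of the Chinese Academy of Social Sciences, for example, calculated from Nationalist battlefield statistics that more than 1 million Japanese soldiers were killed in China.

    It must be noted that the Soviet Union later rushed troops into northeast China to grab territory and eliminated about 600,000 men of the Kwantung Army. That was two robbers fighting over territory in China; it is no achievement, and still less can it be credited to Chinese forces. Moreover, the Japanese and some overseas scholars at the time regarded the northeast as Manchukuo, so Kwantung Army deaths have never been counted in the China theater strictly defined.

    The authoritative Japanese historian 伊藤正德, author of 《帝国陆军史》, records in his book that 789,370 Japanese soldiers were killed in China. This figure is comparatively credible.

    At the time, apart from the Chinese government army, only the Communist Party had armed forces. So how many Japanese troops did each of them eliminate?

    Some mainland scholars now tend to accept 伊藤正德’s data: the Communist-led forces eliminated 200,000 Japanese troops, and the government army the other 589,370. Two hundred thousand is fewer than the mainland used to claim, but it at least shows that the assertion “the CCP did not resist Japan at all” is wrong. The CCP did resist Japan.

    Others favor a total of 440,000 Japanese dead: 400,000 eliminated by the National Revolutionary Army, 20,000 by the Communist army, and 20,000 dead from other causes.

    Eliminating at most fewer than 790,000 Japanese troops in eight years of war is embarrassing; the Soviets eliminated 600,000 in the short time they spent seizing territory in China. Given the weapons and equipment of the Chinese armies at the time, though, it is understandable.

    Note: Japanese troops eliminated by the National Revolutionary Army’s expeditionary force in Burma and elsewhere are not included.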

 

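2. Battle reports of the government army and the Communist army

    It must be borne in mind that the Japanese army did its best to understate the casualties it announced, while the Nationalist and Communist armies inflated their counts of enemy eliminated.

  Selected Eighth Route Army results compared with Japanese battle reports

  1. Pingxingguan engagement
  Eighth Route Army report: over 1,000 Japanese troops annihilated
  Japanese report: 167 Japanese killed, 94 wounded (儿岛襄, 《日中战争》, 文艺春秋, 1984 edition)

  2. Guangyang ambush
  Eighth Route Army report: over 1,000 Japanese troops eliminated
  Japanese report: 63 Japanese casualties (臼井胜美, 《中日战争》)

  3. Jin-Cha-Ji region, defeat of the eight-route converging attack
  Eighth Route Army report: over 2,000 Japanese and puppet troops annihilated
  Japanese report: 17 Japanese killed, 52 wounded; 69 collaborationist casualties (臼井胜美, 《中日战争》)

  4. Three sabotage raids on the Beiping-Hankou Railway
  Eighth Route Army report: over 1,200 Japanese and puppet troops annihilated
  Japanese report: 2 Japanese killed, 11 wounded; no collaborationist casualties reported (《支那事变陆军作战》)

  1938

  5. Central Hebei, spring 1938 campaign against the “mopping-up”
  Eighth Route Army report: over 1,000 Japanese and puppet troops annihilated
  Japanese report: 6 Japanese killed, 26 wounded; 71 collaborationist casualties (《华北治安战》)

  6. 120th Division campaign to recover seven towns in northwestern Shanxi
  Eighth Route Army report: over 1,500 Japanese and puppet troops annihilated
  Japanese report: 22 Japanese killed, 51 wounded; 101 collaborationist casualties (《华北治安战》)

  7. Yi(xian)-Lai(yuan) engagement
  Eighth Route Army report: over 1,400 Japanese and puppet troops eliminated
  Japanese report: 9 Japanese killed, 22 wounded; 40 collaborationist casualties (《支那事变陆军作战》)

  8. 129th Division, southeastern Shanxi, against the Japanese nine-route converging attack
  Eighth Route Army report: over 4,000 Japanese and puppet troops eliminated
  Japanese report: 11 Japanese killed, 10 wounded; 79 collaborationist casualties (《华北治安战》)

  9. Jin-Cha-Ji region, autumn 1938 counter-encirclement
  Eighth Route Army report: over 5,000 Japanese and puppet troops killed or wounded
  Japanese report: 39 Japanese killed, 132 wounded; 107 collaborationist casualties (臼井胜美, 《中日战争》)

  10. Central Hebei, five counter-encirclement operations
  Eighth Route Army report: over 5,500 Japanese and puppet troops eliminated
  Japanese report: 21 Japanese killed, 65 wounded; 99 collaborationist casualties (臼井胜美, 《中日战争》)

  11. Southern Hebei, 1938 campaign against the “mopping-up”
  Eighth Route Army report: over 600 Japanese and puppet troops killed or captured
  Japanese report: 3 Japanese killed, 11 wounded; 16 collaborationist casualties (臼井胜美, 《中日战争》)

  1939

  12. Southern Hebei, spring campaign against the eleven-route “mopping-up”
  Eighth Route Army report: over 3,000 Japanese and puppet troops eliminated
  Japanese report: 37 Japanese killed, 70 wounded; 81 collaborationist casualties (臼井胜美, 《中日战争》)

  13. 115th Division breakout at Lufang
  Eighth Route Army report: over 1,300 Japanese and puppet troops killed or wounded
  Japanese report: 10 Japanese killed, 122 wounded; 67 collaborationist casualties (《华北治安战》)

  14. Wutai Mountains, May 1939 counter-encirclement
  Eighth Route Army report: over 800 men of the Japanese 宫崎 unit annihilated
  Japanese report: 4 Japanese killed, 27 wounded (《华北治安战》)

  15. Taihang region, summer 1939 campaign against the “mopping-up”
  Eighth Route Army report: over 2,000 Japanese and puppet troops eliminated
  Japanese report: 7 Japanese killed, 37 wounded; 70 collaborationist casualties (《华北治安战》)

  16. Central Hebei, winter 1939 campaign against the “mopping-up”
  Eighth Route Army report: over 2,500 Japanese and puppet troops eliminated
  Japanese report: 27 Japanese killed, 89 wounded; 71 collaborationist casualties (《华北治安战》)

  17. Beiyue region, winter 1939 campaign against the “mopping-up”
  Eighth Route Army report: over 3,600 Japanese and puppet troops killed or wounded
  Japanese report: 9 Japanese killed, 34 wounded; 95 collaborationist casualties (《华北治安战》)

  1940

  18. Pingxi region, spring 1940 campaign against the “mopping-up”
  Eighth Route Army report: over 800 Japanese and puppet troops annihilated, one Japanese aircraft shot down
  Japanese report: 8 Japanese killed, 40 wounded; 22 collaborationist casualties (《华北治安战》)

  19. Central Hebei, spring 1940 operations against the all-out “mopping-up”
  Eighth Route Army report: over 3,000 Japanese and puppet troops killed or wounded
  Japanese report: 11 Japanese killed, 91 wounded; 62 collaborationist casualties (《华北治安战》)

  20. Baodugu mountain area campaign against the “mopping-up” (also called the southern Shandong 1940 campaign against the “mopping-up”)
  Eighth Route Army report: over 2,200 Japanese and puppet troops killed or wounded
  Japanese report: 9 Japanese killed, 60 wounded; 58 collaborationist casualties (《华北治安战》)

  21. 129th Division sabotage operation on the Bai-Jin Railway
  Eighth Route Army report: over 600 Japanese and puppet troops eliminated
  Japanese report: 2 Japanese killed, 9 wounded; 12 collaborationist casualties (《华北治安战》)

  22. Northwestern Shanxi, summer 1940 campaign against the “mopping-up”
  Eighth Route Army report: over 4,490 Japanese and puppet troops killed or wounded, 53 captured (including 11 Japanese)
  Japanese report: 37 Japanese killed, 107 wounded, 3 missing; 201 collaborationist casualties and missing (《华北治安战》)

  23. Central Hebei, summer 1940 “Green Curtain” (青纱帐) campaign
  Eighth Route Army report: over 2,100 Japanese and puppet troops killed or wounded, over 500 puppet troops captured
  Japanese report: 19 Japanese killed, 22 wounded; 39 collaborationist casualties (《华北治安战》)

  24. Hundred Regiments Offensive
  Eighth Route Army report: over 20,000 Japanese and over 5,000 puppet troops killed or wounded; over 280 Japanese and over 18,000 puppet troops captured
  Japanese report: 302 killed, 1,719 wounded; 1,202 collaborationist casualties and missing (《华北治安战》)

  25. Taihang region, autumn 1940 campaign against the “mopping-up”
  Eighth Route Army report: over 2,800 Japanese and puppet troops eliminated
  Japanese report: 29 Japanese killed, 60 wounded; 44 collaborationist casualties (《华北治安战》)

  26. Central Hebei, winter 1940 offensive
  Eighth Route Army report: over 2,300 Japanese and puppet troops eliminated
  Japanese report: 10 Japanese killed, 27 wounded; 59 collaborationist casualties (《华北治安战》)

  27. Taiyue, winter 1940 campaign against the “mopping-up”
  Eighth Route Army report: over 260 Japanese and puppet troops eliminated
  Japanese report: 7 Japanese wounded; 15 collaborationist casualties (《华北治安战》)

  28. Northwestern Shanxi, winter 1940 campaign against the “mopping-up”
  Eighth Route Army report: over 2,500 Japanese and puppet troops killed or wounded
  Japanese report: 8 Japanese killed, 44 wounded; 102 collaborationist casualties (《华北治安战》)

  Nationalist army

  1. Battle of Shanghai (Songhu)
  Nationalist report (1937): over 60,000 Japanese casualties; 孙元良 (Sun Yuanliang) personally estimated in 2005 that Japanese casualties were 40,000 to 50,000.
  Japanese report: in 1937 Japan announced 9,115 of its own dead and 31,157 wounded, 40,672 casualties in total.

  2. Battle of Taiyuan
  Nationalist report: over 40,000 Japanese killed or wounded
  Japanese report: over 26,000 Japanese casualties (《中国事变陆军作战史》)

  3. Defense of Nanjing
  Nationalist report: over 15,000 Japanese killed or wounded
  Japanese report: over 7,600 Japanese casualties (《中国事变陆军作战史》)

  4. Battle of Xuzhou
  Nationalist report: over 50,000 Japanese killed or wounded
  Japanese report: in 1937 Japan admitted over 32,000 casualties

  5. Battle of Wuhan
  Nationalist report: over 200,000 Japanese killed or wounded
  Japanese report: over 30,000 casualties of its own, plus over 67,000 lost to illness (《中国事变陆军作战》)

  6. Battle of Suixian-Zaoyang
  Nationalist report: over 40,000 Japanese killed or wounded
  Japanese report: over 13,000 Japanese casualties (《中国事变陆军作战》)

  7. Battle of Zaoyang-Yichang
  Nationalist report: 23,000 Japanese killed or wounded
  Japanese report: over 9,000 Japanese casualties (《中国事变陆军作战》)

  8. Battle of Nanchang
  Nationalist report: 12,000 Japanese killed or wounded
  Japanese report: over 9,000 Japanese casualties (《中国事变陆军作战》)

  13. Battle of Shanggao
  Nationalist report: 20,000 Japanese killed or wounded
  Japanese report: over 9,000 Japanese casualties, 6,000 lost to illness (《中国事变陆军作战》)

  14. Battle of Southern Shanxi (Zhongtiao Mountains)
  Nationalist report: 9,900 Japanese killed or wounded
  Japanese report: Japanese losses of 670 killed and 2,292 wounded (《中国事变陆军作战》)

  15. Second Battle of Changsha
  Nationalist report: over 20,000 Japanese killed or wounded (some say 40,000)
  Japanese report: over 7,000 Japanese casualties (《中国事变陆军作战》)

  16. Third Battle of Changsha
  Nationalist report: over 50,000 Japanese killed or wounded
  Japanese report: 6,000 casualties, of which 1,600 dead (《中国事变陆军作战》)

  17. Zhejiang-Jiangxi campaign
  Nationalist report: over 30,000 Japanese killed or wounded
  Japanese report: 17,148 Japanese casualties (《中国事变陆军作战》)

  18. Battle of Western Hubei
  Nationalist report: over 40,000 Japanese killed or wounded
  Japanese report: Japanese losses of over 4,000 (《中国事变陆军作战》)

  19. Battle of Changde
  Nationalist report: over 50,000 Japanese killed or wounded
  Japanese report: Japanese losses of over 20,000 (《中国事变陆军作战》)

  20. Battle of Central Henan
  Nationalist report: over 4,000 Japanese killed or wounded
  Japanese report: Japanese losses of 3,350 (《中国事变陆军作战》)

  21. Battle of Changsha-Hengyang
  Nationalist report: over 60,000 Japanese killed or wounded
  Japanese report: Japanese losses of over 60,000 (the two sides’ figures are strikingly similar) (《中国事变陆军作战》)

  22. Battle of Guilin-Liuzhou
  Nationalist report: over 30,000 Japanese killed or wounded
  Japanese report: Japanese losses of over 16,000 (《战史丛书–大本营陆军部》)

  23. Battle of Northern Burma
  Nationalist report: over 90,000 Japanese killed or wounded
  Japanese report: over 40,000 Japanese casualties (《中国事变陆军作战》)

  Note: 《中国事变陆军作战》 and 《支那事变陆军作战》 are the same book, compiled by Japan’s Defense Agency in the 1960s and 1970s and used as a textbook in Japanese military academies. All the Japanese-side material above comes from within Japan.
    The Japanese side even knows the name of every casualty. Pity our nameless heroes.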

 

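3. Which side killed the Japanese generals who died?

    Overall head counts are easy to pad; dead generals are much harder to fake. Look at the numbers: 129 Japanese general officers died in the Sino-Japanese War. Leaving aside those who died of illness, suicide, or air accidents, and those killed by Soviet-Mongolian forces or the Sino-American joint air force, 50 generals died at the hands of Chinese forces: 45 killed by the Nationalist army and 5 by the Communist army, one of them by assassination.

    In combat with the National Revolutionary Army:
  林大八, Army Major General, March 1, 1932, died in Shanghai.
  仓永辰治, Army Major General, August 29, 1937, died at Wusong, Shanghai.
  家纳治雄, Army Major General, October 11, 1937, died in Shanghai.
  浅野嘉一, Army Major General, November 14, 1937, died of wounds in Tianjin.
  加藤仁太郎, Navy Rear Admiral, July 31, 1938, died on the lower Yangtze.
  杵春久藏, Army Major General, August 2, 1938, died at Yuncheng, Shanxi.
  饭冢国五郎, Army Major General, September 3, 1938, died at De’an, Jiangxi.
  小笠原数夫, Army Air Service Lieutenant General, September 4, 1938, aircraft destroyed at Xiaogan, Hubei.
  饭野贤十, Army Major General, March 22, 1939, died at Nanchang.
  山田喜藏, Army Major General, May 12, 1939, died in the Dahong Mountains, Hubei.
  田路朝一, Army Lieutenant General, June 17, 1939, died in southern Anhui.
  小林一男, Army Major General, December 21, 1939, died at Anbei, Inner Mongolia.
  中村正雄, Army Lieutenant General, December 25, 1939, died at Kunlun Pass, Guangxi.
  秋山静太郎, Army Major General, January 23, 1940, died in Shandong.
  左藤谦, Army Major General, March 2, 1940, died at Poyang Lake, Jiangxi.
  木谷资俊, Army Lieutenant General, March 20, 1940, died in Jiangxi.
  水川伊夫, Army Lieutenant General, March 22, 1940, died at Wuyuan, Inner Mongolia.
  前田治, Army Lieutenant General, May 23, 1940, died at Jincheng, Shanxi.
  藤堂高英, Army Lieutenant General, June 3, 1940, died at Ruichang, Jiangxi.
  大冢彪雄, Army Lieutenant General, August 5, 1940, died in southeastern Shanxi.
  井山官一, Army Major General, October 16, 1940, died at Yichang, Hubei.
  大角芩生, Navy Admiral, February 5, 1941, aircraft destroyed at Zhongshan, Guangdong.
  须贺彦次郎, Navy Vice Admiral, February 5, 1941, aircraft destroyed at Zhongshan, Guangdong.
  上田胜, Army Major General, May 13, 1941, died in the Zhongtiao Mountains, Shanxi.
  山县业一, Army Lieutenant General, December 25, 1941, died in Anhui.
  酒井直次, Army Lieutenant General, May 28, 1942, died at Nanxi, Zhejiang.
  冢田攻, Army General, December 18, 1942, died at Taihu, Anhui.
  藤原武, Army Major General, December 18, 1942, died at Taihu, Anhui.
  浅野克己, Army Major General, May 1943, died at Dongjiang, Guangdong.
  仁科馨, Army Major General, June 1, 1943, died in Hunan.
  黑川邦辅, Army Major General, June 28, 1943, died in Yunnan.
  布上照一, Army Major General, November 23, 1943, died at Changde, Hunan.
  中?护一, Army Major General, November 25, 1943, died at Changde, Hunan.
  下川义忠, Army Lieutenant General, April 19, 1944, died at Yingcheng, Hubei.
  横山武彦, Army Lieutenant General, June 11, 1944, died at Longyou, Zhejiang.
  木村千代太, Army Lieutenant General, June 11, 1944, died in Henan.
  和尔基隆, Army Major General, July 21, 1944, died at Hengyang, Hunan.
  大桥彦四郎, Army Major General, July 25, 1944, died in the Battle of Changsha-Hengyang, Hunan.
  左治直影, Army Major General, July 27, 1944, died at Jingzhou, Hubei.
  志摩源吉, Army Lieutenant General, August 6, 1944, died at Hengyang, Hunan.
  藏重康美, Army Major General, August 16, 1944, died at Tengchong, Yunnan.
  南野丰重, Army Major General, September 8, 1944, died at Mangshi, Yunnan.
  与野山寿, Army Major General, February 9, 1945, died in central China.
  山县正乡, Navy Admiral, March 7, 1945, died at Jiaojiang, Zhejiang.

    In combat with the Eighth Route Army:

  沼田德重, Army Lieutenant General, August 12, 1939, wounded by the Eighth Route Army and died in Shandong.
  阿部规秀, Army Lieutenant General, November 7, 1939, killed fighting the Eighth Route Army at Laiyuan, Hebei.
  吉川贞佐, Army Major General, May 17, 1940, assassinated by a Communist Party member in Kaifeng, Henan.
  饭田泰次郎, Army Lieutenant General, November 28, 1940, killed fighting the Eighth Route Army in north China.
  吉川资, Army Major General, May 7, 1945, killed fighting the Eighth Route Army on the Shandong Peninsula.

    War costs lives. So how many men did the Nationalist and Communist armies each lose, and by how much had each side’s forces shrunk or grown by the end of the war? That is the subject of the next article.

Links:

    《抗日战争:掉进了苏联陷阱》 (The War of Resistance: Falling into the Soviet Trap)

    《多少中国军人死于抗日战争?》 (How Many Chinese Soldiers Died in the War of Resistance?)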

 

 
 


 
 
Aug
12

The internet at sort-of-40. How did we get here?

We’re looking to compile a history of the internet, by the internet. Want to help?


The internet is sort-of-40 this year. Not in the sense of a Hollywood actor who is in reality much older but prefers to act vague, however. In the sense that if you set the October 1969 networking of US research universities through Arpanet as the start point then it is a significant birthday.

To mark this, we want to tell the internet’s story. This is not the first time this has been done and will not be the last, but we want to tell the story of the internet using the internet – that is, the people who use it.

Below there is a list of 30 events from the past 40 years – encompassing the technological development of the internet and some of the impact it has had on culture, business, politics and society. Some of that makes for entertaining reading – reaction to the first piece of spam (a US army major gets involved) or the 1982 conversation that led to the first use of the :-) emoticon.

But these 30 events are not the only ones that mattered. There is no YouTube on here, nothing of Barack Obama’s use of the web for fundraising – and that is intentional. We’d like to know what you think is significant.

At the bottom of this page is a form where we would like you to nominate events memorable to you, be they ones we may already know about or something more personal such as the first websites you used or emails you sent. Our list is, for example, light on social media moments or internet dating. Or the thrill of a first Geocities site.

Maybe you did some of this pioneering work in the early days of the internet and want to talk about it. Whatever your experiences, we’d like to hear from you.

Where will it end? Well, this is a work in progress. But we will publish updates to the list and this autumn hope to produce an impressive told-by-the-people version of the internet story.

And here is the list of 30 …

1969 | Arpanet starts | Computers at two academic departments in California are linked by Arpanet, the predecessor of the internet
1971 | @ | Ray Tomlinson devises electronic mail for Arpanet. He settles on @ to separate the name of the user from the name of their computer
1971 | Project Gutenberg | Michael Hart begins a project to make copyright-free works electronically available. The first text is the US Declaration of Independence, now archived as gutenberg.org/etext/1
1971 | Expansion | The network is now connecting 23 hosts
1973 | ARPAWOCKY | Early network humour: Twas brillig, and the Protocols / Did USER-SERVER in the wabe. / All mimsey was the FTP, / And the RJE outgrabe
1973 | To Europe | Norway is connected to Arpanet via Norsar, a US-Norwegian network to relay information on earthquakes and nuclear explosions. From Norway, a connection goes to University College London
1974 | TCP/IP | Vint Cerf and others publish a proposal to link up Arpa-like networks. It has no central control and is built around a protocol (TCP/IP) for the exchange of data
1976 | Royal email | Queen Elizabeth sends her first email on a visit to the MoD’s scientific research hub
1978 | Spam | Gary Thuerk sends what is now considered the first unsolicited commercial email. Major Raymond Czahor of the US defence communications agency assures Arpanet users it will not happen again
1978 | Bulletin boards | The first bulletin board is developed during a particularly bad blizzard in Chicago. Ward Christensen’s creation allows computer users with a modem to talk to each other and exchange software and data
1982 | :-) | Scott Fahlman proposes the use of :-) after a joke, beating off rivals including %, * and {#} – said to be ‘like two lips with teeth showing between them’
1983 | Internet begins? | 1 January is the cut-off point for computers to use Cerf’s transmission control protocol (TCP). Cerf estimates this involved between 200 and 400 hosts
1984 | Lots more connections | The number of hosts breaks 1,000, Japan establishes Junet, the UK begins Janet (the joint academic network) and the Soviet Union connects to Usenet
1984 | The Well | It calls itself ‘the primordial ooze where the online community movement was born’. A Guardian profile of The Well’s co-founder Stewart Brand said it was ‘where most of the discoveries of cyberspace were first made’
1985 | .com | The domain name that for many defines the web is created. The oldest .com registration still in existence belongs to Virginia-based Symbolics
1989 | Start of the web | Tim Berners-Lee proposes to his bosses at Cern a document retrieval system to run on the internet. His mechanism will use hypertext to make a file in one location appear as if it is in a window on another
1990 | Archie | Considered the first internet search engine, Archie is created by Canadian university student Alan Emtage. It allows users to match queries against file names (not the content of those files, that was still to come)
1990 | Internet toaster | A toaster becomes the first remotely-operated machine connected to the internet. A single control – power on or power off – is used to control grilling. It still requires a human to insert the bread
1991 | First web page published | The web goes public. Its first page explains it is a ‘wide-area hypermedia information retrieval initiative’
1991 | Webcam coffee | A coffee pot in a Cambridge University computer lab is the inspiration for the world’s first webcam. It allows people in other parts of the building to avoid pointless trips when it is empty
1992 | L0pht | The Boston-based hacker collective is founded
1994 | Yahoo! | Jerry and David’s Guide to the World Wide Web is launched. In time it is renamed Yahoo!
1995 | Amazon.com | The internet bookseller goes online. By the final quarter of 2001 it turns a profit – a little behind its plan for profitability within four to five years, but is still considered an exceptional dotcom performer
1996 | Proto-Google | Larry Page and Sergey Brin, PhD students at Stanford, begin work on BackRub, a search engine that ranks websites according to the number of links to them. It is incorporated as Google in 1998
1999 | ‘Celestial jukebox’ | Shawn Fanning’s Napster application launches. It allows users to share music files on each other’s computers
1999 | MI6 names leaked | The uncontrollable nature of the internet is brought to attention when the names of more than 100 MI6 agents are leaked to a US website. Despite being taken down, the names spread across other sites
2001 | Wikipedia | It proclaims itself a collaborative encyclopedia. Eight years after launch it is now the most popular reference work online
2001 | SETI@Home | A project to harness the distributed processing power of the internet gathers enough volunteers within four weeks to surpass the most powerful supercomputer of its time
2004 | The war on spam | Bill Gates tells the World Economic Forum at Davos that spam will be eradicated within two years. It isn’t
2005 | First spam conviction | Jeremy Jaynes is sentenced to nine years in prison and his sister, Jessica DeGroot, fined $7,500
2006 | Twitter | The 140 character service launches. Many who initially try it think it pointless. By 2009 it is credited with transmitting news of Iranian protests to the outside world

You may notice the launch of Twitter is the final item on this list. That is not to suggest that it is the final perfection of the internet (just to be clear).

Aug
09
As IR datasets keep growing in size, a powerful platform for rapid indexing and searching is needed. Ivory is a newly announced search platform built on Hadoop. It could be a good choice as we enter the billion-page era.

This would also be a future step for our SaberLucene Project (under release). Besides the MapReduce framework, we would also like to integrate the Indri Query Language into SaberLucene. After these two major steps, we can expect a first release of SaberLucene. Any help will be appreciated.

——————————————-

The Ivory Toolkit with the SMRF Retrieval Engine

Ivory is a Hadoop toolkit for Web-scale information retrieval research that features a retrieval engine based on Markov Random Fields, appropriately named SMRF (Searching with Markov Random Fields). This open-source project began in Spring 2009 and represents a collaboration between the University of Maryland and Yahoo! Research. Ivory takes full advantage of the Hadoop distributed environment (the MapReduce programming model and the underlying distributed file system) for both indexing and retrieval.

In order to temper expectations, please note that Ivory is not meant to serve as a full-featured search engine (e.g., Lucene), but rather aimed at information retrieval researchers who need access to low-level data structures and who generally know their way around retrieval algorithms. As a result, a lot of “niceties” are simply missing—for example, fancy interfaces or ingestion support for different file types. It goes without saying that Ivory is a bit rough around the edges, but our philosophy is to release early and release often. In short, Ivory is experimental!

Ivory was specifically designed to work with Hadoop “out of the box” on the ClueWeb09 collection, a 1 billion page (25 TB) Web crawl distributed by Carnegie Mellon University. The initial release of Ivory is meant to serve as a reference implementation of indexing and retrieval algorithms that can operate at the multi-terabyte scale. Another interesting experimental aspect of Ivory is its retrieval architecture: we’ve been playing with retrieval engines that directly read postings from HDFS. The getting started guide with TREC disks 4-5 provides more details.

Download

Documentation

Aug
09

Structure the World

 
 


via AI3:::Adaptive Information by Mike on 8/3/09

The "Blue Marble": the Earth seen from Apollo 17 (image from Wikipedia.org)

Multiple Techniques and Data Structs can Make the Vision a Reality

Linked data and subject and domain ontologies provide the organizing framework. Techniques for converting, tagging and authoring structure provide the content. In combination, we now have in hand the necessary pieces to enable all of us to “structure the World.”

In this vision, the nature of the links or connections between data need not be complicated to gain tremendous benefit. Similar to Metcalfe’s Law for the increasing value of networks as more nodes (users) get added, adding connections to existing data is a powerful force multiplier.

We can call this the Linked Data Law: the value of a linked data network is proportional to the square of the number of links between data objects [1]. Further, if we are purposeful to include connective links where appropriate as we add more data (that is, nodes), this multiplier effect becomes even stronger.
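As a rough numerical illustration of the Linked Data Law stated above, the following minimal sketch assumes a proportionality constant of 1. The article gives no constant, so `k` and the function name are hypothetical. It only shows the scaling behavior: doubling the number of links quadruples the value.

```python
def linked_data_value(num_links, k=1.0):
    """Value of a linked data network under the Linked Data Law.

    The value is proportional to the square of the number of links.
    k is a hypothetical proportionality constant (the article gives none).
    """
    return k * num_links ** 2


# Doubling the number of links quadruples the value, whatever k is.
ratio = linked_data_value(200) / linked_data_value(100)
print(ratio)
```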

Structured Dynamics is dedicated to helping make this prospect real. Meaningful progress in doing so requires only relatively few moving parts or techniques. Yet, because we sometimes bounce between discussing one part and another, we can lose context or sight of the overarching vision. The purpose of this article is to re-set and calibrate that overall vision.

The Vision: Data Federation of Any Desired Content

The vision is to get all data and information to interoperate, regardless of legacy or form. Much of this data is already structured, either from databases or simpler forms of data structs. Some of this information is unstructured or semi-structured, requiring extraction and tagging techniques. And new information is being constantly generated, which warrants better means to author and stage for interchange and interoperability.

No matter the provenance, all information has context and scope. As a chunk from here, and a piece from there, gets added to our linked data mix, having means to characterize what that data is about and how it can be meaningfully inter-related becomes crucial. Sometimes these contexts are informed by existing schema; sometimes they are not. But, in any case, it is the role of ontologies to both position these datasets into an “aboutness” framework and to help guide how the data can be described and related to other data. This part of the vision invokes semantics and coherent structures (schema or ontologies) for positioning and mapping datasets to one another.

As both the means for representing any extant data format and as the means for describing these conceptual relationships or schema, RDF provides the canonical data model. A single target representation and common data model also means we can develop and design a smaller universe of tools to operate and provide functionality over all of this data. Indeed, because our RDF data model and its ontologies are so richly structured, we can design our tools with generic functionality, the specific operation and expression of which is based on the inherent structure within the data and its relationships. This vision of data-driven apps leads to extreme leverage, incredible flexibility, and inherent “meshup” capabilities for tools.

Further, because we use Web identifiers (URIs) for our data and concepts and because we expose and access this linked data via the Web, we use the proven and scalable architectures of the Web itself for how we design our systems. This Web-oriented architecture (WOA) provides a completely decentralized and loosely coupled deployment model that can work ranging from public and open to private and proprietary, applicable to data and participants alike.

From the outset, it is essential to recognize that thousands of contributors are enabling this vision. So, while Structured Dynamics naturally uses its own tools and techniques to flesh out the various parts of this vision below, realize there are many players and many tools from which to choose [2]. For that is another aspect of this vision that is quite powerful: providing choice and avoiding lock-in.

RDF: The Canonical Data Model

The core construct — or fulcrum, if you will — of the vision is the RDF (Resource Description Framework) data model [3]. I have written elsewhere on the Advantages and Myths of RDF, which explains more precisely the advantages of that model. RDF provides a common data model to which any external format or schema can be converted and represented. It also provides a logic model and basis for building vocabularies that can inform and drive generic tools.

In the context of data interoperability, a critical premise is that a single, canonical data model is highly desirable.

Why?

Simply because of 2N v N². That is, a single reference (“canon”) structure means that fewer tool variants and converters need be developed to talk to the myriad of data formats in the wild. With a canonical data model, talking to external sources and formats (N) only requires converters to and from the canonical form (2N). Without a canonical model, the combinatorial explosion of required format converters becomes N² [4].
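The converter arithmetic can be checked with a small sketch. It assumes that, without a canonical model, every ordered pair of distinct formats needs its own converter, which gives N(N-1), a count on the order of the N² cited above. The function names are illustrative only.

```python
def converters_with_canonical_model(n):
    """Each of n formats needs one converter to and one from the canonical form."""
    return 2 * n


def converters_pairwise(n):
    """Assumption: one converter per ordered pair of distinct formats, n * (n - 1)."""
    return n * (n - 1)


# Compare the two counts for a few format totals, including the roughly
# 150 converters' worth of formats mentioned later in the article.
for n in (3, 10, 150):
    print(n, converters_with_canonical_model(n), converters_pairwise(n))
```

Under this assumption the two approaches break even at three formats. Beyond that the pairwise count grows quadratically while the canonical count grows linearly.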

Note, in general, such a canonical data model merely represents the agreed-upon internal representation. It need not affect data transfer formats. Indeed, in many cases, data systems employ quite different internal data models from what is used for data exchange. Many, in fact, have two or three favored flavors of data exchange such as XML, JSON or the like. More on this is discussed in a section below.

As this diagram shows, then, we have a single internal representation that is the target for all data and format converters and upon which all tools operate. These tools are themselves expressed as Web services so that they may be distributed and conform to general WOA guidelines. In addition, there may be multiple external “hubs” that represent alternative data models or formats or schema conversions (say, for relational databases). So long as we have converters between these alternate “hubs” and our canonical RDF form we can allow a thousand flowers to bloom:

Other canonical forms could be advocated. Yet RDF has the logical basis to represent any data form and any schema or conceptual structure. It is based on a robust set of open standards and languages and tools. It may be serialized in many formats. It can be grounded in description logics and, in appropriate forms, reasoned over and expressed in vocabularies and schema suitable for the most complex of conceptual structures and semantics. RDF is the data model explicitly designed for the Web, the clear global information basis for the foreseeable future.

For more than 30 years — since the widespread adoption of electronic information systems by enterprises — the Holy Grail has been complete, integrated access to all data. With the canonical RDF data model, that promise is now at hand.

Conversion: So Many Structs, So Little Time

Diversity is a truism of human communications as captured by the biblical Tower of Babel and the many thousands of current human languages. Diversity in data formats, serializations, notations and languages is a similar truism. We term the expression of each of these varied forms of data a struct.

While an internal canonical representation of data makes sense for the reasons noted above, pragmatic information systems must recognize the inherent diversity and chaos of data in the real world. Attempts to find single representations or to impose standards by fiat have singularly failed. That will continue to be so due in part to inertia and legacy, sunk investments, existing infrastructure, and the purposes for the data.

In pursuing a vision of data interoperability, then, conversion is an essential glue for cementing understanding with what exists and will exist.

RDB-to-RDF

Arguably the largest source of structured data is enterprise and government information systems, with the predominant data representation being the relational data model managed by relational schema. Much of this data is also cleaner and mission critical compared to other sources in the wild. Fortunately, there are many logical and conceptual affinities between the relational model and the one for RDF [5].

Just as there are many RDFizers for simpler forms of data structs (see next), there are also nice ways to convert relational schema to RDF automatically. Given these overall conceptual and logical affinities the W3C is also in the process of graduating an incubator group to an official work group, RDB2RDF, focused on methods and specifications for mapping relational schema to RDF.
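The relational-to-RDF affinity described in footnote [5] (subject IRI from the primary key, predicate from the column identifier, object from the cell value) can be sketched as follows. This is a minimal illustration and not the RDB2RDF specification. The `http://example.org/` namespace, the `employee` table and its rows are all made up for the example.

```python
import sqlite3

BASE = "http://example.org/"  # hypothetical namespace, not from the article


def table_to_triples(conn, table, key_column):
    """Yield (subject, predicate, object) triples for each non-key, non-NULL cell.

    The table and column names are interpolated into the SQL text, so this
    sketch assumes they come from a trusted source.
    """
    cursor = conn.execute(f"SELECT * FROM {table} ORDER BY {key_column}")
    columns = [description[0] for description in cursor.description]
    for row in cursor:
        record = dict(zip(columns, row))
        # Subject IRI is derived from the primary key.
        subject = f"{BASE}{table}/{record[key_column]}"
        for column, value in record.items():
            if column == key_column or value is None:
                continue
            # Predicate comes from the column name, object from the cell value.
            yield (subject, f"{BASE}{table}#{column}", value)


# Made-up sample data in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT, dept TEXT)")
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?)",
    [(1, "Ada", "Research"), (2, "Lin", None)],
)
triples = list(table_to_triples(conn, "employee", "id"))
for triple in triples:
    print(triple)
```

NULL cells produce no triple here. That is one reasonable choice, since RDF has no direct equivalent of NULL.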

Amongst all techniques covered in this paper, Structured Dynamics views the layering of RDF ontologies over existing relational data stores as one of the most promising and important. Given the advantages of RDF for interoperability, this area should be a major emphasis of current and new vendors and service providers.

RDFizers

Much data, however, resides in much smaller datasets and often for less formal purposes than what is found in enterprise databases. Some of this data is geared for exchange or standardization; much is emerging from Web and Internet applications and uses; and much might be local or personal in nature, such as simple lists or spreadsheets.

RDF is well suited to convert (“RDFize”) these simpler and more naïve data formats. In my original census about 18 months ago, as reported in ‘Structs’: Naïve Data Formats and the ABox, I listed about 90 converters. My most recent update now lists nearly double that number, with about 150 converters [6]:

URN handlers (in addition to IRI and URI):

  • DOI
  • LSID
  • OAI

RDF

  • Serialization formats:
    • N3
    • RDF/XML
    • Turtle
  • Languages and ontologies:
    • AB Meta
    • Annotea
    • APML
    • AtomOWL
    • Bibliographic Ontology
    • Creative Commons
    • EXIF
    • FOAF
    • Java
    • Javadoc
    • MARC/MODS
    • Meta Standards
    • Music Ontology
    • Natural Language
    • Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH)
    • Open Geospatial
    • OWL
    • SIOC
    • SIOCT
    • SKOS
    • UMBEL
    • vCard
    • XML
    • Others
  • (X)HTML pages
  • Embedded Microformats and GRDDL [7]:
    • DC
    • eRDF
    • geoURL
    • Google Base
    • hAudio
    • hCalendar
    • hCard
    • hListing
    • hResume
    • hReview
    • HR-XML
    • Ning
    • RDFa
    • relLicense
    • SVG
    • XBRL
    • XFN
    • xFolk
    • XR-XML
    • XSLT
  • Syndication Formats:
    • Atom
    • OPML
    • OCS
    • RSS 1.1
    • RSS 2.0
    • XBEL (for bookmarks)
  • REST-style Web service APIs:
    • Amazon
    • Apple
    • Calais
    • CrunchBase
    • Del.icio.us
    • Digg
    • Discogs
    • Disqus
    • eBay
    • Facebook
    • Flickr
    • Freebase (MQL)
    • FriendFeed
    • Garmin
    • Get Satisfaction
    • Google
    • Hoover’s
    • HTTP (raw)
    • ISBN DB
    • Last.fm
    • Library Thing
    • Magnolia
    • Meetup
    • MusicBrainz
    • New York Times
    • New York Times Campaign Finance (NYTCF)
    • New York Times tags
    • Open Library
    • Open Social
    • Open Street
    • OpenLink (facets)
    • O’Reilly
    • Picasa
    • Radio Pop (BBC)
    • Rhapsody
    • Salesforce
    • Slideshare
    • Slidy
    • Technorati
    • They Work For You
    • Twine
    • Twitter
    • Weather
    • Wikipedia
    • World Bank
    • Yahoo! Finance
    • Yahoo! Maps
    • Yahoo! Weather
    • YouTube
    • Zemanta
  • Files (multitude of file formats and MIME types, including):

Many of the sources above come from new and emerging Web-based APIs, which are also huge sources of content growth. Also note that alternative formats to RDF (e.g., microformats) or leading serializations and encodings (e.g., XML, JSON) also have many converter options.

For many typical naïve data structs, the data is represented as attribute-value pairs, which easily lend themselves to conversion to RDF as instance records [8]. See further the Authoring section below.
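To make the attribute-value conversion concrete, here is a minimal sketch that turns one JSON record into N-Triples lines, one triple per attribute value. The `http://example.org/` namespace, the record contents and the function names are assumptions for illustration. Every value is emitted as a plain string literal, which a real RDFizer would refine with datatypes and IRIs.

```python
import json

BASE = "http://example.org/"  # hypothetical namespace, not from the article


def escape_literal(text):
    """Escape the characters that N-Triples requires to be escaped in a literal."""
    return (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def record_to_ntriples(record_id, record):
    """Convert one attribute-value record into a list of N-Triples lines.

    Assumes attribute names are simple tokens that are safe inside an IRI.
    A list value yields one triple per item.
    """
    subject = f"<{BASE}record/{record_id}>"
    lines = []
    for attribute, value in record.items():
        predicate = f"<{BASE}attr/{attribute}>"
        values = value if isinstance(value, list) else [value]
        for item in values:
            lines.append(f'{subject} {predicate} "{escape_literal(str(item))}" .')
    return lines


# Made-up sample record.
record = json.loads('{"title": "Sample record", "tag": ["rdf", "json"]}')
for line in record_to_ntriples("1", record):
    print(line)
```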

Tagging: The 80% Solution

An apocryphal statistic is that 80% to 85% of all information resides in unstructured text [9]. Besides lacking recent validation, this claim from a decade ago, often attributed to Merrill Lynch, also predates much of the Internet and the emergence of metadata and tagging. Nevertheless, what is true is that written text content is ubiquitous and the majority of it remains untagged or uncharacterized by any form of metadata.

While such information can be searched, a search only succeeds when exact terms match. This means that related information, particularly in the form of conceptual relationships and inferencing, cannot be applied to untagged text content.

While information extraction (the basis by which tags for entities and concepts can be obtained) has been an active topic of research for two decades, it is only recently that we have begun to see Web-scale extractors appear. Examples include Yahoo’s term extractor, Thomson Reuters’ Calais, or Google Squared, to name but a few.

In Structured Dynamics’ case we have been working on the scones (Subject Concepts Or Named EntitieS) extractor for quite a while. scones uses rather simple natural language processing (NLP) methods as informed by concept ontologies and named entity (instance record) dictionaries to help guide the extraction process. The co-occurrence of matches between concepts and entities also aids the disambiguation task (though additional modules may be invoked with alternative disambiguation methods). In prototype forms, the resulting tags can be managed separately or fed to user interfaces or re-injected back into the original content as RDFa.

There are literally dozens of such extractors and services presently available on the Web and many that are available as open source or commercial products. Some are mostly algorithm based using machine-learning techniques or statistics, while others are gazetteer- or dictionary-driven.
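The dictionary-driven approach, with co-occurrence used for disambiguation, can be sketched in a few lines. This is not the scones implementation, whose internals the article does not show. The toy dictionaries, the identifiers and the scoring rule (count how many of a concept's terms co-occur in the text) are all assumptions for illustration.

```python
import re

# Hypothetical toy dictionaries; real extractors use far larger vocabularies.
ENTITY_DICTIONARY = {
    "jaguar": [("Jaguar_(animal)", "Animal"), ("Jaguar_(car)", "Automobile")],
    "amazon": [("Amazon_River", "River")],
}
CONCEPT_TERMS = {
    "Animal": {"cat", "prey", "jungle"},
    "Automobile": {"engine", "driver", "road"},
    "River": {"water", "basin"},
}


def tag(text):
    """Return (surface term, entity, concept) tags found in the text.

    When a term has several candidate entities, the candidate whose concept
    has the most terms co-occurring in the text wins; ties go to the first
    candidate listed in the dictionary.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    token_set = set(tokens)
    tags = []
    for token in tokens:
        candidates = ENTITY_DICTIONARY.get(token)
        if not candidates:
            continue
        best = max(candidates, key=lambda c: len(CONCEPT_TERMS[c[1]] & token_set))
        tags.append((token, best[0], best[1]))
    return tags


print(tag("The jaguar stalked its prey in the jungle near the Amazon."))
print(tag("The jaguar has a powerful engine."))
```

The same surface term resolves to different entities in the two sentences because different concept terms co-occur with it.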

These systems will lead to rapid tagging of existing content and the removal of some of the early “chicken-and-egg” challenges associated with the semantic Web. These systems will also be combined with the many existing bookmarking and tagging services.

So, just as we will see federation and interoperability of conventional data, we will also see linkages to relevant and supporting text content accompanying it. This combination, in turn, will also lead to richer browsing and discovery experiences.

Authoring: The Neglected Third Leg of the Stool

In addition to conversion and tagging, authoring is the third leg of the stool for exposing structured data. It is the neglected leg of the structured content stool, yet an important one for making datasets easy to expose as RDF linked data.

One of the reasons for the proliferation of data structs has been the interest in finding notations and conventions for easier reading and authoring of small datasets. There have literally been hundreds of various formats proposed over decades for conveying lightweight data structures. Most have been proprietary or limited to specific domains or users. Some, such as fielded text, structured text, simple declarative language (SDL), or more recently YAML or its simpler cousin JSON, have become more widely adopted and supported by formal specifications, tools or APIs. JSON, especially, is a preferred form for Web 2.0 applications.

What has been less clear or intuitive in these forms, again mostly based on an attribute-value pair orientation, is how to adequately relate them to a more capable data model, such as RDF. In JSON or YAML, for example, the notations include the concepts of objects, arrays and datatypes (among other conventions). Other structures lack even these constructs.

To take the case of JSON as might be related to RDF, there are a couple of efforts to define representation conventions from Talis and GBV for serializing RDF. There was a floated idea for an RDF version of JSON called RDFON that has now evolved into the TURF approach. JDIL (JSON data integration layer) instructs how to add namespaces to JSON to enable encoding RDF. Jim Ley, Kanzaki Masahide and Dave Beckett (likely among others) have written simple and straightforward RDF and Turtle parsers and converters for JSON. Still further examples are Beckett’s Triplr and Sören Auer’s AKSW Triplify, lightweight conversion services involving many different formats.

Because JSON is easily readable, can drive many Web 2.0 applications and widgets, and lends itself to fast conversions and tools in various scripting languages, Structured Dynamics was commissioned by the Bibliographic Knowledge Network (BKN) to formalize a BibJSON specification suitable for BibTeX-like data records and citations with an extensible schema to be converted to RDF.

The emerging result of that BibJSON effort will be published shortly. The specification includes conventions and vocabularies for creating bibliographic and citation instance records, for specifying structural schema, and for creating linkage files between the attributes in the record files with existing and new schema. BibJSON is itself grounded in IRON, which is an instance record and object notation developed by Structured Dynamics that can be serialized as JSON (called irJSON), XML (called irXML) or comma-separated values (or CSV comma-delimited files, called commON).

The purpose of these notations and serializations is to provide easier authoring environments and scripting support to RDF-ready datasets. This approach has the advantage of shielding most users from the nuances or lengthiness of RDF (though the N3 serialization also works well).

The design and development of commON was especially geared to using spreadsheets as authoring environments that would enable easy creation of instance record tables or simple hierarchical or outline structures. For example, here is a sample portion of Sweet Tools specified in a spreadsheet using the commON notation:

Once the philosophy and role of naïve data structs is embraced — with an appreciation of the many converters now available or easily written for translating to RDF — it becomes easier to determine data forms appropriate to the tools and natural work flow of the users and tasks at hand. Under this mindset, the role of RDF is to be the eventual conversion target, but not necessarily what is used for intermediate work tasks, and in particular not for authoring.

Getting it All Organized

OK, so now all of this stuff is converted, tagged or authored. How does it relate? What is the relation of one dataset to another dataset? Is there a context or framework for laying out these conceptual roadmaps?

Two years ago as we looked at the state of RDF and the incipient semantic Web as promised via linked data, we saw that such a specific framework was lacking. (Though there were existing higher-level ontologies, either their complexity or their design was not well-suited to these purposes.) It was at that time that Frédérick Giasson and I began to formulate the UMBEL (Upper Mapping and Binding Exchange Layer) ontology, which eventually led to our more formal business partnership and Structured Dynamics.

What we sought to achieve with UMBEL was a coherent reference framework of about 20,000 subject concepts, connected and acting like constellations in the information sky for orienting content and new datasets. At the same time, we wanted to create a general vocabulary and approach that would lend themselves to creation of domain-specific ontologies, which would also naturally tie in and inter-relate to the more general UMBEL structure.

This objective was achieved, though UMBEL deserves an upgrade to OWL 2 and some other pending improvements. A number of domain ontologies have been created and now relate to UMBEL. So, rather than being an end in itself, UMBEL was one of the necessary infrastructure pieces to help make the vision herein a reality.

Similar approaches may be taken by others with new domain ontologies based on the UMBEL vocabulary with tie-in as appropriate to existing subject concepts, or by mapping to the existing UMBEL structure.

Of course, UMBEL is not an absolute condition to the vision herein. However, insofar as users desire to see multiple datasets inter-related, including the use of existing public Web data, something akin to UMBEL and related domain ontologies will be necessary to provide a similar roadmap.

Making it All Available

The parts and techniques discussed so far pertain almost exclusively to data and content. But the structures so created can now inform data-driven applications, which must also be deployed. To do so, Structured Dynamics is committed to what is known as a Web-oriented architecture (WOA):

WOA = SOA + WWW + REST

WOA is a subset of the service-oriented architectural style, wherein discrete functions are packaged into modular and shareable elements (“services”) that are made available in a distributed and loosely coupled manner. WOA generally uses the representational state transfer (REST) architectural style defined by Roy Fielding in his 2000 doctoral thesis; Fielding is also one of the principal authors of the Hypertext Transfer Protocol (HTTP) specification.

REST provides principles for how resources are defined and used and addressed with simple interfaces without additional messaging layers such as SOAP or RPC. The principles are couched within the framework of a generalized architectural style and are not limited to the Web, though they are a foundation to it.

Within this design we need a suite of generic functions and tools that are driven by the structure of the available datasets. The deployment vehicle and design we have implemented to provide this WOA design is structWSF [10].

structWSF is a platform-independent Web services framework for accessing and exposing structured RDF data. Its central organizing perspective is that of the dataset. These datasets contain instance records, with the structural relationships amongst the data and their attributes and concepts defined via ontologies (schema with accompanying vocabularies). The master or controlling Web service in the framework is the module for granting access and use rights to datasets based on permissions.

The structWSF middleware framework is generally RESTful in design and is based on HTTP and Web protocols and open standards. The initial structWSF framework comes packaged with a baseline set of about a dozen Web services covering CRUD, browse, search, export and import. More services can readily be added to the system.

All Web services are exposed via APIs and SPARQL endpoints. Each request to an individual Web service returns an HTTP status and a document of resultsets (if the query result is not null). Each results document can be serialized in many ways, and may be expressed as either RDF or pure XML.
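As an illustration of what querying a SPARQL endpoint over HTTP can look like, the sketch below builds (but does not send) a GET request in the style of the W3C SPARQL Protocol, where the query text travels in a `query` parameter. The endpoint URL is hypothetical, and the actual structWSF API is not shown in this article, so nothing here should be read as its interface.

```python
from urllib.parse import parse_qs, urlencode, urlsplit
from urllib.request import Request

ENDPOINT = "http://example.org/sparql"  # hypothetical endpoint URL

QUERY = """
SELECT ?s ?p ?o
WHERE { ?s ?p ?o }
LIMIT 10
"""


def build_sparql_request(endpoint, query):
    """Build a SPARQL Protocol GET request that asks for XML results."""
    url = f"{endpoint}?{urlencode({'query': query})}"
    return Request(url, headers={"Accept": "application/sparql-results+xml"})


request = build_sparql_request(ENDPOINT, QUERY)
print(request.full_url)

# Round-trip check: the encoded parameter decodes back to the original query.
decoded = parse_qs(urlsplit(request.full_url).query)["query"][0]
print(decoded == QUERY)

# Sending the request would use urllib.request.urlopen(request); that step is
# omitted because the endpoint above is only a placeholder.
```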

In initial release, structWSF has direct interfaces to the Virtuoso RDF triple store (via ODBC, and later HTTP) and the Solr faceted, full-text search engine (via HTTP). However, structWSF has been designed to be fully platform-independent. The framework is open source (Apache 2 license) and designed for extensibility.

No End in Sight

As with all visions, many aspects and many improvements are possible. This vision is definitely a work-in-progress with no end in sight.

But, meaningful movement embracing the full scope of this vision is doable today. Structured Dynamics welcomes inquiries regarding any of these aspects, improvements to them, or application to your specific needs and problems.

We also welcome you to come back and visit our blogs (Fred’s is found here). We try to speak on various aspects of this vision in all of our posts and are pleased to share our experience and insights as gained.


[1] Metcalfe’s law states that the value of a telecommunications network is proportional to the square of the number of users of the system (n²), where the linkages between users (nodes) exist by definition. For information bases, the data objects are the nodes. Linked data works to add the connections between the nodes. We can thus modify the original sense to become the Linked Data Law: the value of a linked data network is proportional to the square of the number of links between the data objects. I first presented this formulation about a year ago in What is Linked Data?
[2] This piece introduces for the first time a couple of efforts-in-progress by Structured Dynamics. For a general tools listing, see my own Sweet Tools listing of about 800 semantic Web and -related tools.
[3] As quoted in The Lever, “Archimedes, however, in writing to King Hiero, whose friend and near relation he was, had stated that given the force, any given weight might be moved, and even boasted, we are told, relying on the strength of demonstration, that if there were another earth, by going into it he could remove this.” from Plutarch (c. 45-120 AD) in the Life of Marcellus, as translated by John Dryden (1631-1700).
[4] The canonical data model is especially prevalent in enterprise application integration. An interesting animated visualization of the canonical data model may be found at: http://soa-eda.blogspot.com/2008/03/canonical-data-model-visualized.html.
[5] An excellent piece on those relations was written by Andrew Newman a bit over a year ago; see Andrew Newman, 2007. “A Relational View of the Semantic Web,” published on XML.com, March 14, 2007; http://www.xml.com/pub/a/2007/03/14/a-relational-view-of-the-semantic-web.html. RDF can be modeled relationally as a single table with three columns corresponding to the subject-predicate-object triple. Conversely, a relational table can be modeled in RDF with the subject IRI derived from the primary key or a blank node; the predicate from the column identifier; and the object from the cell value. Because of these affinities, it is also possible to store RDF data models in existing relational databases. (In fact, most RDF “triple stores” are RDBM systems with a tweak, sometimes as “quad stores” where the fourth tuple is the graph.) Moreover, these affinities also mean that RDF stored in this manner can also take advantage of the historical learnings around RDBMS and SQL query optimizations.
[6] The largest source for RDFizers, which it calls Sponger cartridges, is from OpenLink Software in relation to its Virtuoso universal server. Most of its converters use XSLT stylesheets to translate to RDF, but the system has other conversion capabilities as well. Two additional OpenLink resources are a clickable diagram of converters and relationships with links and an online storehouse of available XSLT converters. In addition, two other sources — the W3C’s Semantic Web wiki with converter listings and MIT’s Simile program and listing of RDFizers — have a rich set of listings. Note that many of the categories shown on the table also have multiple sources of converters, so that the absolute number of converters has also grown faster than the unique formats supported.
[7] GRDDL (Gleaning Resource Descriptions from Dialects of Languages) is a W3C markup format for getting RDF data out of XML and XHTML documents using explicitly associated transformation algorithms, typically represented in XSLT. GRDDL accommodates a wide variety of dialects (see one listing) and can be combined with arbitrary transformation mechanisms (though currently mostly based on XSLTs).
[8] We characterize instance records as representing the “ABox”, in accordance with our working definition for description logics:

“Description logics and their semantics traditionally split concepts and their relationships from the different treatment of instances and their attributes and roles, expressed as fact assertions. The concept split is known as the TBox (for terminological knowledge, the basis for T in TBox) and represents the schema or taxonomy of the domain at hand. The TBox is the structural and intensional component of conceptual relationships. The second split of instances is known as the ABox (for assertions, the basis for A in ABox) and describes the attributes of instances (and individuals), the roles between instances, and other assertions about instances regarding their class membership with the TBox concepts.”
[9] One of the more recent discussions of this percentage is by Seth Grimes, Unstructured Data and the 80 Percent Rule, 2009.
[10] structWSF is also designed to integrate with third-party apps and content management systems (CMSs) to provide the user interfaces to these functions. The first implementation of this design is conStruct SCS, a structured content system that extends the basic Drupal content management framework. conStruct enables structured data and its controlling vocabularies (ontologies) to drive applications and user interfaces.

 
 
