My Technical Notes & Others...: Import

Sharing some of the experience behind creating the Find-Word blog series.

To create those blogs, I first tried the Google APIs, but unfortunately they are not well suited to standalone Java applications. The authentication mechanism is overkill for what is really needed, and it does not seem to have been tested properly with standalone applications. I moved to Google's older Data APIs, and they worked like a charm. The only issue was the limit on the number of authorized posts per day (50, after which a captcha is displayed). I finally decided to reverse engineer Blogger's import/export functionality.

About importing posts:

One can split posts into multiple import files, and load them one by one.
Blogger won't be able to load files bigger than 30 MB (it will crash).
Once loaded, the posts have the imported status.
One can publish at most 1800 posts the first day, then about 500 the next day.
Crossing those limits will flag your blog as potential spam.
Once the blog is flagged, one can no longer publish posts, and a review request must be submitted; otherwise the blog will be deleted.
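
The publishing limits above can be turned into a simple day-by-day plan. The following is only a minimal sketch: the class and method names are hypothetical, the two limits are the figures observed in these notes (not documented Blogger quotas), and it assumes the limit of about 500 applies to every day after the first.

```java
import java.util.ArrayList;
import java.util.List;

public class PublishPlanner {

    // Figures observed in the notes above; assumptions, not official quotas.
    static final int FIRST_DAY_LIMIT = 1800;
    static final int NEXT_DAYS_LIMIT = 500; // assumed to hold for every later day

    /** Returns how many posts to publish on each day without crossing a daily limit. */
    static List<Integer> plan(int totalPosts) {
        List<Integer> perDay = new ArrayList<>();
        int remaining = totalPosts;
        int limit = FIRST_DAY_LIMIT;
        while (remaining > 0) {
            int today = Math.min(remaining, limit);
            perDay.add(today);
            remaining -= today;
            limit = NEXT_DAYS_LIMIT; // every day after the first uses the lower limit
        }
        return perDay;
    }

    public static void main(String[] args) {
        // Prints the number of posts to publish per day for 3000 imported posts
        System.out.println(plan(3000));
    }
}
```

Since the second figure is only approximate, it may be safer to lower NEXT_DAYS_LIMIT a little to keep a margin.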

About indexing:

Of course, it is not possible to force Google to index all blog posts. One can only get Google to crawl the pages; Google then decides what goes into the index.
The default Blogger sitemap only sees the last 26 posts. So, unless there are external or internal links to the remaining posts, Google will not be able to access them. They won't be crawled or indexed.
The solution is to add atom sitemaps (for example: /atom.xml?redirect=false&start-index=1&max-results=500 for the first 500 posts) using Google Webmaster's Optimization > Sitemaps page. Multiple atom sitemaps must be added if necessary.

If your blog posts are lengthy and Google can't process your sitemaps, configure your blog to allow short blog feeds in Settings > Other. If your blog does not allow feeds at all, Google will not be able to process the sitemaps either!
The Sitemaps page will tell you whether sitemaps have been processed successfully.
Google Webmaster has a page indicating which links have been indexed in Health > Index Status. It is updated about once per week, which is too slow for useful feedback.
The solution is to use the site:myblog.blogspot.com command from the Google search page to get an estimate of how many pages of your blog Google knows about. It also tells you whether crawling is succeeding.
Google Webmaster has a Crawl Stats page which tells you (with a two-day delay) how many pages it has crawled per day. This is also a good indicator of the reachability of posts.
Be patient!
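
For a blog with many posts, the list of atom sitemap paths mentioned above can be generated rather than typed by hand. This is a minimal sketch, not a Blogger tool: the class and method names are hypothetical, and it assumes that the following blocks simply continue the paging of the example path (start-index=501, 1001, and so on, with max-results=500).

```java
import java.util.ArrayList;
import java.util.List;

public class AtomSitemaps {

    // Block size taken from the example path in the notes above
    static final int PAGE_SIZE = 500;

    /** Builds one atom sitemap path per block of PAGE_SIZE posts (start-index is 1-based). */
    static List<String> sitemapPaths(int totalPosts) {
        List<String> paths = new ArrayList<>();
        for (int start = 1; start <= totalPosts; start += PAGE_SIZE) {
            paths.add("/atom.xml?redirect=false&start-index=" + start
                    + "&max-results=" + PAGE_SIZE);
        }
        return paths;
    }

    public static void main(String[] args) {
        // Prints one path per line for a blog with 1200 posts
        for (String path : sitemapPaths(1200)) {
            System.out.println(path);
        }
    }
}
```

Each printed path is then added as a separate sitemap in the Sitemaps page.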

That's it !!!

The @Configuration Spring annotation allows one to perform Spring configuration from Java (bean declaration, for example). It is possible to implement a Spring application without any XML configuration (see here). It is even possible to get rid of the web application's web.xml file.

However, some Spring modules, such as the security module, still require plain XML configuration. Some applications may also rely on legacy code that has not yet been converted to Java configuration.

In this case, one needs to mix Spring XML and Java configuration. This can be achieved with the @ImportResource Spring annotation:

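import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.context.annotation.ImportResource;

// Java configuration that also pulls in bean definitions from XML files
@Configuration
@ImportResource({"classpath:/WEB-INF/spring-security.xml",
        "classpath:/WEB-INF/legacy-config.xml"})
public class Config {

    // Bean declared in Java; its name defaults to the method name
    @Bean
    MyBean myBean() {
        return new MyBean();
    }

}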

The above imports two XML files and declares a bean using Java.

More Spring related posts here.

Thursday, 27 December 2012

How To Import/Index Large Blogs On Blogger?

Saturday, 15 September 2012

How To Mix Spring XML And Java Configuration?