Help with programmatic contentDescriptor creation and subsequent tasks
Hi, I've got a couple of questions regarding the programmatic creation of contentDescriptors; hopefully you can give me some pointers.
I've written a Jython task (see below) to create a (very simple) contentDescriptor based on one of the examples in the scripting guide. This runs successfully, and when I query for 'all content' in the project, the content is returned. However I don't seem to be able to do anything else with it. The normal process up to importing the content in subprojects seems to be
Web crawl -> Extract features -> Classify
I'm trying to replace the Web crawl task with one identifying content programmatically. I've created a pipeline consisting of
Custom create contentDescriptor task -> Batch Preparation -> Extract Features -> Classify
The createContent task executes ok, as does the batch preparation (according to the server log), but the following two tasks both fail because I "Attempted to create Per Object task against 'null' batch.".
Is it possible to use programmatically created content descriptors directly in place of a web crawl, and, if so, which steps am I missing? Hope you can help, thanks for your time.
Ben
SCRIPT:
    import sys
    sys.add_package("java.util")
    from java.util import HashMap
    from java.util import HashSet

    # add packages from Java into the python module library
    sys.add_package("org.python.core")
    sys.add_package("com.vamosa.utils")
    from com.vamosa.utils import DOM4JUtils

    def enhanceProject( project ):
        map = HashMap()
        map.put("Custom Metadata.CreatedInJython", "true")
        set = HashSet()
        set.add("http://prueba.vamosa.com/outbound.htm")
        contentManager.storeContent('http://anotherurl.com/', '<?xml version="1.0" encoding="UTF-8"?><html><body>Hello World<img src="somePic.gif"/></body></html>', project.id, map, set)
        pass
Further help...
Hi again, thanks for responding so quickly. I added the Content-Type metadata when creating the content, and the feature extract and classifier can now see it. So thanks! However I've come across a couple of other subsequent errors... Both the above tasks seem to throw errors and fail concerning closing brackets. In the content created in the script given in the original post, the tasks complain in the following way:
10:33:42,491 INFO [STDOUT] [Fatal Error] :2:43: The element type "img" must be terminated by the matching end-tag "</img>".
10:33:42,491 INFO [ProcessEngineManagerImpl] Result [Code: -1 LowLevelMessage: Unable to transform string into document representation: The element type "img" must be terminated by the matching end-tag "</img>". Message: Unable to classify http://anotherurl.com/ PayLoad: null]
As you can see from the script, the img tag IS closed. I tried putting in a </img> tag instead of />, but it still failed for the same reason. Something similar also occurred when I tried to add some different content containing a meta tag: it failed on the meta, saying it wasn't closed, when it was. Do you have any idea why this could be?
To avoid that problem I simply took the img tag out of the html in my original script, and it works fine. The 'find all html content' query returns the item of content when run on the Main Project and the Placeholder subproject. However, I get no results when it's run against the Content subproject. I've looked through the walkthrough again, and I can't see why the content has been excluded in the case of my project. Is it because it's such a simple page? Classification for it came back as false, but this didn't seem to stop content identified by web crawl in the Prueba walkthrough from coming up when the query's run against the walkthrough's Content subproject.
As the query's not showing anything under my project's Content subproject, running the Import Content task on it produces another 'null' batch error, understandably. Can you tell me where I'm going wrong? By the looks of it the query SHOULD be returning it...
Thanks again for your help. I'm using v2.11 by the way.
Thanks,
Ben
Replies not showing?
Hi, sorry, I've tried to reply to this, but nothing is showing up. Is there a problem, or am I doing something wrong?
Ben

Feature Extract and 'Tidied' content
Hi Ben,
When content from the VCM repository is passed through the Feature Extract system task, it is presumed that the content has already been passed through HTML tidy and is valid XHTML. Such content carries the XHTML namespace on the HTML element. The content that you are attempting to pass through Feature Extract has not been through HTML tidy and so does not have the XHTML namespace on the HTML element. For Feature Extract to work as expected the content should contain this namespace. If you add the namespace to the HTML element you should no longer have these issues:
<html xmlns="http://www.w3.org/1999/xhtml">
In future, if you plan to create content descriptors programmatically, it is good practice to pass the content through an HTML tidy function before storing it. The web crawl system task does this automatically in VCM 2.11.
Ross
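For content that is already well-formed XML, Ross's suggestion of adding the XHTML namespace can be sketched as below. This is plain Python using the standard library's ElementTree, not a VCM API, and it has not been tried inside a VCM Jython task; the function name add_xhtml_namespace is made up for illustration.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"


def add_xhtml_namespace(markup):
    """Return markup with every un-namespaced element placed in XHTML_NS.

    Hypothetical helper, not part of VCM. Raises ET.ParseError if the
    markup is not well-formed XML, so it is not a substitute for a tidy step.
    """
    # Parse from bytes so an XML declaration with an encoding is accepted.
    root = ET.fromstring(markup.encode("utf-8"))
    for element in root.iter():
        if isinstance(element.tag, str) and not element.tag.startswith("{"):
            element.tag = "{%s}%s" % (XHTML_NS, element.tag)
    # An empty prefix makes the serializer emit xmlns="..." on the root.
    ET.register_namespace("", XHTML_NS)
    return ET.tostring(root, encoding="unicode")


sample = '<html><body>Hello World<img src="somePic.gif"/></body></html>'
print(add_xhtml_namespace(sample))
```

The returned string is what would then be handed to contentManager.storeContent in place of the original markup.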
HTML Tidy function
Thanks Ross. I'll put that namespace line in as you suggest and give it a go. Is there an HTML tidy task/function already available in v2.11, so I can add it to a pipeline, or do I have to write one myself? Any pointers to examples would be really helpful.
Thanks again,
Ben

No system tidy in 2.11
Hi Ben,
There is no system tidy task available that can be added to pipelines within v2.11. This functionality is locked down in the Web Crawl system task. If you want to create content descriptors by a method other than the system Web Crawl, you will need to write your own tidy function and pass the content through it before storing it in the VCM repository. The main purpose of the tidy function should be to produce XHTML content. The HTML tidy API can be found here
If the content that you plan to store is already XHTML, there is no need to pass it through a tidy function, but be aware that the XHTML namespace should be present in the content if you plan to use the Feature Extract and Classify system tasks.
Ross
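As a rough illustration of the "write your own tidy function" advice from Ross, here is a very small sketch in plain Python. It is not HTML Tidy and not a VCM API: it only self-closes void elements such as img and meta, quotes attributes, escapes text, and adds the XHTML namespace to the html element. It does not repair missing or misnested end tags, and it drops comments, doctypes and processing instructions. The names XhtmlWriter and to_xhtml are invented for this example.

```python
from html import escape
from html.parser import HTMLParser

XHTML_NS = "http://www.w3.org/1999/xhtml"
# Elements HTML allows without an end tag; XHTML needs them self-closed.
VOID_ELEMENTS = {"area", "base", "br", "col", "hr", "img",
                 "input", "link", "meta", "param"}


class XhtmlWriter(HTMLParser):
    """Re-serialise parsed HTML as XML-style markup (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def _open_tag(self, tag, attrs, self_close):
        attrs = list(attrs)
        if tag == "html" and not any(name == "xmlns" for name, _ in attrs):
            attrs.append(("xmlns", XHTML_NS))
        rendered = "".join(
            # Value-less attributes get their own name as the value.
            ' %s="%s"' % (name, escape(value if value is not None else name))
            for name, value in attrs
        )
        self.parts.append("<%s%s%s>" % (tag, rendered, " /" if self_close else ""))

    def handle_starttag(self, tag, attrs):
        self._open_tag(tag, attrs, tag in VOID_ELEMENTS)

    def handle_startendtag(self, tag, attrs):
        self._open_tag(tag, attrs, True)

    def handle_endtag(self, tag):
        # Void elements were already self-closed when opened.
        if tag not in VOID_ELEMENTS:
            self.parts.append("</%s>" % tag)

    def handle_data(self, data):
        self.parts.append(escape(data, quote=False))


def to_xhtml(markup):
    writer = XhtmlWriter()
    writer.feed(markup)
    writer.close()
    return "".join(writer.parts)


print(to_xhtml('<html><body>Hello World<img src="somePic.gif"></body></html>'))
```

For real-world pages a full tidy library is the safer choice; this only shows the shape of the step that would sit between fetching the markup and storing it.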
Re: Help with programmatic contentDescriptor creation and subseq
Hi Ben,
Yes, it is possible, and it should work fine the way you are doing it with a few tweaks. The first thing to check is that the query being used for batch preparation will include your newly created content. If you have not set any batchPrepare.query or batchPrepare.queryLibrary project properties, then it will use the default query for the type of project you are using. You don't say what version of the software you are using, but generally for the master project, placeholder and content sub projects it will use "find all html content", which is in the Main Query Library. For the assets sub project it will use "find all non-html content". These queries are based on the "Identify Metadata.Content-Type" metadata field, so if you are not populating it, they will not return your new data. You can either supply this field in the metadata or create your own query and supply that to batch prepare through the project properties.
thanks,
Stewart.
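To make Stewart's point concrete, the metadata passed to storeContent needs the "Identify Metadata.Content-Type" field alongside any custom fields. The sketch below is plain Python and only builds the key/value pairs; the value "text/html" is an assumption (the thread names the field but not the value the default query expects), and build_metadata is a made-up helper name.

```python
CONTENT_TYPE_FIELD = "Identify Metadata.Content-Type"


def build_metadata(content_type="text/html"):
    """Return the metadata pairs for a programmatically created descriptor.

    Assumption: "text/html" is what the default "find all html content"
    query matches on; check the query in the Main Query Library to confirm.
    """
    return {
        "Custom Metadata.CreatedInJython": "true",
        CONTENT_TYPE_FIELD: content_type,
    }


for key, value in sorted(build_metadata().items()):
    print("%s = %s" % (key, value))
```

In the Jython task itself, each pair would be added to the HashMap with map.put(key, value), as in Ben's original script, before calling contentManager.storeContent.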